Lecture 001
Stanford CS234 Reinforcement Learning I Introduction to Reinforcement Learning I 2024 I Lecture 1
Source: https://www.youtube.com/watch?v=WsvFL-LjA6U
---
Transcript
[00:00:05] Hi everyone, we're going to go ahead and get started. I'm Emma Brunskill, and I'm delighted to welcome you to Reinforcement Learning, CS234. This is a brief overview of the class and what we're going to be covering today. I just want to start by noting that probably everyone has heard of reinforcement learning these days, and that wasn't true about 10 or 15 years ago. But you can describe what is happening in reinforcement learning with a pretty simple statement, which is the idea of an automated agent learning through experience to make good decisions. Now, that's a pretty simple statement to say. It encapsulates a lot of what my lab and many, many others have been trying to work on for the last 10 to 15 years, but it's deceptively simple, because it involves a lot of different, really challenging and important things.
[00:00:54] So the first is that any general agenda to try to achieve general artificial intelligence has to include the ability to make decisions. There has been absolutely enormous progress in what we would call perceptual machine learning: things like being able to perceive faces or cats, or identify cars. We often call that perceptual machine learning because it focuses on trying to, say, identify something. But of course, in reality, what we're all trying to do all the time is also to make decisions based on our perception and on the information we're receiving. And so it's critical, if we think about what it means to be intelligent, to consider how to make decisions, and not just any decisions, but what it means to make good decisions.

[00:01:39] This question of how we can learn to make decisions, particularly when faced with uncertainty and limited data, has been a central question that people have been thinking about at least since the 1950s, pioneered in particular by the ideas of Richard Bellman. We'll hear a lot more about Bellman's equation, which many of you might have seen before, later on, even in this lecture or the next lecture.
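For reference, the equation being alluded to is usually written in a form like the following (using notation the course defines formally later: V* for the optimal value of a state, r for reward, γ for a discount factor, and P for transition probabilities):

```latex
V^{*}(s) \;=\; \max_{a}\Big[\, r(s,a) \;+\; \gamma \sum_{s'} P(s' \mid s, a)\, V^{*}(s') \,\Big]
```

In words: the best achievable value of a state is the best immediate reward plus the discounted value of wherever you end up next.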
[00:02:02] later even um in this lecture or next lecture so there's one sort of argument
[00:02:05] lecture so there's one sort of argument for studying reinforcement learning um
[00:02:07] for studying reinforcement learning um which is it's an essential part of
[00:02:08] which is it's an essential part of intelligence it has to be part of a
[00:02:10] intelligence it has to be part of a general agenda of artificial
[00:02:12] general agenda of artificial intelligence and so we should study it
[00:02:14] intelligence and so we should study it to try to understand what it means to be
[00:02:16] to try to understand what it means to be intelligent and that certainly for me is
[00:02:18] intelligent and that certainly for me is one of the really big motivations is to
[00:02:20] one of the really big motivations is to I think there's just a lot of
[00:02:21] I think there's just a lot of fundamental questions about what is the
[00:02:23] fundamental questions about what is the data needed to learn to make good
[00:02:25] data needed to learn to make good decisions but there's another really
[00:02:27] decisions but there's another really good motivation to study reinforcement
[00:02:29] good motivation to study reinforcement learning which is but it's practical and
[00:02:30] learning which is but it's practical and it allows us to solve problems we'd like
[00:02:32] it allows us to solve problems we'd like to
[00:02:33] to solve so in particular over the last
[00:02:36] solve so in particular over the last roughly decade there started to be a lot
[00:02:38] roughly decade there started to be a lot of really impressive successes of using
[00:02:40] of really impressive successes of using reinforcement learning to tackle
[00:02:42] reinforcement learning to tackle problems or to get unprecedented
[00:02:44] problems or to get unprecedented performance in a lot of really important
[00:02:47] performance in a lot of really important domains so the first one is the board
[00:02:49] domains so the first one is the board game go so who here plays
[00:02:52] game go so who here plays go okay a few people maybe maybe not you
[00:02:55] go okay a few people maybe maybe not you can talk to the people that raise their
[00:02:56] can talk to the people that raise their hands so it's an incredibly popular
[00:02:58] hands so it's an incredibly popular board game it's also an incredi
[00:03:00] board game it's also an incredi hardboard game it's far harder than
[00:03:02] hardboard game it's far harder than chess um and it was considered a really
[00:03:04] chess um and it was considered a really long outstanding question of artificial
[00:03:06] long outstanding question of artificial intelligence but roughly like I guess
[00:03:09] intelligence but roughly like I guess about eight years ago now eight to nine
[00:03:11] about eight years ago now eight to nine years ago um there was a team at Deep
[00:03:13] years ago um there was a team at Deep Mind which was still a fairly small
[00:03:14] Mind which was still a fairly small organization at that point that thought
[00:03:17] organization at that point that thought that they could make significant Headway
[00:03:19] that they could make significant Headway at teaching AI agents to be able to play
[00:03:22] at teaching AI agents to be able to play go and the idea in this case is that
[00:03:25] go and the idea in this case is that we're going to combine between the ideas
[00:03:27] we're going to combine between the ideas of reinforcement learning and Monte
[00:03:29] of reinforcement learning and Monte Carlo research which is something we're
[00:03:31] Carlo research which is something we're going to hear about later in this class
[00:03:33] going to hear about later in this class um to create a system that played go
[00:03:35] um to create a system that played go better than any humans in the world and
[00:03:38] better than any humans in the world and so there's even a movie now about sort
[00:03:40] so there's even a movie now about sort of one of the seminal um games in that
[00:03:43] of one of the seminal um games in that sort of endeavor and how humans felt
[00:03:45] sort of endeavor and how humans felt about that and and how the creators of
[00:03:47] about that and and how the creators of the AI systems felt about that but this
[00:03:50] the AI systems felt about that but this uh feat was achieved far earlier than
[00:03:52] uh feat was achieved far earlier than what people expected and one of the key
[00:03:54] what people expected and one of the key reasons for that was the was using
[00:03:56] reasons for that was the was using reinforcement learning
[00:04:00] Another really interesting place where we've seen progress using reinforcement learning to tackle incredible challenges is fusion science. Fusion is a potential approach for tackling the huge energy issues that we have and transitioning to more sustainable options. One of the challenges here, and I'm not a fusion expert, is to manipulate and control things within a vessel. The reinforcement learning question in this case is how to command the coil controllers in order to manipulate this into different types of shapes. This was a Nature paper from two years ago, where they showed you could use deep reinforcement learning techniques to accomplish this in a way that was far more flexible than had previously been imagined.
[00:04:46] One of my favorite examples of the applications of reinforcement learning comes from a pretty recent, important case, which is COVID testing. This was a system deployed in Greece. They had limited resources, and they were trying to understand who they should test in order to help control the epidemic, because, as many of you may know, there's a lot of free movement within Europe, and there were a lot of transitions, and they were trying to think about how to leverage their resources in a data-driven way, because of course the epidemic was changing. This was a beautiful paper by a Stanford graduate, Hamsa Bastani, who is a professor over at Penn now, and her colleagues, who used reinforcement learning to do this really quickly, and it was deployed: Greece used it for their testing at the border.
[00:05:32] But perhaps the most famous recent example is ChatGPT. As many of you might know, natural language processing has had incredible successes over the last decade, and there was a lot of work on using Transformers to make really capable natural language systems. But until around, I guess, a year and a half ago, most of that work was not known to the broader public. So even though we were getting these amazing advances in natural language processing, it wasn't yet at the state where everybody was using it. The key idea of ChatGPT was to use reinforcement learning to create vastly more capable systems.
[00:06:13] I like to talk about ChatGPT not just because it's perhaps the most well-known success of reinforcement learning, but also because it exhibits a lot of the different technical challenges and questions that we're going to be covering in this class. So let's walk through, at a very high level, how the ChatGPT system works in terms of training. The first thing it does is what we would probably call behavior cloning, or imitation learning; we'll be covering that in this class, and we'll be talking more about it even in this lecture. So what did it do? Just to remind you, I suspect everybody in this class probably uses ChatGPT, probably multiple times a day, or Claude or Gemini or one of the other large language model systems. But just in case you have not: the idea is that you might have some prompt or task you want your language system to do, like "explain reinforcement learning to a six-year-old," and then someone gives a response, like "we give treats and punishments to teach," etc. You can try this out with ChatGPT and see how well you think it explains it. That was treated as a direct supervised learning problem: take that input and produce that output. We will also call that imitation learning, or behavior cloning, in this class, and we'll talk about why. So that was the first step, and this is what people had been doing in natural language processing; the systems were good, but they weren't that good.
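The first stage above, treating demonstration pairs as a straightforward supervised learning problem, can be sketched in miniature. This is an illustrative toy, not the actual ChatGPT pipeline: states and actions here are small integers standing in for prompts and responses, and the softmax policy is trained by maximizing the log-likelihood of the demonstrated actions.

```python
import numpy as np

# Behavior cloning / imitation learning, in miniature: fit a policy
# pi(action | state) to demonstration pairs by maximizing log-likelihood.
# (Toy stand-in for supervised fine-tuning on prompt -> response data.)

n_states, n_actions = 4, 3
logits = np.zeros((n_states, n_actions))  # policy parameters

# Demonstrations: in each state, the "expert" always picks action s % 3.
demos = [(s, s % n_actions) for s in range(n_states) for _ in range(20)]

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

lr = 0.5
for _ in range(200):
    for s, a in demos:
        p = softmax(logits[s])
        grad = -p          # gradient of log pi(a|s) w.r.t. logits[s]
        grad[a] += 1.0
        logits[s] += lr * grad / len(demos)

# After training, the policy imitates the expert in every state.
learned = [int(np.argmax(softmax(logits[s]))) for s in range(n_states)]
print(learned)  # -> [0, 1, 2, 0], matching s % 3 for each state
```

The same log-likelihood objective, scaled up to token sequences and a Transformer, is the supervised fine-tuning step described above.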
[00:07:42] systems were good but they weren't that good so the next idea was to try to
[00:07:45] good so the next idea was to try to explicitly think about utility or
[00:07:47] explicitly think about utility or rewards like how good were these um
[00:07:50] rewards like how good were these um particular labels or these particular
[00:07:52] particular labels or these particular outputs so here we're going to actually
[00:07:54] outputs so here we're going to actually build a model we're going to build a
[00:07:56] build a model we're going to build a model of a reward
[00:07:59] model of a reward which relates to model-based
[00:08:01] which relates to model-based reinforcement
[00:08:03] reinforcement learning and the way we're going to do
[00:08:05] learning and the way we're going to do this is um or the way they did this is
[00:08:06] this is um or the way they did this is we collect preference data we ask people
[00:08:08] we collect preference data we ask people to compare or rank ACR across different
[00:08:11] to compare or rank ACR across different forms of outputs and then we use that to
[00:08:13] forms of outputs and then we use that to learn a preference model and we're going
[00:08:15] learn a preference model and we're going to cover that in this class that's going
[00:08:17] to cover that in this class that's going to be one of the differences to this
[00:08:18] to be one of the differences to this class compared to a couple years ago
[00:08:20] class compared to a couple years ago that um I think preference-based reward
[00:08:22] that um I think preference-based reward signals are really important and very
[00:08:24] signals are really important and very powerful and so we're going to be
[00:08:25] powerful and so we're going to be covering that in this class this term so
[00:08:28] covering that in this class this term so in this case they would learn a reward
[00:08:30] in this case they would learn a reward and again don't worry if you haven't if
[00:08:32] and again don't worry if you haven't if you're not familiar with what rewards
[00:08:33] you're not familiar with what rewards are and stuff we'll go through all of
[00:08:34] are and stuff we'll go through all of that I just want to give you a high
[00:08:35] that I just want to give you a high level sort of sense of how chat GPT is
[00:08:37] level sort of sense of how chat GPT is related to some of the things we're
[00:08:38] related to some of the things we're going to cover in the class so they
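The preference-modeling step can be sketched with the Bradley-Terry model commonly used for this purpose: given data where output A was preferred to output B, fit a scalar reward so that sigmoid(r(A) − r(B)) matches the observed preferences. Everything below is a toy under that assumption; outputs are indices with one learned scalar each rather than scored text.

```python
import numpy as np

# Learning a reward model from pairwise preference data (Bradley-Terry):
# P(A preferred to B) = sigmoid(r(A) - r(B)).

rng = np.random.default_rng(0)
n_outputs = 4
true_quality = np.array([0.0, 1.0, 2.0, 3.0])  # hidden ground truth
r = np.zeros(n_outputs)                        # learned reward parameters

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Simulate annotators: they prefer the higher-quality output, noisily.
pairs = []
for _ in range(2000):
    a, b = rng.choice(n_outputs, size=2, replace=False)
    if rng.random() < sigmoid(true_quality[a] - true_quality[b]):
        pairs.append((a, b))   # a was preferred
    else:
        pairs.append((b, a))

# Maximize log-likelihood of the observed preferences by gradient ascent.
lr = 0.05
for _ in range(300):
    grad = np.zeros(n_outputs)
    for winner, loser in pairs:
        p = sigmoid(r[winner] - r[loser])
        grad[winner] += (1 - p)
        grad[loser] -= (1 - p)
    r += lr * grad / len(pairs)

# Rewards are only identified up to an additive constant, but the
# learned ranking recovers the ground-truth quality ordering.
print(np.argsort(r))  # -> [0 1 2 3]
```

Rankings are cheap to collect compared to absolute scores, which is a big part of why preference data is used here.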
[00:08:40] So they learned a reward signal, and then they did reinforcement learning using that learned reward signal. This is called RLHF, because it is reinforcement learning from human feedback. I'll just note here that this was not the first time this idea was introduced; it had been introduced maybe four to five years earlier, for simulated robotics tasks, but ChatGPT demonstrated that this really made a huge difference in performance. So I think it's a really nice example of the types of ideas we're going to be covering, as well as of the incredible successes that are possible.
[00:09:26] Now, even before ChatGPT came along, there was starting to be huge interest in reinforcement learning. We have an optional textbook for the class, by Sutton and Barto. Richard Sutton is from Canada and is one of the founders of the field. When I started in reinforcement learning and would give my talks at conferences, it used to be me and Rich and 30 other people. Nobody cared about reinforcement learning; I mean, a few of us did, because we thought it was really amazing. But as you can see, through the 2000s, which is when I was getting my training around here, there just weren't that many papers, and the community wasn't nearly as large. This nice paper by Peter Henderson, where the y-axis is the number of papers, shows that there has been an enormous increase in interest. I think a lot of this was really due to the fact that around this point there were some amazing successes on the Atari video games, where people showed that you could learn directly from pixel input to make decisions; then there started to be the successes of AlphaGo, and then there were more and more successes.
[00:10:34] So it is an incredible time for reinforcement learning; this curve has continued to go up. However, I think it's also important to note that there are a number of skeptics. There was a pretty famous talk by Yann LeCun in 2016 at one of the major machine learning conferences, NeurIPS. Yann LeCun, for those of you who don't know him, is one of the seminal figures in neural network research; he has won the Turing Award, and he is an amazing, amazing researcher. He gave a keynote at NeurIPS, I believe it was a keynote, he certainly gave a very famous talk there, where he talked about the role of different types of machine learning questions and subareas in making progress on machine learning. He very famously described machine learning as a cake. He said that the main cake is really unsupervised learning: that's going to be the body, the most important aspect of machine learning, things like representation learning from unlabeled data. That's really going to be the core; that's where we're going to have huge amounts of data and make a lot of progress. Supervised learning was the icing: still pretty important, a very important part of a cake, at least in my opinion, but we don't have as much supervised data, and it's this additional layer. And then he argued that reinforcement learning was just the cherry. Now, you know, cherries are important, but not nearly as much, perhaps, as the rest of the cake. He went on to talk about some places where he thought RL still might have a role, but it was considered a really important talk because what he was suggesting is that reinforcement learning had a part to play in machine learning, but maybe only a very minor part. Now, I think it would be interesting to talk to him today; I haven't talked to him recently, so I don't know what his current opinion is. But I think it's a really important thing to think about: where are all of these different techniques important, and where will we be able to make the most progress in terms of advancing AI?
[00:12:34] And so with that, we're going to try our first poll, which is about why you want to take this class. You'll have to bear with us a little bit; we had a few technical difficulties that we're working on with CTL, but it should work out. If you go to either the first link on Ed or to this HTTP link, and if you have any issues, like if it's hanging on the registration, just skip the registration and refresh, and that should sort it all out. Then just enter your SUNet ID as your screen name, and take a second and write down a bit about why you want to take this class. It could be anything: it could be that you're really curious about something, or it could be because you're doing an internship and they told you you had to take something about reinforcement learning. Any of those are fine. Just take a minute or two.
[00:16:15] Thanks for all the great reasons. I will talk about some of those when I talk about what we're going to cover today, and try to address why I think a lot of the things people are bringing up are things that we're going to be touching upon. I think it's really important to start by thinking about what reinforcement learning is about, because if we understand what it's about, then we know what sorts of questions we're interested in in this space, and we also understand what sorts of applications it might be helpful for. Though of course your creativity is unlimited, so you might come up with other ideas for applying RL that people have not thought of. The four things that people typically think about when they think about reinforcement learning as a discipline, and about what reinforcement learning involves, are: optimization, delayed consequences, exploration, and generalization.
[00:17:07] So the first is optimization. The optimization aspect is really just saying that we're thinking about the best way to make decisions, which means that we explicitly have to have some notion of utility. An example of this would be something like finding the minimum-distance route between two cities given a network of roads. This means you can directly compare different solutions, because if one solution has a smaller distance than the other, it is strictly preferred. So there are many, many important optimization questions, and reinforcement learning, because it is concerned with making good decisions, cares about us being able to rank, or decide across, those different options.
[00:17:45] know decide across those different ones the second one is delayed
[00:17:48] ones the second one is delayed consequences the idea being that the
[00:17:50] consequences the idea being that the decisions that we make now can affect
[00:17:51] decisions that we make now can affect things far later so maybe saving for
[00:17:54] things far later so maybe saving for retirement now has some uh immediate
[00:17:57] retirement now has some uh immediate cost but leads to some significant
[00:17:59] cost but leads to some significant benefit later or maybe there's something
[00:18:00] benefit later or maybe there's something you can do early in a video game that
[00:18:02] you can do early in a video game that later has a lot of benefit there are two
[00:18:05] later has a lot of benefit there are two reasons why delayed consequences is
[00:18:07] reasons why delayed consequences is challenging one is for the reason of
[00:18:10] challenging one is for the reason of planning um many of you might have
[00:18:12] planning um many of you might have actually raised your hand if you've
[00:18:13] actually raised your hand if you've taken AI at
[00:18:15] taken AI at Stanford okay so about half of you so
[00:18:17] Stanford okay so about half of you so you probably saw planning in Ai and and
[00:18:19] you probably saw planning in Ai and and planning is the idea that even when we
[00:18:22] planning is the idea that even when we understand how the world works it might
[00:18:23] understand how the world works it might be really complicated to try to decide
[00:18:25] be really complicated to try to decide what the optimal thing is to do so you
[00:18:27] what the optimal thing is to do so you could think of this like chess the rules
[00:18:29] could think of this like chess the rules are known um it's still really
[00:18:30] are known um it's still really complicated to think about what's the
[00:18:32] complicated to think about what's the right thing to do so when the decisions
[00:18:34] right thing to do so when the decisions you make involve reasoning not just
[00:18:36] you make involve reasoning not just about the immediate outcomes but the
[00:18:37] about the immediate outcomes but the longer term ramifications these sort of
[00:18:39] longer term ramifications these sort of planning problems are even
[00:18:41] planning problems are even harder but the other reason this is
[00:18:43] harder but the other reason this is really hard is when we're learning
[00:18:44] really hard is when we're learning meaning that like we don't know how the
[00:18:46] meaning that like we don't know how the world works and we're trying to
[00:18:47] world works and we're trying to understand how um through direct
[00:18:49] understand how um through direct experience so when we're learning um
[00:18:52] experience so when we're learning um temporal credit assignment is hard
[00:18:54] temporal credit assignment is hard meaning that if you take some action now
[00:18:56] meaning that if you take some action now and later on you receive a good outcome
[00:18:58] and later on you receive a good outcome or a bad outcome how do you figure out
[00:19:00] or a bad outcome how do you figure out which of your outcomes caused that good
[00:19:02] which of your outcomes caused that good or bad later later
[00:19:04] or bad later later result this happens all the time to us
[00:19:06] result this happens all the time to us as humans right like how do you know why
[00:19:08] as humans right like how do you know why you got into Stamford well I don't know
[00:19:10] you got into Stamford well I don't know was it because you colored your you know
[00:19:11] was it because you colored your you know you did you wanted coloring contest when
[00:19:13] you did you wanted coloring contest when you were a six because you scored well
[00:19:15] you were a six because you scored well in the SAT because you went to a good
[00:19:16] in the SAT because you went to a good high school or you wrote a really good
[00:19:18] high school or you wrote a really good essay it's really hard to understand
[00:19:20] essay it's really hard to understand this in some cases it may be impossible
[00:19:22] this in some cases it may be impossible but when we're getting to make repeated
[00:19:24] but when we're getting to make repeated decisions it's really important that we
[00:19:26] decisions it's really important that we can start to use the prior experience to
[00:19:28] can start to use the prior experience to figure out out which decisions were
[00:19:30] figure out out which decisions were important or LED to good outcomes so
[00:19:32] important or LED to good outcomes so that we can repeat them so that's one of
[00:19:34] that we can repeat them so that's one of the reasons why this is
[00:19:36] the reasons why this is hard exploration is one of my favorite
[00:19:38] hard exploration is one of my favorite things um in terms of reinforcement
[00:19:40] things um in terms of reinforcement learning and the idea of this is that
[00:19:43] learning and the idea of this is that the agent can only learn about the world
[00:19:45] the agent can only learn about the world through direct experience so it's like
[00:19:48] through direct experience so it's like trying to learn to ride a bike by trying
[00:19:50] trying to learn to ride a bike by trying and failing and trying again um and
[00:19:53] and failing and trying again um and through that direct experiencing
[00:19:54] through that direct experiencing learning the right way to ride a bike
[00:19:58] learning the right way to ride a bike and the key idea about this is that um
[00:20:02] and the key idea about this is that um information is censored in that you only
[00:20:04] information is censored in that you only get to learn about what you
[00:20:06] get to learn about what you try so for example right now you don't
[00:20:08] try so for example right now you don't know how much worse your life would be
[00:20:09] know how much worse your life would be if you were MIT I went to MIT for grad
[00:20:12] if you were MIT I went to MIT for grad school MIT is also a great place um but
[00:20:15] school MIT is also a great place um but you generally can't ever understand what
[00:20:17] you generally can't ever understand what that
[00:20:17] that counterfactual um life would have been
[00:20:19] counterfactual um life would have been like right it's one of the the central
[00:20:22] like right it's one of the the central challenges um it's also a huge challenge
[00:20:24] challenges um it's also a huge challenge in causal inference um which is another
[00:20:26] in causal inference um which is another big interest of mine and something my
[00:20:27] big interest of mine and something my lab works on so so this's this General
[00:20:29] lab works on so so this's this General challenge that you only get to learn
[00:20:30] challenge that you only get to learn about the actual things that you do as
[00:20:33] about the actual things that you do as an agent or as a human as an agent Etc
[00:20:36] an agent or as a human as an agent Etc um and so the question is how do you use
[00:20:38] um and so the question is how do you use that experience to figure out how to
[00:20:40] that experience to figure out how to make good decisions so as a concrete
[00:20:42] make good decisions so as a concrete example of this you can imagine you're a
[00:20:43] example of this you can imagine you're a company and you give some promotion to
[00:20:45] company and you give some promotion to all your customers you can't know what
[00:20:47] all your customers you can't know what it would have been like if you didn't
[00:20:49] it would have been like if you didn't give the promotion to those customers
[00:20:51] give the promotion to those customers and even if you can give it to One
[00:20:52] and even if you can give it to One customer and not another they are not
[00:20:54] customer and not another they are not the same people so I can't rewind and
[00:20:56] the same people so I can't rewind and say dilip who is our head ta this time
[00:20:59] say dilip who is our head ta this time I'm not going to give you the promotion
[00:21:00] I'm not going to give you the promotion let's see how that world would have
[00:21:01] let's see how that world would have worked out that's one of the central
[00:21:04] worked out that's one of the central challenges so we'll talk a lot about
[00:21:06] challenges so we'll talk a lot about exploration later because it's one of
[00:21:07] exploration later because it's one of the key things that is different
[00:21:09] the key things that is different compared to many prior
[00:21:11] compared to many prior approaches and generalization um has to
[00:21:14] approaches and generalization um has to do with this question of really wanting
[00:21:16] do with this question of really wanting to solve really big challenging problems
[00:21:19] to solve really big challenging problems so we'll talk a lot about what decision
[00:21:21] so we'll talk a lot about what decision policies are but in general you can just
[00:21:22] policies are but in general you can just think of them as a mapping from
[00:21:24] think of them as a mapping from experience to to a decision and you
[00:21:26] experience to to a decision and you might think in those cases you could
[00:21:27] might think in those cases you could just pre-program it so like if your
[00:21:29] just pre-program it so like if your robot goes down the hallway if it hits
[00:21:31] robot goes down the hallway if it hits the end of the hallway turn
[00:21:33] the end of the hallway turn left but let's think about um a video
[00:21:36] left but let's think about um a video game which we can think of as just sort
[00:21:37] game which we can think of as just sort of generally having some input image so
[00:21:40] of generally having some input image so let's imagine that it's something like
[00:21:41] let's imagine that it's something like 300 by
[00:21:43] 300 by 400 and let's say we have at least 256
[00:21:46] 400 and let's say we have at least 256 different colors so now we have an image
[00:21:50] different colors so now we have an image set of images that we could see that is
[00:21:51] set of images that we could see that is at least 256 to the 300 CR 400 so those
[00:21:55] at least 256 to the 300 CR 400 so those are at least the space of images and
[00:21:57] are at least the space of images and that's probably an underestimate
[00:21:59] that's probably an underestimate and now we get to think about what we
[00:22:01] and now we get to think about what we would do in each of those different
[00:22:03] would do in each of those different scenarios so the combinatorics are
[00:22:06] scenarios so the combinatorics are completely mind-blowing and we can't
[00:22:07] completely mind-blowing and we can't write these down in a table so this is
[00:22:09] write these down in a table so this is why we would need something like a deep
[00:22:10] why we would need something like a deep neural network or something else in
[00:22:12] neural network or something else in order for us to try to make decisions in
[00:22:15] order for us to try to make decisions in these real like realistic settings which
[00:22:16] these real like realistic settings which are extremely large in terms of the the
[00:22:19] are extremely large in terms of the the type of scenarios the number of
[00:22:21] type of scenarios the number of scenarios we want to make decisions
[00:22:24] scenarios we want to make decisions on so you've probably seen all of these
[00:22:26] on so you've probably seen all of these ideas or at least most of them in other
[00:22:28] ideas or at least most of them in other classes for other types of AI or machine
[00:22:31] classes for other types of AI or machine learning so I think it's useful just to
[00:22:33] learning so I think it's useful just to contrast what is reinforcement learning
[00:22:35] contrast what is reinforcement learning doing um compared to these other ones so
[00:22:38] doing um compared to these other ones so the first is AI planning so in AI
[00:22:41] the first is AI planning so in AI planning generally we're doing some form
[00:22:42] planning generally we're doing some form of optimization Trying to minimize a
[00:22:45] of optimization Trying to minimize a distance or something like that we are
[00:22:47] distance or something like that we are often trying to handle delayed
[00:22:49] often trying to handle delayed consequences and um and those are the
[00:22:52] consequences and um and those are the two main things so we we might also have
[00:22:53] two main things so we we might also have to do generalization if the the size of
[00:22:56] to do generalization if the the size of the Space is really large
[00:22:58] the Space is really large okay so that is how so RL in general
[00:23:02] okay so that is how so RL in general will involve all of these so this is how
[00:23:04] will involve all of these so this is how those would
[00:23:05] those would compare if we think about something like
[00:23:07] compare if we think about something like supervised learning supervised learning
[00:23:09] supervised learning supervised learning does involve learning so we learn from
[00:23:13] does involve learning so we learn from data you know whether something's a cat
[00:23:14] data you know whether something's a cat or
[00:23:15] or not and we have to do
[00:23:20] generalization so we have those two
[00:23:22] generalization so we have those two things and again this is going to be
[00:23:23] things and again this is going to be compared to reinforcement learning which
[00:23:25] compared to reinforcement learning which has all of those
[00:23:29] in contrast to supervised learning where
[00:23:30] in contrast to supervised learning where you get the correct labels and
[00:23:32] you get the correct labels and unsupervised learning we don't get any
[00:23:33] unsupervised learning we don't get any labels but we're still learning from
[00:23:35] labels but we're still learning from experience and we're still trying to do
[00:23:41] Now, the next thing, and this has become a really popular thing, is to think about whether we can map reinforcement learning to imitation learning. We talked about this really briefly with ChatGPT, and we'll talk about it a lot more in the course. In imitation learning, or behavior cloning, or reducing reinforcement learning to supervised learning, we generally assume that we get access to expert trajectories. This could be someone saying what they would do in response to those prompts; it could be someone driving a car, where you then want to mimic their behavior; or some other similar example. So the idea is that we get input demonstrations of good policies, and that allows us to reduce reinforcement learning back to supervised learning. So we're sort of taking this and reducing it back.
[00:24:31] reducing it back to here now I think in general the idea of
[00:24:33] here now I think in general the idea of reductions is incredibly powerful um for
[00:24:36] reductions is incredibly powerful um for those of you that have taken CS Theory
[00:24:38] those of you that have taken CS Theory classes that's what we do all the time
[00:24:39] classes that's what we do all the time we reduce things to sat or other things
[00:24:41] we reduce things to sat or other things like that and in general I think in
[00:24:43] like that and in general I think in computer science it's one of the
[00:24:44] computer science it's one of the strengths of it that they think of how
[00:24:46] strengths of it that they think of how can we reduce one problem to another and
[00:24:48] can we reduce one problem to another and then inherit all the progress that's
[00:24:49] then inherit all the progress that's been made on that
[00:24:50] been made on that problem so in this way reinforcement
[00:24:53] problem so in this way reinforcement learning is similar to other aspects of
[00:24:55] learning is similar to other aspects of computer science in that we will try
[00:24:57] computer science in that we will try often to reduce
[00:24:59] often to reduce reinforcement learning to other problems
[00:25:01] reinforcement learning to other problems this is particularly done in the
[00:25:02] this is particularly done in the theoretical aspects of reinforcement
[00:25:05] theoretical aspects of reinforcement learning yeah yeah just oh whenever you
[00:25:08] learning yeah yeah just oh whenever you um ask just because I'm going to try and
[00:25:09] um ask just because I'm going to try and learn names could you say your name
[00:25:10] learn names could you say your name please yeah my name is yeah so just to
[00:25:13] please yeah my name is yeah so just to be clear um imitation learning then
[00:25:15] be clear um imitation learning then isn't like a separate technique it's
[00:25:17] isn't like a separate technique it's just an application of supervised
[00:25:18] just an application of supervised learning to like the specific
[00:25:20] learning to like the specific reinforcement learning context it's good
[00:25:22] reinforcement learning context it's good question so I think some of you I mean
[00:25:24] question so I think some of you I mean there's a lot of techniques that think
[00:25:25] there's a lot of techniques that think about when you're doing imitation
[00:25:27] about when you're doing imitation learning specifically for kind of
[00:25:28] learning specifically for kind of decision data you can just think of it
[00:25:31] decision data you can just think of it just reducing it back um if you want to
[00:25:35] just reducing it back um if you want to do imitation learning where you might
[00:25:36] do imitation learning where you might recover like the reward function we'll
[00:25:38] recover like the reward function we'll talk more about that soon and others
[00:25:40] talk more about that soon and others then you may need to use other types of
[00:25:41] then you may need to use other types of techniques as well but like the most
[00:25:44] techniques as well but like the most straightforward aspect is just to say
[00:25:46] straightforward aspect is just to say I've got demonstrations I'm going to
[00:25:48] I've got demonstrations I'm going to ignore um sort of like this uh delayed
[00:25:51] ignore um sort of like this uh delayed consequences aspect and exploration and
[00:25:53] consequences aspect and exploration and I'm just going to reduce it
[00:25:55] I'm just going to reduce it back yeah and name first please
[00:25:59] Yeah, and name first, please. Wait, what do you mean by input demonstrations of good policies? What does that mean? Great question, so let me give you an example. People have thought a lot about this; maybe one of the first really public examples was for driving. At the least, what you could do is: I could drive a car, it could record everything that I do in terms of controlling the steering wheel, and then, if I'm a good driver, they could say that's a good demonstration. So instead of the car trying to learn by itself how to steer the wheel in order to, say, successfully drive, you could have humans drive it, and it could try to figure out at each point how it should steer the wheel in order to have good behavior. So the idea is that you already have access to good demonstrations of what a good policy is.
[00:26:48] What exactly do you mean by optimization? What do you mean by optimization? Okay, good question. What I mean is that when we do imitation learning from good trajectories, we are assuming that we want to do well, so we want to actually get a good policy. In imitation learning we're not normally trying to imitate bad performance. You could think of this as sort of reinforcement learning, but without the exploration part, because it's not trying to pick its own data. ...have the optimization? Yeah, so I think it's because we normally don't have the notion of utility in those settings. You might say this is a cat or it's not a cat, but not that it's a good picture of a cat or not, whereas with decisions we often have a real-valued scalar, like: it was a 0.7-good decision. Yeah, name first, please.
[00:27:51] ...optimize? Yes, so we do often... we always have loss functions, and that's a great point, but in those cases there's not normally a utility that goes with it. You could maybe have some smooth notion there of how well you match, like, a stochastic policy, a stochastic output, but for many of those it would be more like: did you say it was a cat or not, and you would have a binary 0-1 loss.
[00:28:19] Yeah. So does that mean that if you have the data for imitation learning, it's almost always better than reinforcement learning? You're kind of avoiding the... you can directly learn what is good. Say that again? So, if you have the data for imitation learning, like you have someone actually driving the car, does that mean that you'll probably learn a better policy than with reinforcement learning? Great question; we'll get into that. The question, if you couldn't hear it, is: if you have good demonstrations, say of driving behavior, and you're using imitation learning, can that be better than reinforcement learning? It will depend on your reinforcement learning algorithm. In general, reinforcement learning should always be able to equal or exceed the performance of imitation learning.
[00:29:05] "So can you explain the difference between imitation learning and RLHF?" Yes, great question. In imitation learning, and this was the first part, you would say: given a prompt, I look on the internet and I assume that those responses were good. So if on the internet I see that someone asked how to explain reinforcement learning to a six-year-old, and this is what people said back, I just train on those. What RLHF says is: well, you know, the internet is a big place, and probably not all of it is good answers. So now let's actually ask people which of two responses they prefer, and then try to do reinforcement learning on that to actually get to a better policy.
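To make the contrast concrete, here is a tiny toy sketch (all the names and the stand-in "preference model" are my assumptions, not anything from the course): imitation learning fits directly to whatever demonstrations were collected, treating them all as good, while an RLHF-style approach uses human preference comparisons as a reward signal to select better behavior.

```python
# Toy illustration only; real systems train neural policies and reward models.

# Imitation learning: "training" here is just memorizing prompt -> response
# pairs scraped as-is, with no judgment about their quality.
def imitation_learning(demos):
    return {prompt: response for prompt, response in demos}

# RLHF-style: use pairwise human preferences (standing in for a learned
# reward model) to pick the preferred response among candidates.
def rlhf(prompt, candidates, prefer):
    # prefer(a, b) returns whichever of a, b the human prefers.
    best = candidates[0]
    for c in candidates[1:]:
        best = prefer(best, c)
    return best

demos = [("2+2?", "4"), ("capital of France?", "Paris")]
policy = imitation_learning(demos)
print(policy["2+2?"])  # imitation simply replays the demonstration: 4

# Assumed preference: humans prefer the shorter of two correct answers.
prefer = lambda a, b: a if len(a) <= len(b) else b
print(rlhf("2+2?", ["the answer is four", "4"], prefer))  # picks "4"
```

The point of the sketch is only the difference in signal: imitation learning never questions its data, while RLHF introduces a comparison-based reward it can optimize against.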
[00:29:48] "Yeah, I'd like to ask: AlphaGo actually discovered some Go strategies that were not invented by humans, that we had never seen before. So doesn't that mean that if we apply imitation learning too much, it might actually hinder the model's ability to explore what is actually good beyond what humans have thought of?" Absolutely, and actually I think this is on the next slide. Good, okay, perfect.
[00:30:13] side let's go it good okay perfect so this turn to where are some of the
[00:30:15] this turn to where are some of the places that you might hope that
[00:30:16] places that you might hope that reinforcement learning would be better
[00:30:17] reinforcement learning would be better than these other strategies so one of
[00:30:20] than these other strategies so one of them is where you don't have examples of
[00:30:21] them is where you don't have examples of desired Behavior this is exactly like
[00:30:23] desired Behavior this is exactly like the example that was just brought up if
[00:30:25] the example that was just brought up if you want to go beyond Human Performance
[00:30:27] you want to go beyond Human Performance you cannot rely on Human Performance
[00:30:29] you cannot rely on Human Performance just to do imitation learning because
[00:30:30] just to do imitation learning because you're not going to be able to go get
[00:30:32] you're not going to be able to go get better than it so there are a lot of
[00:30:34] better than it so there are a lot of application areas I think particularly
[00:30:36] application areas I think particularly in areas like healthcare or education
[00:30:38] in areas like healthcare or education and others where we think we can go
[00:30:40] and others where we think we can go beyond Human Performance um and so in
[00:30:43] beyond Human Performance um and so in those cases reinforcement learning
[00:30:44] those cases reinforcement learning because it's trying to optimize
[00:30:45] because it's trying to optimize performance um could go beyond it could
[00:30:48] performance um could go beyond it could be a particularly useful technique and
[00:30:50] be a particularly useful technique and others where you don't have any existing
[00:30:52] others where you don't have any existing data for a task so there might be
[00:30:54] data for a task so there might be something where you think of it as um a
[00:30:55] something where you think of it as um a decision-making problem but you don't
[00:30:57] decision-making problem but you don't have prior data you need to learn from
[00:30:58] have prior data you need to learn from scratch um and you want to directly
[00:31:01] scratch um and you want to directly optimize so that's another place where
[00:31:02] optimize so that's another place where reinforcement learning can be very
[00:31:05] reinforcement learning can be very powerful another category is interesting
[00:31:08] powerful another category is interesting because in some ways it's also kind of a
[00:31:10] because in some ways it's also kind of a reduction technique and this is the
[00:31:12] reduction technique and this is the question to place where you have an
[00:31:13] question to place where you have an enormous search or optimization problem
[00:31:15] enormous search or optimization problem with delayed outcomes so there's been a
[00:31:18] with delayed outcomes so there's been a number of examples of the work of doing
[00:31:20] number of examples of the work of doing this from Deep Mind which have been
[00:31:22] this from Deep Mind which have been really extremely elegant so what I I put
[00:31:24] really extremely elegant so what I I put up here is Alpha tensor um if you
[00:31:26] up here is Alpha tensor um if you haven't heard of it it's a faster way to
[00:31:28] haven't heard of it it's a faster way to do matrix multiplication which is kind
[00:31:30] do matrix multiplication which is kind of mind-blowing so what they did is they
[00:31:32] of mind-blowing so what they did is they said all right there's standard ways to
[00:31:34] said all right there's standard ways to do matrix multiplication this comes up
[00:31:36] do matrix multiplication this comes up all the time could we learn an algorithm
[00:31:39] all the time could we learn an algorithm that would be better at matrix
[00:31:40] that would be better at matrix multiplication not me as like a
[00:31:42] multiplication not me as like a scientist try to write down an algorithm
[00:31:44] scientist try to write down an algorithm have a an agent learn one and they they
[00:31:48] have a an agent learn one and they they showed yes and the way they did that was
[00:31:49] showed yes and the way they did that was with reinforcement learning um and
[00:31:51] with reinforcement learning um and they've done this in other cases too
[00:31:53] they've done this in other cases too like learning faster sorting
[00:31:55] like learning faster sorting algorithms so I think this is a pretty
[00:31:56] algorithms so I think this is a pretty incredible Frontier the idea is saying
[00:31:58] incredible Frontier the idea is saying could we have ai actually be inventing
[00:32:00] could we have ai actually be inventing new algorithms um and one of the ways
[00:32:03] new algorithms um and one of the ways that they they framed it here and you
[00:32:05] that they they framed it here and you can think of alphao is similar is that
[00:32:07] can think of alphao is similar is that it was a really really really large
[00:32:09] it was a really really really large search
[00:32:10] The challenge with really, really large search problems is that even there we may not have great techniques for solving them. So it's sort of a reduction: you can think of people taking a planning problem and trying to reduce it to a reinforcement learning problem to make it more tractable. That's pretty wild. Most of the time we think of RL being reduced in the other direction, or involving planning. But here, in some ways, you can think of these as either adversarial planning problems or expectimax problems that are being reduced back to learning as a way to more efficiently go through the search space. So those are two of the areas that I think are particularly promising in terms of why reinforcement learning is still a really practical and really important area to think about.
[00:32:53] I think I saw a question in the back, but maybe you. Yeah, oh, what was your name? "For AlphaTensor, is it faster but within some error of the correct matrix product, or do you actually get the correct value?" No, you get the correct value, which is wild. Yeah, so no, it's just better. And one of the really clever things they had to think of in this case was: how do you know that the answer is correct? How could you provably verify that? So yeah, incredibly elegant.
[00:33:21] verify that so yeah incredibly elegant all right now we're going to go
[00:33:24] elegant all right now we're going to go quickly through some course Logistics um
[00:33:26] quickly through some course Logistics um before starting to dive into some
[00:33:27] before starting to dive into some content and feel free to interrupt me
[00:33:29] content and feel free to interrupt me throughout this or anything else if you
[00:33:30] throughout this or anything else if you have other
[00:33:32] In terms of the content, we're going to start off by talking about Markov decision processes and planning, and then we're going to talk about model-free policy evaluation and model-free control. Don't worry if you don't know what I mean by "model"; I'll specify it. Then we're going to jump into policy search. Policy search is things like proximal policy optimization, REINFORCE, and other approaches; some of you might have already seen related ideas, say in robotics, if you've taken those classes. And then I'm highlighting here one of the important differences compared to prior years: we're going to do a deep dive into offline reinforcement learning, "offline" here meaning that we have a fixed amount of data and we want to learn from it to get a good decision policy. During this we're going to talk about reinforcement learning from human feedback and direct preference optimization. So that's going to be a new sort of third part of the course that we haven't done assignments on before, which I think will be pretty exciting. We'll also talk about exploration and do advanced topics.
[00:34:40] The high-level learning goals of the class are that by the end you should be able to define the key features of reinforcement learning; given an application, you should be able to specify how you would write it down as a reinforcement learning problem, as well as whether or not you think it would be good to use RL for it; you should be able to implement and code common RL algorithms; and you should understand the theoretical and empirical approaches for evaluating the quality of an RL algorithm. As you can probably imagine from those papers going up, there's going to be continued progress in this field and a huge number of different RL algorithms, so one of the key things I hope to talk about is how you evaluate and compare them, which might vary depending on the application area you care about.
[00:35:23] application area you care about so the way that the course is
[00:35:26] about so the way that the course is structured is that we'll have live
[00:35:27] structured is that we'll have live lectures we'll have three homeworks
[00:35:29] lectures we'll have three homeworks we'll have a midterm we'll have a
[00:35:31] we'll have a midterm we'll have a multiple choice
[00:35:32] multiple choice quiz we'll do a final project um and
[00:35:36] quiz we'll do a final project um and then we'll have what I call check or
[00:35:37] then we'll have what I call check or refresh your understanding exercises
[00:35:39] refresh your understanding exercises which will be going through um the poll
[00:35:42] which will be going through um the poll anywhere and we'll have problem sessions
[00:35:44] anywhere and we'll have problem sessions which are optional problem sessions are
[00:35:47] which are optional problem sessions are a great chance to think more about the
[00:35:49] a great chance to think more about the conceptual and the theoretical aspects
[00:35:50] conceptual and the theoretical aspects of the class um and they'll be held
[00:35:53] of the class um and they'll be held starting next
[00:35:56] So one of the main application areas I think about a lot is education. I think education is one of the greatest tools we have to try to address poverty and inequality, and so I'm really interested in evidence about how we educate effectively. With respect to that, I wanted to share a paper that came out, I guess, almost a decade ago now, where they did a study looking at how people taking massive open online courses spent their time and how that related to their learning outcomes. What they found is that doing more activities seemed to have about a six times larger learning benefit compared to watching videos or reading. You might think this is just based on time, but it wasn't; in fact, it seemed like students spent less time per activity than per page of reading. I bring this up because sometimes people come talk to me right before the midterm and say, "I've rewatched your lectures like three times, what else can I do?" And while I am flattered that they want to watch the lectures that many times, I really highly recommend you don't do that, and that instead you spend time doing problems: going through problems from the sessions, going through the homework, going through the check-your-understandings. It's far more effective and efficient in general. So in general, engaged practice, particularly forced recall, where you have to think about things without checking the answers, is shown to be very effective for learning. To achieve the class learning goals, I encourage you to spend as much of the time you have available for the course as you can on those types of directly engaging activities, rather than more passive ones like reading or watching. Yeah? Name first?
[00:37:37] reading or watching yeah name first um do you have a time frame for
[00:37:40] first um do you have a time frame for when the um problem sessions will be
[00:37:43] when the um problem sessions will be held great question we will announce
[00:37:45] held great question we will announce those um by the end of
[00:37:46] those um by the end of tomorrow and for those ones we know it's
[00:37:49] tomorrow and for those ones we know it's like impossible to coordinate schedules
[00:37:50] like impossible to coordinate schedules so um if you can't make it we encourage
[00:37:52] so um if you can't make it we encourage you to come in person but if you can't
[00:37:54] you to come in person but if you can't make it we also release all the
[00:37:55] make it we also release all the materials and the videos afterwards
[00:38:01] Okay, I will highlight, I guess, also on this (I saw several people asking about this, so let me just go back to this part and cover it): several people mentioned that they were excited about having some more theoretical aspects. This class does involve theory. There's probably more theory than in the normal machine learning and AI classes, a little bit more, but not as much as an advanced seminar on theory. Normally most problem sets will have about one theory question, and if you're not familiar with some of the theoretical techniques, that's totally fine; you can come to problem sessions, and you don't have to have any prior background in doing proofs to be able to succeed. Another thing people asked about was Monte Carlo tree search, and several people brought up reinforcement learning from human feedback; we will be talking about that. Some people asked about multi-agent settings; we're going to be thinking about Monte Carlo tree search and other ways to have multiple agents that are making decisions. And a number of people said they wanted to get up to speed on the latest ideas in reinforcement learning so they could read papers or do things in their applications, and I think this is all very relevant to that.
[00:39:14] this was all very relevant to that um the final thing is just uh we
[00:39:18] that um the final thing is just uh we have five wonderful Tas who will be
[00:39:19] have five wonderful Tas who will be supporting the main ways to get
[00:39:21] supporting the main ways to get information about the class is to go to
[00:39:23] information about the class is to go to the website or go to Ed um we'll be
[00:39:25] the website or go to Ed um we'll be releasing our office hours by theend of
[00:39:27] releasing our office hours by theend of tomorrow and we'll start them for the
[00:39:29] tomorrow and we'll start them for the rest of the week and all of you guys are
[00:39:31] rest of the week and all of you guys are completely capable of succeeding in the
[00:39:33] completely capable of succeeding in the course and we're here to
[00:39:35] course and we're here to help
[00:39:37] "Yeah, thank you. Going back to the course topics slide: do some of those topics include model-based approaches as well?" Yeah, great question. In the first part, when we first start talking here, we'll talk about models at the beginning, particularly when we're defining Markov decision processes, and then we will likely be talking about them again when we get into the offline approach. There are a lot of really interesting questions here; we'll get into the fact that there are a lot of different representations you can use for reinforcement learning, and there are a lot of questions about which to use when, or when to combine them, and in particular where errors propagate in the different types of representations in terms of leading to errors in the final decisions you make. But model-based reinforcement learning can certainly be a really powerful tool. Any other questions on the logistics?
[00:40:31] All right, so let's start to dive into the material. We're going to start with a refresher exercise. Raise your hand if you've seen reinforcement learning at least a little bit in the past. Okay, so most people, not all. If you haven't, and everything I'm about to say doesn't make sense, don't worry, we're going to cover it; but I like to get a gauge in case people have seen all of this before, for the very beginning of the course. So this is going to be a refresh exercise, and we're going to do it on Ed. I'll put the link up again, or you can go to Ed; it'll be the second link. Here's the question: we're going to think about how we would formulate a particular problem as a reinforcement learning problem, or as a Markov decision process.
[00:41:12] as a Markoff decision process so one of the first application areas to use
[00:41:14] the first application areas to use reinforcement learning for Education
[00:41:16] reinforcement learning for Education used it in roughly the following way not
[00:41:19] used it in roughly the following way not exactly the idea was that you would have
[00:41:21] exactly the idea was that you would have a student that didn't know a set of
[00:41:23] a student that didn't know a set of topics let's here just consider addition
[00:41:26] topics let's here just consider addition which we'll assume is an easier topic
[00:41:27] which we'll assume is an easier topic for people to learn and subtraction
[00:41:29] for people to learn and subtraction which we're going to assume is harder
[00:41:32] which we're going to assume is harder imagine the beginning the student
[00:41:33] imagine the beginning the student doesn't know either of these things and
[00:41:35] doesn't know either of these things and what the AI tutor agent can do is they
[00:41:37] what the AI tutor agent can do is they can provide practice problems they can
[00:41:39] can provide practice problems they can provide subtraction uh practice problems
[00:41:42] provide subtraction practice problems, or they can provide addition practice problems, and what happens is the AI agent gets a reward of plus one if the student gets the problem right, and they get a minus one if the student gets the problem wrong. And so what I'd like you to think about here is: to model this as a decision process, what would the state space be, the action space, the reward model? If you've taken classes with Markov decision processes before and you don't remember, it's totally fine to look it up and refresh your memory, this is not a test. I'd like you to write down what a dynamics model would represent in this case, and then in particular what a policy to optimize the expected discounted sum of rewards would do in this case, for how I've set up this scenario.
[00:42:24] So I'd like you to write down your answers, enter them in Ed, and then we're going to do some small group discussion in about 5 minutes.
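As one hedged sketch of how this exercise could be written down in code: the skill names, the mastery numbers, and the 0.05 learning increment below are all illustrative assumptions, not part of the lecture.

```python
import random

# Hypothetical sketch of the tutoring decision process described above.
# State: the agent's estimate of the student (mastery of each skill).
# Actions: give an addition problem or a subtraction problem.
# Reward: +1 if the student answers correctly, -1 otherwise.

ACTIONS = ["addition", "subtraction"]

def step(mastery, action, rng=random):
    """One interaction: pose a problem, observe correct/incorrect, get reward."""
    correct = rng.random() < mastery[action]     # chance the student gets it right
    reward = 1 if correct else -1
    # Assumed dynamics model: practicing a skill nudges its mastery upward.
    new_mastery = dict(mastery)
    new_mastery[action] = min(1.0, mastery[action] + 0.05)
    return new_mastery, reward

mastery = {"addition": 0.9, "subtraction": 0.4}  # one possible state representation
mastery, r = step(mastery, "subtraction")
print(r)  # +1 or -1 depending on the sampled answer
```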
[00:43:49] And if you're not familiar with these particular words like state space etc., it's still fine just to think about, given what I've told you about the reward for an agent, what might happen in this case.
[00:45:21] Ah okay, sorry, you might have to switch to Ed.
[00:47:12] All right, try to enter in something, it's okay if you're not sure, and then turn to someone near you and compare what you did.
[00:50:14] All right, we're going to come back. Hopefully I heard a lot of really fruitful discussions. So let's see, I know at least one group I talked to had a great idea for what the state space could be. Do you guys want to share what your state space was, and maybe tell your name as well?
[00:50:34] Sure, the state space could be just a set of pairs of, like, natural numbers or any kind of numbers, of how good the student is at addition and then how good the student is at subtraction.
[00:50:46] Yeah, so you could imagine something which is how good they are at addition and subtraction. So you could imagine something like this, where you just have a vector pair, where maybe they're 0.9 close to mastery for addition and 0.4 close to mastery for subtraction. This is not the only way you could write it down, there's lots of choices for the state space, but that would certainly be one reasonable one. Those are challenging in some ways because you can't directly observe them, but it's a pretty natural way to write it down. And in fact there are commercial systems that essentially do that, where, for those of you familiar with hidden Markov models, it's basically a hidden Markov model over whether someone has mastered something or not. Can we have a different type of state space that they wrote down?
[00:51:30] type of State space that they wrote down yeah may I talked about we
[00:51:33] down yeah may I talked about we basically wanted oh could you say your
[00:51:35] basically wanted oh could you say your name first please um the knowledge that
[00:51:37] name first please um the knowledge that the student has and also maybe the
[00:51:39] the student has and also maybe the questions that have already been asked
[00:51:40] questions that have already been asked to like capture the environment the
[00:51:42] to like capture the environment the current environment that were at um so
[00:51:44] current environment that were at um so we like I guess this is a better
[00:51:46] we like I guess this is a better representation of capturing the
[00:51:47] representation of capturing the knowledge the student has we were
[00:51:48] knowledge the student has we were thinking of also just like the history
[00:51:51] thinking of also just like the history of questions and students answers
[00:51:52] of questions and students answers whether they got it right or not um I
[00:51:55] whether they got it right or not um I guess that's harder to represent
[00:51:57] guess that's harder to represent no that's beautiful so exactly what was
[00:51:59] no that's beautiful so exactly what was that right yeah that's exactly so that
[00:52:00] that right yeah that's exactly so that was the other one I was hoping people
[00:52:01] was the other one I was hoping people might come up with which is the idea of
[00:52:03] might come up with which is the idea of this just being a history like a history
[00:52:05] this just being a history like a history of all the previous questions you've
[00:52:07] of all the previous questions you've asked given or all the questions the
[00:52:08] asked given or all the questions the robot has given the person and what
[00:52:10] robot has given the person and what they've responded um so you could
[00:52:12] they've responded um so you could imagine it's like uh
[00:52:15] imagine it's like uh observation
[00:52:17] observation question
[00:52:19] question reward dot dot dot and in fact those two
[00:52:23] reward dot dot dot and in fact those two representations here the history and how
[00:52:25] representations here the history and how the student um how good the student is
[00:52:27] the student um how good the student is can be depending on your representation
[00:52:28] can be depending on your representation be exactly isomorphic so sometimes this
[00:52:31] be exactly isomorphic so sometimes this can be um a sufficient statistic to
[00:52:33] can be um a sufficient statistic to capture that history and as was pointing
[00:52:35] capture that history and as was pointing out one of the challenges with histories
[00:52:37] out one of the challenges with histories is that they grow unboundedly so if you
[00:52:39] is that they grow unboundedly so if you want to have like your neural network be
[00:52:41] want to have like your neural network be predicting something you might be able
[00:52:42] predicting something you might be able to use something like an lstm or you
[00:52:44] to use something like an lstm or you might want to summarize the state so
[00:52:46] might want to summarize the state so those are both great ideas for what the
[00:52:48] those are both great ideas for what the um States could be there's not a right
[00:52:49] um States could be there's not a right answer both of them would be great but
[00:52:51] answer both of them would be great but there's also other ones the actions I
[00:52:53] there's also other ones the actions I heard many people share what the actions
[00:52:54] heard many people share what the actions are someone want to tell me what they in
[00:52:56] are someone want to tell me what they in there I know you guys mentioned what the
[00:52:57] there I know you guys mentioned what the action space was sure just whether you
[00:53:00] action space was sure just whether you go an addition or subtraction exactly so
[00:53:02] go an addition or subtraction exactly so these are just like what the agent can
[00:53:04] these are just like what the agent can actually do the teaching agent addition
[00:53:07] actually do the teaching agent addition question or subtraction and the reward
[00:53:11] question or subtraction and the reward model is plus one if the student gets it
[00:53:16] I saw some questions about what a dynamics model is inside the responses people were putting on the form. What I mean by a dynamics model here, and we'll talk a lot more about this, is what happens to the state of the student after a question is given. So in this case, and I talked to some people about this who had a great understanding of this already, the idea would be sort of how does either that history change after you give a question to the student, or how does the internal knowledge of the student change. So the hope would be, as long as this curriculum is vaguely reasonable, that after you give the student an addition question they now know more about addition, or they're more likely to have mastered addition. So that would be this idea of there being a dynamics process where you start in one state, you take an action, and now you transition to a new state afterwards, and we'll talk a lot more about that.
[00:54:12] Now, what is the challenge with this particular representation? Yeah, can you say your name first please? Um, depending on your implementation, there's a risk that the agent just gives really easy problems. Yeah, in fact that's exactly right, and in fact that's exactly what we think will happen: we think that an agent that is maximizing its reward should only give easy questions.
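A quick sketch of why that happens: with reward +1 for a correct answer and -1 otherwise, the expected reward of giving a skill with success probability p is 2p - 1, so a greedy reward-maximizing policy always picks the skill the student is already best at. The numbers below are illustrative assumptions, not from the lecture.

```python
# With this reward, a greedy agent reward-hacks by drilling the easiest skill
# forever instead of teaching the skill the student is weakest at.

mastery = {"addition": 0.9, "subtraction": 0.4}

def greedy_action(mastery):
    # Pick the skill with the highest chance of a correct answer,
    # since expected reward 2p - 1 is increasing in p.
    return max(mastery, key=mastery.get)

choices = [greedy_action(mastery) for _ in range(10)]
print(choices.count("addition"))  # 10: subtraction is never practiced
```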
[00:54:45] So in this paper, which I took the inspiration from for this example, it was very close to this, where they tried to pick not correctness but how long it took people to do problems. And so if the student took less time to do problems, which isn't necessarily bad in itself, it might indicate some notion of fluency, the agent got more reward. But of course what that means is that you should just give really easy questions that will take the student no time to do, because then the agent can get lots and lots and lots of reward. And this is probably not what the designers of this system, who were trying to help students learn things, intended: they probably actually wanted the students to learn both addition and subtraction. But I bring this up because this is an example of what is often called reward hacking, where the reward that we specify does not necessarily produce the behavior that we really hope to achieve, and we will talk a lot more about this. In this case it's a fairly simple example where we can see it fairly quickly, but there are a lot of cases where it's a lot more subtle to understand whether or not the system really will do what you hope it will do, and we'll talk about that more throughout the course.
[00:55:52] throughout the course all right great so we're going to
[00:55:55] course all right great so we're going to now just start to talk about sort of
[00:55:56] now just start to talk about sort of sequential decision- making more broadly
[00:55:58] sequential decision- making more broadly and some of this will be reviewed for
[00:55:59] and some of this will be reviewed for some of you but I think it's useful to
[00:56:01] some of you but I think it's useful to go through and refresh our memories um
[00:56:03] go through and refresh our memories um so the idea in sequential decision-
[00:56:05] so the idea in sequential decision- making under uncertainty is that we're
[00:56:07] making under uncertainty is that we're going to have an agent that is taking
[00:56:10] going to have an agent that is taking decisions or actions so I'm going to use
[00:56:12] decisions or actions so I'm going to use actions and decisions
[00:56:15] actions and decisions interchangeably which are going to
[00:56:17] interchangeably which are going to interact in the world and then they're
[00:56:19] interact in the world and then they're going to get back some sort of
[00:56:20] going to get back some sort of observation and reward signal so in in
[00:56:23] observation and reward signal so in in the first example I just gave you it's
[00:56:25] the first example I just gave you it's like the agent um
[00:56:26] like the agent um you know provides a problem to the
[00:56:28] you know provides a problem to the student and then they see whether the
[00:56:30] student and then they see whether the student gets that correctly or incorrect
[00:56:32] student gets that correctly or incorrect and then they also use that information
[00:56:34] and then they also use that information to get a reward so s of getting reward
[00:56:36] to get a reward so s of getting reward and feedback and the goal in this case
[00:56:39] and feedback and the goal in this case is for the agent to select actions to
[00:56:41] is for the agent to select actions to maximize the total expected future
[00:56:43] maximize the total expected future reward meaning both the immediate reward
[00:56:45] reward meaning both the immediate reward they get now as well as the rewards
[00:56:47] they get now as well as the rewards they're going to get over time and this
[00:56:49] they're going to get over time and this generally is often going to involve
[00:56:51] generally is often going to involve balancing long-term and short-term
[00:56:53] balancing long-term and short-term rewards so there are lots and lots of
[00:56:55] rewards so there are lots and lots of examples I'll just go through a couple
[00:56:57] examples I'll just go through a couple of them just to give you a sense so one
[00:56:59] of them just to give you a sense so one is something like web advertising um in
[00:57:01] is something like web advertising um in this case you know Amazon for example
[00:57:04] this case you know Amazon for example might choose like a web ad ad to show
[00:57:06] might choose like a web ad ad to show you or a product to suggest to you they
[00:57:08] you or a product to suggest to you they might observe things like view time and
[00:57:10] might observe things like view time and whether or not you click on the ad
[00:57:12] whether or not you click on the ad whether or not you make a purchase and
[00:57:14] whether or not you make a purchase and the goal in this case could probably be
[00:57:15] the goal in this case could probably be for them to optimize either click time
[00:57:17] for them to optimize either click time or view time or
[00:57:20] or view time or Revenue in the context of something like
[00:57:22] Revenue in the context of something like robotics the control space or the
[00:57:25] robotics the control space or the decision space might be something like
[00:57:26] decision space might be something like how to move a
[00:57:28] how to move a joint um and then the feedback that the
[00:57:31] joint um and then the feedback that the the robot might get back might be
[00:57:32] the robot might get back might be something like a camera image of a
[00:57:33] something like a camera image of a kitchen and perhaps they just get a plus
[00:57:35] kitchen and perhaps they just get a plus one if there are no more dishes on the
[00:57:38] one if there are no more dishes on the counter now just quick question could
[00:57:40] counter now just quick question could this potentially be a reward hacked
[00:57:42] this potentially be a reward hacked specification I see some Smiles what
[00:57:44] specification I see some Smiles what could
[00:57:46] could happen yeah sorry robot could just push
[00:57:51] happen yeah sorry robot could just push everything off the car which I will say
[00:57:54] everything off the car which I will say um with you know it's tempting right
[00:57:56] um with you know it's tempting right you're like I'm just going to make it
[00:57:57] you're like I'm just going to make it all go away but in fact it does not
[00:57:59] all go away but in fact it does not solve the problem and now you just have
[00:58:00] solve the problem and now you just have broken dishes and food on the floor um
[00:58:02] broken dishes and food on the floor um so that would not be a good thing to do
[00:58:03] so that would not be a good thing to do so yeah this would be probably not a
[00:58:05] so yeah this would be probably not a great reward to put you probably want a
[00:58:07] great reward to put you probably want a reward more like that the dishes are
[00:58:08] reward more like that the dishes are inside of the dishwasher and finally
[00:58:10] inside of the dishwasher and finally clean so not just that they were put in
[00:58:12] clean so not just that they were put in there but actually that you ran the
[00:58:13] there but actually that you ran the dishwasher so this would be a second
[00:58:15] dishwasher so this would be a second example of of a setting another would be
[00:58:18] example of of a setting another would be something like blood pressure control
[00:58:20] something like blood pressure control where you could imagine that the agent
[00:58:22] where you could imagine that the agent gives recommendations like exercise or
[00:58:25] gives recommendations like exercise or medication the feedback is things like
[00:58:27] medication the feedback is things like blood pressure and then you would Define
[00:58:29] blood pressure and then you would Define some reward like maybe plus one if
[00:58:31] some reward like maybe plus one if you're in a healthy range else some sort
[00:58:33] you're in a healthy range else some sort of you know sloping penalty for being
[00:58:34] of you know sloping penalty for being outside of the healthy
[00:58:38] All right, so all of these are nice examples of the numerous ways where we often try to make sequences of decisions under uncertainty. In general, we're going to assume that we have a finite series of time steps, so we're not going to be thinking about continuous time in this class; lots of interesting things there, we're not going to cover it. What we're going to assume is that the agent is making a series of decisions, so we're going to think of there being a series of time steps, like 1 minute, 2 minutes, 3 minutes, 4 minutes. The agent will take an action, the world will update given that action and emit an observation and reward, and then the agent receives that, updates, and then makes another decision, and we just close this loop; it's a feedback cycle.
[00:59:17] this loop it's a feedback cycle in this case as we sort of just
[00:59:19] cycle in this case as we sort of just talked about at a high level we can
[00:59:21] talked about at a high level we can think of there being histories which is
[00:59:23] think of there being histories which is sequences of past actions rewards and
[00:59:26] sequences of past actions rewards and out observations up to the present time
[00:59:29] out observations up to the present time point so the history HT would consist of
[00:59:32] point so the history HT would consist of all the previous actions of the agent
[00:59:34] all the previous actions of the agent the observations it receives and the
[00:59:36] the observations it receives and the reward it's
[00:59:37] reward it's got in general this is something you
[00:59:40] got in general this is something you could use to make decisions you could
[00:59:41] could use to make decisions you could just keep track of everything you've
[00:59:43] just keep track of everything you've experienced so far and then condition on
[00:59:45] experienced so far and then condition on that to try to make your next
[00:59:47] that to try to make your next decision but we often are going to
[00:59:49] decision but we often are going to assume that there's some sort of
[00:59:51] assume that there's some sort of sufficient statistic that we can use to
[00:59:53] sufficient statistic that we can use to summarize the history it will be much
[00:59:55] summarize the history it will be much more practic iCal in many cases yeah oh
[00:59:58] more practic iCal in many cases yeah oh sorry just make observation basically
[01:00:00] sorry just make observation basically like the
[01:00:02] hisory and what's your
[01:00:05] Um, so the observation in this case would be something like the immediate information you get back after the last action. So in the case of the student, it would have been whether they got the last problem correct or not, so just a single time step, and then the history would be everything up to this time point. Good question.
[01:00:24] So in particular, often to make things tractable, and because often in reality it's not a terrible assumption, we're normally going to make the Markov assumption. And the idea is that we're going to try to come up with some sort of informative information state that is a sufficient statistic of the history, so we don't have to keep around all of the prior history of everything the agent's ever done or seen or gotten reward for. And what we say is a state s_t is Markov if and only if the probability of going to the next state, given the current state and action, is the same as if you'd conditioned on the whole entire history. So another way to say this, which I think is kind of a nice evocative idea (this is not from me, it's from others), is that the future is independent of the past given the present. That means if you have a rich representation of your current state, you don't have to think about the previous history. And of course, in general this will be true if you make s_t equal to h_t, but we're often going to be thinking of projecting down to a much smaller state space. So for example, you might say, well, I could think about someone's blood pressure from all of time, but maybe it's sufficient just to think of their blood pressure over the last two hours in order to make my next decision.
[01:01:41] order to make my next decision yeah uhuh um is there a difference
[01:01:45] yeah uhuh um is there a difference between State and observation and dis
[01:01:49] between State and observation and dis great question yes in general so I'll
[01:01:50] great question yes in general so I'll give you a particular example um Atari
[01:01:52] give you a particular example um Atari um which is these video games that uh
[01:01:54] um which is these video games that uh deine learned an agent to play what
[01:01:56] deine learned an agent to play what their stayed in that case was the last
[01:01:58] their stayed in that case was the last four frames so not just the last frame
[01:02:01] four frames so not just the last frame the last four frames does may have an
[01:02:03] the last four frames does may have an idea why you might want four frames
[01:02:04] idea why you might want four frames instead of one yeah um maybe like uh so
[01:02:08] instead of one yeah um maybe like uh so you can see like if there's momentum to
[01:02:10] you can see like if there's momentum to an object already moving exactly it
[01:02:11] an object already moving exactly it gives you velocity and acceleration yeah
[01:02:14] gives you velocity and acceleration yeah so there are a number of cases where you
[01:02:16] so there are a number of cases where you might think that there are parts of the
[01:02:17] might think that there are parts of the state that really depend on temporal
[01:02:18] state that really depend on temporal differences and then in those cases
[01:02:20] differences and then in those cases you're going to want more than just the
[01:02:21] you're going to want more than just the immediate
[01:02:22] immediate State great questions
[01:02:26] All right, so why is this popular? It's used all the time. It's simple. It can often be satisfied, as we were just discussing, if you use some history as part of the state. Generally there are many cases where you can just use the most recent state, not always, but many cases. And it has huge implications for computational complexity, data required, and resulting performance. What I mean by the resulting performance is that in many of these cases, just like in a lot of statistics and machine learning, there will be trade-offs between bias and variance. And so there'll be a trade-off too between using states that are really small and easy for us to work with, but aren't really able to capture the complexity of the world and the applications we care about, so that it might be fast to learn with those sorts of representations but ultimately performance is poor. So there'll often be trade-offs between the expressive power of our representations versus how long it takes us to learn.
[01:03:20] Right, so one of the big questions when we talk about sequential decision-making processes is: is the state Markov, and is the world partially observable? So, partially ob... oh, yeah?
[01:03:34] Uh, my question is, doesn't the Markov assumption make this reward attribution problem somehow harder?
[01:03:42] Why? All right, good question. Well, I don't know, I guess you could imagine it might make it easier or harder. There's still the question that you might only get periodic rewards, and you still would have to figure out which decisions caused you to get to a state where you got those rewards.
[01:03:59] Yeah, so let me think of it. You might have a case where the reward might be a function of your current state. Um, yeah, let me think if I can think of a good example. Okay, so let's say maybe you want to run a marathon, and you get a plus 100 if you make it. The Boston Marathon is a competitive marathon to get into, so you get a plus 100 if you can qualify for Boston. And you do a lot of different things in your training regime: you eat healthy and you sleep and you train, and you get zero reward for any of that, and then on the day of your race you see if you qualify for Boston. So your reward for getting into Boston only depends on that current state, but you don't know which of those decisions, was it that you ate well, was it that you slept, was it that you trained every week for 17 weeks, caused you to get to the state in which you qualified for Boston. And so that's independent of the Markov assumption in that case, because you still have the question of what series of decisions allowed you to get to a state that achieved high reward. Great question.
[01:05:05] So another thing is whether the world is partially observable. We will mostly not be talking about this in this class; Mykel Kochenderfer has a great class where he talks about this a lot. But this does relate to the case we talked about with students. For students, one way you could think about that is that there's some latent state that you can't directly access, which is whether or not they know addition or they know subtraction, but you get noisy observations when they do problems, where they get it right or get it wrong. And the reason it's noisy is because, you know, all of us make mistakes on addition sometimes, whereas I have complete faith that everyone here actually knows how to do addition, and sometimes you might guess right even if you don't know it. So the idea is that it's latent: you don't directly get to observe it.
[01:05:46] This comes up in a lot of robotics problems too, so I'll just give a quick example here. If you have a robot that uses a laser rangefinder, these little arrows or lasers, to figure out its environment, it could have 180° of laser rangefinders, and what it's getting back is just the distance at all these different angles until it hits a wall. So as you can imagine, many rooms would look identical: any room that has kind of the same dimensions would look identical to that robot, and it wouldn't be able to tell whether it's on the third floor or the second floor. So that would be a partially observable case, where it can't uniquely identify its state based on this observation.
[01:06:26] So we won't talk too much about that, but it's important to know about. Another thing is whether the dynamics are deterministic or stochastic. There are many cases where things are close to deterministic: if I put down a piece on a Go board, it goes there. But there are other things that we often treat as stochastic: when I flip a coin, I don't know whether it's going to be heads or tails. So that'll be an important decision. And then the final thing is whether the actions influence only the immediate reward, or the reward and next state.
[01:06:57] So as an example of this, you might imagine you were making a policy for what ad to show to people, and just imagine for each person coming onto your website, you show an ad and then they go away, and they either buy something or they don't. A bandit would be a case where, bless you, you have a series of customers coming in, and so whether or not I showed a particular ad and he clicks on it or not does not impact whether or not Ellen comes along and is shown an ad. So that's a case where it impacts your immediate reward but not the next state. We can talk more about that.
[01:07:34] we can talk more about that all right let's think about a particular sort of
[01:07:36] let's think about a particular sort of running example we'll think of a Mars
[01:07:38] running example we'll think of a Mars Rover so Mars Rover is a markof decision
[01:07:40] Rover so Mars Rover is a markof decision process we imagine that Mars is really
[01:07:42] process we imagine that Mars is really small we only have seven places in Mars
[01:07:45] small we only have seven places in Mars so in this case we would have the state
[01:07:48] so in this case we would have the state is the location of the Rover which is
[01:07:50] is the location of the Rover which is one of seven discrete locations we could
[01:07:53] one of seven discrete locations we could have actions called try left and try
[01:07:54] have actions called try left and try right meaning that our Rover is not
[01:07:56] right meaning that our Rover is not perfect so sometimes it tries go a
[01:07:58] perfect so sometimes it tries go a direction and it doesn't succeed and
[01:08:00] direction and it doesn't succeed and let's imagine that we have rewards which
[01:08:02] let's imagine that we have rewards which is there's some interesting field sites
[01:08:05] is there's some interesting field sites and so if you spend time over here you
[01:08:07] and so if you spend time over here you get a plus one and you have spend over
[01:08:09] get a plus one and you have spend over here you get a plus 10 and else you get
[01:08:11] here you get a plus 10 and else you get zero
[01:08:12] zero reward so this would be a particular
[01:08:14] reward so this would be a particular case where we could think of there being
[01:08:15] case where we could think of there being these dates and these actions um and
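As a concrete sketch, the seven-state rover could be encoded like this in Python. The state names are illustrative; the placement of the +1 and +10 sites follows what the lecture says later (+1 in S1, +10 in S7):

```python
# Minimal sketch of the seven-state Mars rover example (illustrative).
states = [f"s{i}" for i in range(1, 8)]   # seven discrete locations
actions = ["try_left", "try_right"]       # attempted moves may fail

# Reward here depends only on the current state: +1 at s1, +10 at s7
# (the field-site placement used later in the lecture), 0 elsewhere.
def reward(state: str) -> float:
    return {"s1": 1.0, "s7": 10.0}.get(state, 0.0)
```

So, for example, `reward("s7")` gives 10.0 while every interior state gives 0.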
[01:08:21] So when we think of a Markov decision process, we think of there being a dynamics model and a reward model. In particular, the dynamics model is going to tell us how the state evolves as we make decisions. We will not always have direct access to this, but the idea is that in the world there is some dynamics process, and things are changing as we make decisions. So in particular, we generally want to allow for stochastic systems, meaning: given that we're currently in a state and we take a particular action, what is the distribution over next states that we might reach?
[01:08:56] So for example, I'm that Mars rover, and I'm going to try to go to the right. It might be that I can go to the right with, like, 50% probability, but I'm not a very accurate rover, and so 50% of the time I go to the left, or maybe I stay in the same location. So this dynamics model just specifies the distribution of outcomes that can happen in the world when I make a decision.
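A sketch of that kind of stochastic dynamics model and how you would sample from it. The exact 50/50 split and which transition counts as the "failure" are assumptions for illustration:

```python
import random

# P[(state, action)] -> list of (next_state, probability) pairs.
# Only the transition discussed is filled in; a full model would
# cover every (state, action) pair.
P = {
    ("s1", "try_right"): [("s2", 0.5), ("s1", 0.5)],  # succeed / stay put
}

def sample_next_state(state, action, rng=random):
    """Draw one next state from the dynamics distribution."""
    next_states, probs = zip(*P[(state, action)])
    return rng.choices(next_states, weights=probs, k=1)[0]
```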
[01:09:19] The reward model predicts the immediate reward: if I'm in this state and I take this action, what is my expected reward? I want to highlight here that there are different conventions: you could have the reward be a function only of the current state, excuse me, it could be a function of the state and the action you take, or it could be a function of the state, the action you take, and the next state you reach. You'll see all of these conventions in reinforcement learning papers; probably the most common one is this, but we'll try to be specific whenever we're using it so it's clear, and you can always ask me or ask any of the TAs if it's not clear. Bless you.
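The three conventions differ only in their signature. As an illustrative sketch, each can wrap the rover's state-only reward, so here all three agree by construction:

```python
def r_s(s):
    """R(s): reward depends only on the current state."""
    return {"s1": 1.0, "s7": 10.0}.get(s, 0.0)

def r_sa(s, a):
    """R(s, a): state and action; this rover's reward ignores a."""
    return r_s(s)

def r_sas(s, a, s_next):
    """R(s, a, s'): also conditions on the next state reached."""
    return r_s(s)
```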
[01:10:00] So let's think about sort of what a stochastic Mars rover model would be. I've written down a particular choice for the reward, and let's imagine that part of the dynamics model is the following: if I start in state S1 and I try to go to the right, then I have some probability of going to S2, and else I have some probability of staying here.
[01:10:23] What I want to be clear about here, and this relates to the question before about models, is that this is the agent's idea of how the world works. It doesn't have to be how the world actually works. What I told you in the previous slides is that, imagine that in this world, in reality, this gives you plus one and this gives you plus 10 in terms of the reward. That's how the world actually works, but the agent might have the wrong model of how the world works, because it only learns about the world through its experiences, or it just might have a bad model. So this is an example of sort of a model-based Markov system, where the agent would have a particular representation of the dynamics model and a particular assumption over how the rewards work.
[01:11:10] In these settings we have a policy. A decision policy is just going to be a mapping from states to actions; it's like an if-then table. If it's deterministic, we just have a single action that we would take in a particular state, like maybe we always show this one ad to a particular customer. Or we could have a stochastic policy, where we randomize this: this would be something like, oh, when this customer shows up, I show, you know, a vacation ad or a board game ad with 90% probability versus 10%. Both types of policies are really common, and it can depend in part on what sort of domain you're in and whether you're trying to learn from that experience.
[01:11:51] Okay, so let's see what that would look like in this case. So for the Mars rover, you could say that no matter where it is, it always just tries to go right. That would just be one example of a policy you could have, and it just requires you to specify, for every single state, what is the action you would take or what is the distribution over actions you would take.
[01:12:16] So in this sort of setting we're normally interested in two main... oh yeah, question?
[01:12:21] So it's making decisions based on the state that it's in. Can it learn to switch between different types of policies? So not just different actions based on the state, but also switch to checking the past state or the future state, the same way that in deep learning it tries a bunch of different functions. Can it do that or can it not?
[01:12:44] Remind me your name? Um, yeah, so, great question. In general, when we're learning, it'll change its policy a lot over time. So it might start with a particular policy, and then over time it will explore lots of different policies in trying to search for something that's good. That's a great question, and it relates to what I was just putting up here, which is two of the central questions we're going to talk a lot about, particularly at the beginning: evaluation and control. Evaluation says someone gives you a fixed policy and you want to know how good it is. Maybe your boss says, hey, I think this is the right way to advertise to customers and we're going to make a lot of money, and you go out and you just deploy that particular decision policy and you see how much money you make. So that would be evaluation. Control is you actually want to find the best policy, and in general, to actually find the best policy, we're going to have to do a lot of trial and error, and we want to do that in a strategic, efficient way so we can quickly learn what that good policy is.
[01:13:41] So in general, we're going to be talking about things, I just want to highlight, where we're going to sort of build up in complexity in terms of the type of problems we're talking about. So we're going to be thinking about both, like, planning and control, and sort of thinking about how complicated these spaces are.
[01:14:03] Okay, so we're going to think about evaluation and control, because evaluation is often a subpart of doing control: if you know how good a policy is, you may be able to improve it. And then we're going to talk about tabular and function approximation methods, because we're going to want to be able to solve really large problems. And then we're going to talk about both planning and learning. In planning, we're going to assume someone gives us that dynamics model and that reward model and the state and action space, and we're just going to try to find a really good policy. In learning, we're going to actually have to control the decisions we make to give us information that allows us to identify an optimal policy.
[01:14:44] All right, so we're going to start with sort of the simplest of these settings, where we're going to assume that we have a finite set of states and actions, and we're given models of the world, meaning someone, like, writes down for us what those look like, and we want to evaluate the performance of a decision policy and then compute the optimal policy. And we can think of this really as AI planning.
[01:15:06] Okay, so to think about how this works, we're going to start with Markov processes and then build up to MDPs, and this is relevant because it turns out you can think of evaluation as basically being a Markov reward process.
[01:15:20] being a Markoff reward process okay so how does a markof chain
[01:15:22] process okay so how does a markof chain work and just raise your hand if you've
[01:15:23] work and just raise your hand if you've seen markof chains before
[01:15:25] seen markof chains before awesome okay so most people have which
[01:15:27] awesome okay so most people have which is great so this is a memor random
[01:15:29] is great so this is a memor random process there's no rewards yet um
[01:15:32] process there's no rewards yet um there's a finite set of states in this
[01:15:33] there's a finite set of states in this case and we have a Dynamics model and if
[01:15:36] case and we have a Dynamics model and if it's just a finite set of States we can
[01:15:37] it's just a finite set of States we can just write this down as a
[01:15:39] just write this down as a matrix okay just says what's the
[01:15:42] matrix okay just says what's the probability going to the next state
[01:15:43] probability going to the next state given the previous state and so you
[01:15:46] given the previous state and so you could just have this say in our this
[01:15:48] could just have this say in our this would be a Markoff chain transition
[01:15:49] would be a Markoff chain transition Matrix for our Mars rer
[01:15:53] Matrix for our Mars rer case and if you want to to get an
[01:15:55] case and if you want to to get an episode you just sample so let's say
[01:15:56] episode you just sample so let's say you're you always touchdown as dat S4
[01:15:58] you're you always touchdown as dat S4 you just sample episodes from that
[01:16:00] you just sample episodes from that particular chain yeah um all rows and
[01:16:03] particular chain yeah um all rows and columns you add to one all of the um
[01:16:07] columns you add to one all of the um what's your name yeah so all of the rows
[01:16:11] what's your name yeah so all of the rows have to sum to
[01:16:16] Then is it a coincidence that the columns do? Yeah, okay, yeah, I was thinking just now that I should have changed that. It's a good question, and we'll see also why that's important later.
[01:16:28] Okay, a Markov reward process is a Markov chain plus rewards. So same as before, but now we have a reward function that tells us how good each of those states is, and we're also going to have a discount factor, and I'll talk about that in a second. We still have no actions, and we can express R as a vector. So here we could imagine our Markov reward process where we have a plus one in S1 and a 10 in S7, so plus one, plus 10, and zero in all other states.
[01:16:59] and zero in all other states okay in this case this is where
[01:17:02] states okay in this case this is where we start to see the ideas that are going
[01:17:04] we start to see the ideas that are going to be really useful for decision
[01:17:05] to be really useful for decision processes which is we can start to think
[01:17:06] processes which is we can start to think about how good particular trajectories
[01:17:08] about how good particular trajectories are so we're going to have a horizon and
[01:17:11] are so we're going to have a horizon and you're going to see this in your
[01:17:12] you're going to see this in your homework too which is the number of time
[01:17:13] homework too which is the number of time steps in each episode it could be
[01:17:15] steps in each episode it could be infinite or it could be finite it's like
[01:17:18] infinite or it could be finite it's like basically how many time steps do you get
[01:17:19] basically how many time steps do you get to make
[01:17:20] to make decisions and the return which we're
[01:17:23] decisions and the return which we're going to call GT is just going to be the
[01:17:26] going to call GT is just going to be the discounted sum of rewards from the time
[01:17:28] discounted sum of rewards from the current time step to the end of the
[01:17:30] current time step to the end of the horizon and a value function in this
[01:17:33] Horizon and a value function in this case is just going to be the expected
[01:17:36] case is just going to be the expected return in general this is not going to
[01:17:38] return in general this is not going to be the same as the actual return unless
[01:17:40] be the same as the actual return unless you just have a deterministic process
[01:17:43] you just have a deterministic process because the idea is that you're going to
[01:17:44] because the idea is that you're going to have stochasticity in the trajectories
[01:17:46] have stochasticity in the trajectories you reach and because of that you're
[01:17:48] you reach and because of that you're going to get different
[01:17:51] rewards so you might wonder if you
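A minimal sketch of the return as a discounted sum, plus a Monte Carlo check that averaging sampled returns estimates the value; the function names and the sampling setup here are my own, not the lecture's:

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Return G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... for one trajectory."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Deterministic check: rewards [1, 0, 10] with gamma = 0.5
# give 1 + 0 + 10 * 0.25 = 3.5.
print(discounted_return([1, 0, 10], 0.5))  # 3.5

def mc_value_estimate(P, R, gamma, start, horizon=200, n_episodes=2000, seed=0):
    """Estimate V(start) by averaging returns over sampled trajectories.
    Because transitions are stochastic, individual returns differ across
    trajectories; the value function is their expectation."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_episodes):
        s, g, disc = start, 0.0, 1.0
        for _ in range(horizon):
            g += disc * R[s]
            disc *= gamma
            s = rng.choice(len(R), p=P[s])  # sample the next state
        total += g
    return total / n_episodes
```

This makes the point in the lecture concrete: the value is generally not equal to any single sampled return unless the process is deterministic.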
[01:17:54] rewards so you might wonder if you haven't seen it before why do we have this
[01:17:55] haven't seen it before why do we have this discount factor thing so we're sort of
[01:17:58] discount factor thing so we're sort of weighing um earlier rewards more than
[01:18:00] weighing um earlier rewards more than later rewards one is that it's just
[01:18:02] later rewards one is that it's just mathematically really convenient it's
[01:18:04] mathematically really convenient it's going to help us not sum to Infinity
[01:18:05] going to help us not sum to infinity particularly if we have an infinite number
[01:18:07] particularly if we have an infinite number of time steps in which we can make decisions um
[01:18:10] of time steps we can make decisions um and it turns out humans often act as if
[01:18:12] and it turns out humans often act as if there is a discount Factor like often we
[01:18:14] there is a discount Factor like often we sort of implicitly weigh future rewards
[01:18:17] sort of implicitly weigh future rewards less than immediate rewards um and this
[01:18:20] less than immediate rewards um and this is true for organizations
[01:18:21] is true for organizations too and if the episode lengths are always
[01:18:24] too and if the episode lengths are always finite you can always use
[01:18:26] finite you can always use gamma equal one meaning you don't have
[01:18:28] gamma equal one meaning you don't have to make a large
[01:18:30] to make a large discount but when you have infinite
[01:18:32] discount but when you have infinite Horizons it's generally important to
[01:18:34] Horizons it's generally important to make this less than one so your rewards
[01:18:35] make this less than one so your rewards don't blow up part of that is because
[01:18:38] don't blow up part of that is because it's really hard to compare Infinities
[01:18:40] it's really hard to compare Infinities so it's hard to say like this reward
[01:18:41] so it's hard to say like this reward that this policy that has infinite
[01:18:43] that this policy that has infinite reward is better than this other policy
[01:18:44] reward is better than this other policy that has infinite reward whereas you can
[01:18:46] that has infinite reward whereas you can keep everything bounded if you have a
[01:18:48] keep everything bounded if you have a gamma less than
[01:18:49] gamma less than one all right next time we will start to
[01:18:53] one all right next time we will start to talk about how we actually can compute the
[01:18:55] talk about how we actually can compute the value of these types of Markov reward
[01:18:57] value of these types of Markov reward processes and then start to connect it
[01:18:58] processes and then start to connect it to decision processes I'll see you on
[01:19:00] to decision processes I'll see you on Wednesday thanks
Lecture 002
Stanford CS234 Reinforcement Learning I Tabular MDP Planning I 2024 I Lecture 2
Source: https://www.youtube.com/watch?v=gHdsUUGcBC0
---
Transcript
[00:00:05] hi everybody welcome back this is a
[00:00:07] hi everybody welcome back this is a lecture two from reinforcement learning
[00:00:09] lecture two from reinforcement learning um we're going to start with a refresh
[00:00:11] um we're going to start with a refresh your understanding again these are just
[00:00:13] your understanding again these are just a sort of a quick way to check your
[00:00:15] a sort of a quick way to check your conceptual understanding from the most
[00:00:17] conceptual understanding from the most recent lectures or occasionally we'll go
[00:00:19] recent lectures or occasionally we'll go back a little bit to do this you just
[00:00:21] back a little bit to do this you just need to log in to Ed everybody should be
[00:00:23] need to log in to Ed everybody should be added to Ed if you're not just send us
[00:00:25] added to Ed if you're not just send us an email to our mailing list um so if
[00:00:28] an email to our mailing list um so if you go to Ed please follow the steps
[00:00:29] you go to Ed please follow the steps given to log in first before you click
[00:00:31] given to log in first before you click the links so if you follow those steps
[00:00:33] the links so if you follow those steps and then you're logged in with your sun
[00:00:34] and then you're logged in with your sun ID then when you click on the poll links
[00:00:37] ID then when you click on the poll links it should just take you right there and
[00:00:38] it should just take you right there and it'll just log all your responses if
[00:00:41] it'll just log all your responses if you're curious about how we use these
[00:00:42] you're curious about how we use these for participation points um you can just
[00:00:44] for participation points um you can just go to the website to see how we
[00:00:45] go to the website to see how we calculate it I think it's um we use just
[00:00:48] calculate it I think it's um we use just a percentage of these if you do a
[00:00:50] a percentage of these if you do a sufficient percentage then you get full
[00:00:52] sufficient percentage then you get full participation points it's
[00:00:56] optional all right so we're going to
[00:00:57] optional all right so we're going to start with this today the question is in
[00:00:59] start with this today the question is in a Markov decision process a large
[00:01:00] a Markov decision process a large discount factor gamma means that
[00:01:02] discount Factor gamma means that short-term rewards are much more
[00:01:03] short-term rewards are much more influential than long-term rewards um
[00:01:06] influential than long-term rewards um and then a second question to start
[00:01:07] and then a second question to start thinking about is in general so last
[00:01:10] thinking about is in general so last time we started talking about sequential
[00:01:12] time we started talking about sequential decision-making under uncertainty and
[00:01:14] decision- making under uncertainty and one of the things we often would like in
[00:01:16] one of the things we often would like in Real World Systems is monotonic
[00:01:18] Real World Systems is monotonic Improvement meaning that if we get more
[00:01:20] Improvement meaning that if we get more data or we get more computation we know
[00:01:23] data or we get more computation we know that the system is going to be better
[00:01:25] that the system is going to be better and make in our case better decisions than
[00:01:27] and make in our case better decisions than it could if it had less computation or
[00:01:29] it could if it had less computation or less data
[00:01:30] less data and so the question um that I'm posing
[00:01:33] and so the question um that I'm posing to you now and that we're going to
[00:01:34] to you now and that we're going to discuss today is is it possible to
[00:01:36] discuss today is is it possible to construct algorithms for computing
[00:01:38] construct algorithms for computing decision policies so that we can
[00:01:40] decision policies so that we can guarantee with additional computation
[00:01:42] guarantee with additional computation which we could also think of often as
[00:01:43] which we could also think of often as iteration um that we're going to
[00:01:45] iteration um that we're going to monotonically improve the decision
[00:01:48] monotonically improve the decision policy and you can start to think about
[00:01:50] policy and you can start to think about if you're already aware of any
[00:01:51] if you're already aware of any algorithms that might have that property
[00:01:53] algorithms that might have that property if you think it's impossible or if you
[00:01:54] if you think it's impossible or if you think if it's true do you think that all
[00:01:56] think if it's true do you think that all algorithms would satisfy that okay
[00:01:59] algorithms would satisfy that okay that's not for the poll that's just to
[00:02:00] that's not for the poll that's just to start thinking about and we'll come back
[00:02:02] start thinking about and we'll come back to it
[00:02:04] to it later all right so I'll just give you
[00:02:06] later all right so I'll just give you another minute or two to do this refresh
[00:02:08] another minute or two to do this refresh your understanding it's just a quick one
[00:02:10] your understanding it's just a quick one and then we'll
[00:02:24] go and again these are not assessment
[00:02:26] go and again these are not assessment questions so you're welcome to look back
[00:02:27] questions so you're welcome to look back on lecture slides from last time you're
[00:02:30] on lecture slides from last time you're also welcome to talk to anybody right
[00:02:31] also welcome to talk to anybody right next to you
[00:03:08] all right it looks like we actually have
[00:03:09] all right it looks like we actually have um maybe a two-thirds one-third split on this question
[00:03:14] um maybe a two-thirds one-third split on this question um the correct answer is false does
[00:03:16] um the correct answer is false does somebody want to say why it's
[00:03:20] false yeah and remind me your
[00:03:23] false yeah and remind me your name yeah I think because you multiply
[00:03:25] name yeah I think because you multiply the longer term rewards by uh the gamma
[00:03:28] the longer term rewards by uh the gamma so a large gamma means that the long
[00:03:30] so a large gamma means that the long-term rewards are weighted decently that's right
[00:03:32] long-term rewards are weighted decently that's right so as you exactly said if gamma
[00:03:35] so as you exactly said if gamma was one you would care about short-term
[00:03:36] was one you would care about short-term rewards exactly the same as long-term
[00:03:38] rewards exactly the same as long-term rewards in general if gamma was Zero you
[00:03:40] rewards in general if gamma was Zero you wouldn't care about long-term rewards at
[00:03:42] wouldn't care about long-term rewards at all you'd be entirely myopic but as
[00:03:44] all you'd be entirely myopic but as gamma gets closer to one it's sort of a
[00:03:46] gamma gets closer to one you're sort of relatively weighting longer-term
[00:03:48] relatively weighting longer-term rewards more than you would
[00:03:51] otherwise
[00:03:54] otherwise great all right and yes as I said H well
[00:03:56] great all right and yes as I said H well we'll get more into the conceptual
[00:03:58] we'll get more into the conceptual question later um the other thing that I
[00:04:00] question later um the other thing that I wanted to clarify I saw there was some
[00:04:02] wanted to clarify I saw there was some questions on this last time as well as
[00:04:04] questions on this last time as well as after class as well as on Ed is sort of
[00:04:06] after class as well as on Ed is sort of I had mentioned when I was uh making
[00:04:08] I had mentioned when I was uh making a distinction between reinforcement
[00:04:10] a distinction between reinforcement learning and other forms of AI machine
[00:04:12] learning and other forms of AI machine learning this notion of optimization but
[00:04:15] learning this notion of optimization but I think that that was a little bit un
[00:04:17] I think that that was a little bit un like it was more confusing than it was
[00:04:19] like it was more confusing than it was helpful um and because depending on how
[00:04:21] helpful um and because depending on how you think of it in machine learning and AI
[00:04:24] you think of it in machine learning and AI we always have some form of metric or
[00:04:26] we always have some form of metric or optimization so you can think of a loss
[00:04:28] optimization so you can think of a loss as also being we're trying to to
[00:04:29] as also being we're trying to to minimize the loss and so that also
[00:04:31] minimize the loss and so that also sounds like an optimization problem so
[00:04:33] sounds like an optimization problem so you can just ignore that distinction for
[00:04:35] you can just ignore that distinction for now I do think in general when we're
[00:04:37] now I do think in general when we're thinking about decision making it's
[00:04:39] thinking about decision making it's going to be very important what we think
[00:04:41] going to be very important what we think of as that metric and so it won't
[00:04:43] of as that metric and so it won't necessarily just be loss functions we
[00:04:44] necessarily just be loss functions we can have lots of different scalar values
[00:04:46] can have lots of different scalar values or even multiple objectives but the
[00:04:49] or even multiple objectives but the distinction of whether or not supervised
[00:04:50] distinction of whether or not supervised learning is using optimization is
[00:04:53] learning is using optimization is perhaps not so
[00:04:56] helpful okay great so let's go ahead and
[00:04:58] helpful okay great so let's go ahead and get started um so I do also just want to
[00:05:01] get started um so I do also just want to highlight that for some of you um and I
[00:05:04] highlight that for some of you um and I got a question about this we've also got
[00:05:05] got a question about this we've also got a couple questions about this this this
[00:05:07] a couple questions about this this this first week or two will overlap a little
[00:05:09] first week or two will overlap a little bit with some of the other classes you
[00:05:10] bit with some of the other classes you might have taken so particularly if
[00:05:12] might have taken so particularly if you've taken CS238 with Mykel Kochenderfer the
[00:05:14] you've taken CS238 with Mykel Kochenderfer the beginning may overlap the things that
[00:05:16] beginning May overlap the things that will probably still be different in the
[00:05:18] will probably still be different in the first couple weeks is I expect there's
[00:05:20] first couple weeks is I expect there's going to be um a higher level of theory
[00:05:22] going to be um a higher level of theory in the first week or two about the
[00:05:24] in the first week or two about the properties of some of these algorithms
[00:05:26] properties of some of these algorithms and what sort of guarantees we have and
[00:05:29] and what sort of guarantees we have and then after I suspect after that most of
[00:05:31] then after I suspect after that most of the content in the rest of the class
[00:05:32] the content in the rest of the class will be quite different if you have any
[00:05:34] will be quite different if you have any questions about how this compares to a
[00:05:36] questions about how this compares to a lot of the other decision-making classes
[00:05:38] lot of the other decision-making classes that are offered at Stanford don't
[00:05:39] that are offered at Stanford don't hesitate to reach out to me in office
[00:05:41] hesitate to reach out to me in office hours on Ed or after
[00:05:43] hours on Ed or after class all right now why do we do this
[00:05:46] class all right now why do we do this because also you might be thinking we
[00:05:47] because also you might be thinking we want to get to alphago or I want to get
[00:05:49] want to get to alphago or I want to get to controlling robots or I want to get
[00:05:51] to controlling robots or I want to get to optimizing llms why are we starting
[00:05:53] to optimizing llms why are we starting with systems like the seven State Mars
[00:05:55] with systems like the seven State Mars rover that we're going to look at and
[00:05:57] rover that we're going to look at and the reason is because actually a lot of
[00:05:59] the reason is because actually a lot of the ideas um that enabled people to
[00:06:02] the ideas um that enabled people to solve AlphaGo um and do things like RLHF
[00:06:05] solve AlphaGo um and do things like RLHF or reinforcement learning from human
[00:06:07] or reinforcement learning from human feedback really build on this
[00:06:09] feedback really Builds on this fundamental notion of decision processes
[00:06:11] fundamental notion of decision processes and I think it's much easier to really
[00:06:13] and I think it's much easier to really cleanly see how these ideas come up
[00:06:15] cleanly see how these ideas come up when you can actually see
[00:06:17] when you can actually see these when the world is tabular you
[00:06:19] these in the the world is tabular you can just write down all the states so
[00:06:21] can just write down all the states so that's why I think it's helpful but even
[00:06:23] that's why I think it's helpful but even today we're going to start to see where
[00:06:24] today we're going to start to see where those ideas might be applied so we're
[00:06:26] those ideas might be applied so we're going to start to see things like policy
[00:06:27] going to start to see things like policy search which is sort of the foundations
[00:06:29] search which is sort of the foundations towards things like policy gradients
[00:06:31] towards things like policy gradients which are extremely widely used so you
[00:06:33] which are extremely widely used so you can think of all of these as just being
[00:06:35] can think of all of these as just being building blocks that we're going to use
[00:06:36] building blocks that we're going to use to build up to get to the point we're
[00:06:37] to build up to get to the point we're later going to be and very soon within a
[00:06:39] later going to be and very soon within a couple weeks tackling things that are
[00:06:41] couple weeks tackling things that are state-of-the-art
[00:06:42] state-of-the-art algorithms all right so what we're going
[00:06:44] algorithms all right so what we're going to be doing today is really focusing on
[00:06:46] to be doing today is really focusing on making good decisions given a Markov
[00:06:48] making good decisions given a Markov decision process and so that means both
[00:06:50] decision process and so that means both being enough to understand how good a
[00:06:52] being enough to understand how good a particular decision policy is as well as
[00:06:54] particular decision policy is as well as what is an optimal decision
[00:06:57] what is an optimal decision policy and
[00:07:00] policy and when I say we're given a model of the
[00:07:01] when I say we're given a model of the world what I mean is that we are given
[00:07:03] world what I mean is that we are given sort of that Dynamics model which tells
[00:07:06] sort of that Dynamics model which tells us how the world evolves when we make
[00:07:11] decisions and we are given a reward
[00:07:13] decisions and we are given a reward model which tells us how good decisions
[00:07:16] model which tells us how good decisions are and last time we talked about
[00:07:19] are and last time we talked about Markov processes and we were starting
[00:07:21] Markov processes and we were starting to talk about Markov reward processes
[00:07:23] to talk about Markov reward processes because they can end up being really
[00:07:25] because they can end up being really useful when we're trying to evaluate how
[00:07:27] useful when we're trying to evaluate how good a particular decision uh policy is
[00:07:30] good a particular decision uh policy is and we'll see a lot of the same ideas
[00:07:32] and we'll see a lot of the same ideas from Markov reward processes to MDPs
[00:07:36] from Markov reward processes to MDPs okay all right so let's just refresh our
[00:07:38] okay all right so let's just refresh our memory so as this is the question that
[00:07:40] memory so as this is the question that we had before of like you know how do we
[00:07:42] we had before of like you know how do we think of uh the influence of discount
[00:07:44] think of uh the influence of discount factors as was said what happens is we
[00:07:47] factors as was said what happens is we multiply the next reward by the discount
[00:07:49] multiply the next reward by the discount Factor two rewards Away by the discount
[00:07:52] Factor two rewards Away by the discount Factor squared Etc and so as you can see
[00:07:55] Factor squared Etc and so as you can see there if the Horizon is really long or
[00:07:57] there if the Horizon is really long or as it goes to Infinity rewards will have
[00:07:59] as it goes to Infinity rewards will have zero value eventually because gamma is
[00:08:01] zer value eventually because gamma is less than
[00:08:02] less than one so the idea of the value function
[00:08:05] one so the idea of the value function was to say remember this is a Markov
[00:08:07] was to say remember this is a Markov reward process we don't have decisions
[00:08:08] reward process we don't have decisions yet it just says how much is the
[00:08:11] yet it just says how much is the expected discounted sum of rewards we
[00:08:13] expected discounted sum of rewards we will get starting in this state and acting
[00:08:16] will get starting this state and acting most of the time today forever so most
[00:08:18] most of the time today forever so most of the time we'll think of today of just
[00:08:20] of the time we'll think of today of just like getting to act forever and how much
[00:08:21] like getting to act forever and how much reward would you get and because this
[00:08:23] reward would you get and because this gamma if as long as the gamma is less
[00:08:25] gamma if as long as the gamma is less than one here that will be a finite
[00:08:27] than one here that will be a finite number
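The "finite number" claim is just the geometric series: if every reward is bounded by some r_max, the infinite-horizon discounted sum is bounded by r_max / (1 - gamma). A quick numeric sketch (the particular gamma and r_max values are illustrative):

```python
# If |r_t| <= r_max for every time step, then
#   |sum_{t=0}^inf gamma^t r_t| <= r_max * sum_t gamma^t = r_max / (1 - gamma).
gamma, r_max = 0.9, 10.0

bound = r_max / (1 - gamma)  # = 100.0

# Worst case: receive r_max at every step; partial sums approach the bound.
partial = sum(r_max * gamma ** t for t in range(1000))
print(bound, partial)  # both close to 100

assert partial <= bound + 1e-9  # small slack for floating-point rounding
```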
[00:08:29] number so we're just starting to talk about how
[00:08:31] so we're just starting to talk about how could we compute this so again remember
[00:08:33] could we compute this so again remember the return is going to be a particular
[00:08:36] the return is going to be a particular series of rewards you might get if you
[00:08:37] series of rewards you might get if you start in this state and act forever and V
[00:08:40] start in this state and act forever and V is going to be on average how much
[00:08:41] is going to be on average how much reward would you get if you start in
[00:08:43] reward would you get if you start in this state and act
[00:08:45] this state and act forever all right so one of the key
[00:08:48] forever all right so one of the key ideas here is that Computing the value
[00:08:50] ideas here is that computing the value of an infinite horizon Markov reward
[00:08:53] of an infinite horizon Markov reward process leverages the Markov property
[00:08:56] process leverages the Markov property which was this idea that the future is
[00:08:58] which was this idea that the future is independent of the past given the
[00:09:00] independent of the past given the present so given your current state you
[00:09:02] present so given your current state you don't have to think more about the the
[00:09:04] don't have to think more about the history so what that implies when we try
[00:09:06] history so what that implies when we try to compute what the expected
[00:09:08] to compute what the expected future reward is from a state is we can
[00:09:10] future reward is from a state is we can think of well what is the immediate
[00:09:11] think of well what is the immediate reward we get in that state plus all the
[00:09:14] reward we get in that state plus all the different states we could get to next
[00:09:16] different states we could get to next under our Dynamics model and then the
[00:09:18] under our Dynamics model and then the value of their
[00:09:20] value of their reward how much do we weigh each of
[00:09:22] reward how much do we weigh each of those well we weigh each of those just
[00:09:23] those well we weigh each of those just according to what is the probability I
[00:09:25] according to what is the probability I could get to each of those next
[00:09:27] could get to each of those next States and if you're familiar things
[00:09:29] States and if you're familiar things like tree search you can think of it as
[00:09:31] like tree search you can think of it as just I'm in my starting State I think of
[00:09:32] just I'm in my starting State I think of all the next States I could go to each
[00:09:34] all the next States I could go to each of them have some weight depending on
[00:09:35] of them have some weight depending on the probability I'd get there and then I
[00:09:38] the probability I'd get there and then I sum all of those up according to their
[00:09:42] values okay and this is going to be the
[00:09:44] values okay and this is going to be the basis of the Bellman equation which
[00:09:45] basis of the Bellman equation which we're going to see lots
[00:09:47] we're going to see lots about okay so if we wanted to think
[00:09:50] about okay so if we wanted to think about how we could solve this one way we
[00:09:53] about how we could solve this one way we could think of it is if we have a
[00:09:55] could think of it is if we have a tabular world meaning that we can write
[00:09:57] tabular world meaning that we can maintain a uh scalar value for
[00:10:00] we can maintain a uh scalar value for every single state separately so this is
[00:10:02] every single state separately so this is like our Mars Rover case um then we
[00:10:04] like our Mars Rover case um then we could just Express the value function in
[00:10:06] could just Express the value function in a matrix equation so we say the value of
[00:10:09] a matrix equation so we say the value of each of the states is exactly equal to
[00:10:10] each of the states is exactly equal to the immediate reward plus gamma times
[00:10:13] the immediate reward plus gamma times the um transition probability to all the
[00:10:15] the um transition probability to all the next
[00:10:18] next States and so that's nice because now we
[00:10:20] States and so that's nice because now we can just directly solve for what the
[00:10:22] can just directly solve for what the value is so we know that this has to
[00:10:25] value is so we know that this has to hold so now what we're going to do is
[00:10:27] hold so now what we're going to do is just invert that to solve for V
[00:10:30] just invert that to solve for V so what we would say in this case is we
[00:10:32] so what we would say in this case is we would say V minus gamma times P V this is
[00:10:37] would say V minus gamma times P V this is P and again I'll apologize that in the
[00:10:40] p and again I'll apologize that in the different uh things you see online or
[00:10:43] different uh things you see online or the textbook Etc people sometimes use T
[00:10:45] the textbook Etc people sometimes use T for transition Matrix they sometimes use
[00:10:47] for transition Matrix they sometimes use P for like probabilities going to the
[00:10:49] P for like probabilities going to the next state um if it's ever confusing
[00:10:51] next state um if it's ever confusing what notation being used don't hesitate
[00:10:52] what notation being used don't hesitate to reach out okay so we just rewrite it
[00:10:55] to reach out okay so we just rewrite it like this is equal to
[00:10:57] like this is equal to R and then we move this so we have (I minus gamma P) times V is equal to R where I equals the
[00:11:05] identity matrix which means V is equal to (I minus gamma P) inverse times
[00:11:18] R so why do I show this I show this
[00:11:21] R so why do I show this I show this because if you know how the world works
[00:11:22] because if you know how the world works you have the Dynamics model you know
[00:11:24] you have the Dynamics model you know what the reward function is and the
[00:11:26] what the reward function is and the world is small enough you can just
[00:11:28] world is small enough you can just directly solve for this
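A sketch of that direct solution with NumPy; `np.linalg.solve` is used rather than forming the inverse explicitly (same answer, numerically preferable), and the 7-state chain below reuses illustrative stay/left/right transition probabilities rather than the lecture's exact numbers:

```python
import numpy as np

n = 7
# Illustrative row-stochastic dynamics for the 7-state chain.
P = np.zeros((n, n))
for i in range(n):
    P[i, i] = 0.5
    P[i, max(i - 1, 0)] += 0.25
    P[i, min(i + 1, n - 1)] += 0.25

R = np.zeros(n)
R[0], R[6] = 1.0, 10.0
gamma = 0.9

# Solve (I - gamma P) V = R. For gamma < 1 and row-stochastic P the matrix
# (I - gamma P) is always invertible, since the spectral radius of gamma P
# is at most gamma < 1.
V = np.linalg.solve(np.eye(n) - gamma * P, R)

# Sanity check: V satisfies the Bellman equation V = R + gamma P V.
assert np.allclose(V, R + gamma * P @ V)
print(V)
```

This answers the invertibility question that comes up next: with gamma strictly less than one, the inverse always exists, though computing it is still cubic in the number of states.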
[00:11:30] directly solve for this this isn't for decision yet this is just
[00:11:31] this isn't for decisions yet this is just showing us like what the value
[00:11:34] showing us like what the value would be of each of the states so this
[00:11:37] would be of each of the states so this is one way to solve it we would call
[00:11:38] is one way to solve it we would call this like sort of the analytic
[00:11:41] this like sort of the analytic solution and one thing to
[00:11:44] solution and one thing to note here is this requires a matrix
[00:11:45] note here is this requires a matrix inverse and so there are faster
[00:11:47] inverse and so there are faster algorithms than n cubed n being the
[00:11:49] algorithms than n cubed n being the number of states but you know in general
[00:11:51] number of states but you know in general Matrix inverses are fairly expensive so
[00:11:53] matrix inverses are fairly expensive so this only has to be done once but
[00:11:55] this only has to be done once but if your state space the
[00:11:58] if your state space the number of states you have is large this
[00:11:59] number of states you have is large this can be expensive and it also requires
[00:12:02] can be expensive and it also requires that um the identity Matrix minus gamma
[00:12:05] that um the identity matrix minus gamma times uh the dynamics model is
[00:12:09] times uh the dynamics model is invertible okay so this is one way we
[00:12:11] invertible okay so this is one way we can solve this yeah and remind me your
[00:12:12] can solve this yeah and remind me your name um in practice what usually happens
[00:12:16] name um in practice what usually happens like do people just go ahead and take
[00:12:19] like do people just go ahead and take the Matrix
[00:12:20] the matrix inverse let me repeat the questions in practice
[00:12:23] inverse let me repeat the questions in practice do you usually find that these kinds
[00:12:25] do you usually find that these kinds of matrices are invertible and if yes do
[00:12:28] of matrices are invertible and if yes do people just go ahead and invert the matrix or
[00:12:31] people just go ahead and invert the matrix or it's a good question so in practice is
[00:12:33] it's a good question so in practice is it invertible and what do people do in
[00:12:35] it invertible and what do people do in practice normally we're dealing with
[00:12:36] practice normally we're dealing with State spaces that are far too large so
[00:12:38] State spaces that are far too large so we can't do this yeah good question
[00:12:42] we can't do this yeah good question there might be cases where it's small
[00:12:43] there might be cases where it's small enough but in general no so that's a
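The analytic solution described above can be sketched in a few lines of NumPy. This is an illustrative sketch with a made-up two-state chain, not code from the lecture:

```python
import numpy as np

def mrp_value_analytic(P, R, gamma):
    """Solve the MRP value function exactly: V = (I - gamma P)^{-1} R.

    P is the (n, n) transition matrix with P[s, s'] = prob(s -> s'),
    R is the (n,) reward vector, gamma the discount factor.
    """
    n = P.shape[0]
    # Solving the linear system is preferred to forming the inverse
    # explicitly, but the cost is still cubic in the number of states.
    return np.linalg.solve(np.eye(n) - gamma * P, R)

# Made-up two-state chain: from either state, go to each state with prob 0.5.
P = np.array([[0.5, 0.5],
              [0.5, 0.5]])
R = np.array([0.0, 1.0])
V = mrp_value_analytic(P, R, gamma=0.9)  # V satisfies V = R + 0.9 * P @ V
```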
[00:12:46] enough but in general no so that's a great motivation for a second approach
[00:12:48] great motivation for a second approach which is instead of doing it directly
[00:12:49] which is instead of doing it directly analytically we're going to use dynamic
[00:12:51] analytically we're going to use dynamic programming and we're going to design an
[00:12:52] programming and we're going to design an iterative algorithm okay and this is
[00:12:55] iterative algorithm okay and this is going to be very very similar to what
[00:12:56] going to be very very similar to what we're going to see for decision
[00:12:57] we're going to see for decision processes so the idea in this case is
[00:12:59] processes so the idea in this case is we're not going to do this in one step
[00:13:01] we're not going to do this in one step but we're going to avoid that Matrix
[00:13:02] but we're going to avoid that Matrix inverse which might be pretty expensive
[00:13:04] inverse which might be pretty expensive so we're going to initialize the value
[00:13:06] so we're going to initialize the value of a state to zero for all s and you can
[00:13:09] think about whether or not it actually matters what we initialize to, but just
[00:13:12] matters what we initialize to but just imagine we do that and then for a series
[00:13:15] imagine we do that and then for a series of iterations K is our iteration
[00:13:17] of iterations K is our iteration variable for all the states in s what we
[00:13:19] variable for all the states in s what we do is we say we're going to sort of make
[00:13:21] do is we say we're going to sort of make a new copy of our value function and we
[00:13:24] say V k of s is equal to R of s plus gamma times the sum over s prime of the probability of going to s prime given s, times the value that we already have, V k minus 1, of s prime. And we just do this over and over and over again until our value function stops changing, and we'll talk soon about whether it will stop changing.
[00:13:45] The nice thing about this is that it's only S squared for each iteration, so this would be S squared per iteration instead of a matrix inverse.
[00:14:00] all right so this is how you could
[00:14:01] all right so this is how you could compute the value of an MRP now we're
[00:14:03] compute the value of an MRP now we're going to see how we could do that for an
[00:14:05] mdp. So a Markov decision process is very similar to a Markov reward process, but now we get to add in actions, so now we're actually going to be starting to make decisions. And the idea now is that the dynamics transition model will probably depend on the action you take, so you're going to get different distributions over next states. So it could be something like, you know, depending on the ad you show a customer, they might do different things, or depending on the controls of your robot, it's going to move or manipulate its hand in a different way. In general these dynamics are going to be a function of the action, and we are going to, for right now, assume the reward is a function of the state and the action you take.
[00:14:43] So you often say that an MDP is defined by a tuple: S, A, dynamics model, reward model, and gamma.
[00:14:53] Okay, so we could think of that here. So now we have our same little Mars rover, but now we actually have two different dynamics models: one for if we take a1 and one for if we take a2. This is just an example; in these cases these are deterministic, but in general we can have them be stochastic. And we would also need to specify what the reward is, so maybe we have zero reward in all of these states, plus one here, and plus ten at the end.
[00:15:17] So once you've defined the state space, the action space, the reward function, the dynamics model, and gamma, then you've defined your MDP. Okay, all right, so now we get to
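Writing that tuple down concretely, here is a sketch of a Mars-rover-style MDP as NumPy arrays. The seven-state chain and the exact reward placement are assumptions for illustration based on the description, not the official course figures:

```python
import numpy as np

n_states, n_actions = 7, 2   # s1..s7 in a line; action 0 = try-left, 1 = try-right

# Dynamics model P[a, s, s']: deterministic here, stochastic in general.
P = np.zeros((n_actions, n_states, n_states))
for s in range(n_states):
    P[0, s, max(s - 1, 0)] = 1.0              # try-left: move left, or stay at the edge
    P[1, s, min(s + 1, n_states - 1)] = 1.0   # try-right: move right, or stay at the edge

# Reward model R[s, a]: +1 in the leftmost state, +10 in the rightmost, 0 elsewhere.
R = np.zeros((n_states, n_actions))
R[0, :] = 1.0
R[-1, :] = 10.0

gamma = 0.5
mdp = (n_states, n_actions, P, R, gamma)      # the (S, A, P, R, gamma) tuple
```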
[00:15:29] start to think about policies, which is what we'll be talking about throughout the course: how do we make decisions depending on the state we're in. The policy is going to specify the action to take, which can be deterministic or stochastic, and often we're going to think of it as being stochastic.
[00:15:50] We'll talk about the properties of stochastic versus deterministic ones, and why you might want one or the other, quite a bit in the class, but we can generally do everything we're doing in either case.
[00:16:01] all right so an mdp plus a policy is
[00:16:03] all right, so an MDP plus a policy is just a Markov reward process. Why is that? Because once you
[00:16:08] process why is that because once you specify how you're going to act you've
[00:16:11] specify how you're going to act you've sort of removed the policy part and so
[00:16:12] sort of removed the policy part and so if you want to know how good that policy
[00:16:14] if you want to know how good that policy is so let's say someone says again your
[00:16:17] is so let's say someone says again your boss says hey how good is this thing at
[00:16:18] boss says hey how good is this thing at you know like advertising to customers
[00:16:20] you know like advertising to customers for example then once you've decided
[00:16:23] what the policy is, we can think of the reward as just being a weighted sum, over the probability of taking each action in that state, of the reward for that state and action. And then your dynamics model, which is a little more subtle: now you're taking a weighted sum over all of the transition dynamics according to the action you take, weighted by the probability you take that action.
[00:16:46] So it just defines a Markov reward process, because now you just have this sort of transformed dynamics model where we've merged in the policy. Okay, so why
[00:16:59] we've merged in the policy Okay so why is this helpful and this is something
[00:17:00] is this helpful and this is something that you may or may not have seen in
[00:17:01] that you may or may not have seen in previous classes one of the reasons why
[00:17:02] previous classes one of the reasons why this is helpful is because now we can
[00:17:04] this is helpful is because now we can just say oh any techniques we have for
[00:17:05] Markov reward processes we could also apply to evaluating the value of a particular policy in a Markov decision process, because we've just reduced MDP policy evaluation back to an
[00:17:21] MRP all right
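The reduction just described is mechanical: R^pi(s) = sum_a pi(a|s) R(s,a), and P^pi(s'|s) = sum_a pi(a|s) P(s'|s,a). An illustrative sketch, where the array shapes are my own convention rather than anything from the slides:

```python
import numpy as np

def mdp_with_policy_to_mrp(P, R, pi):
    """Merge a stochastic policy into an MDP, yielding an MRP.

    P:  (n_actions, n_states, n_states) dynamics, P[a, s, s']
    R:  (n_states, n_actions) reward model
    pi: (n_states, n_actions) policy, pi[s, a] = prob of action a in state s
    """
    R_pi = np.sum(pi * R, axis=1)          # R^pi(s)    = sum_a pi(a|s) R(s, a)
    P_pi = np.einsum('sa,ast->st', pi, P)  # P^pi(s'|s) = sum_a pi(a|s) P(s'|s, a)
    return P_pi, R_pi

# Tiny made-up example: action 0 always moves to state 0, action 1 to state 1;
# state 1 pays reward 1. A uniform-random policy averages the two.
P = np.array([[[1.0, 0.0], [1.0, 0.0]],
              [[0.0, 1.0], [0.0, 1.0]]])
R = np.array([[0.0, 0.0], [1.0, 1.0]])
pi = np.full((2, 2), 0.5)
P_pi, R_pi = mdp_with_policy_to_mrp(P, R, pi)
```

Any MRP solver, like the analytic or iterative ones earlier, can now be run on (P_pi, R_pi) to evaluate the policy.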
[00:17:23] MRP all right so if we think about doing policy
[00:17:25] so if we think about doing policy evaluation with an
[00:17:27] evaluation with an mdp um we can just plug in the actual
[00:17:31] policy that we would be using. Okay, so what we have in this case is that
[00:17:38] so what we have in this case is that instead of now we actually get to make
[00:17:40] instead of now we actually get to make decisions and so then we get to say what
[00:17:42] decisions and so then we get to say what is the probability of taking the action
[00:17:43] is the probability of taking the action in this state times the expected
[00:17:46] in this state times the expected discounted sum of rewards at that
[00:17:49] discounted sum of rewards at that point so this looks very similar to an
[00:17:52] point so this looks very similar to an MRP except for we're saying based on the
[00:17:54] MRP except for we're saying based on the probability for each action what would
[00:17:56] probability for each action what would we get
[00:17:57] we get next and we call this a Bellman backup
[00:18:00] next and we call this a Bellman backup for a particular
[00:18:02] for a particular policy because this is um going to
[00:18:04] policy because this is um going to specify what is our expected discounted
[00:18:06] sum of future rewards if we start in this state and follow the policy. And just notice that if the
[00:18:14] policy and just notice that if the policy is actually deterministic we can
[00:18:16] policy is actually deterministic we can reduce it back to a case where we've
[00:18:18] reduce it back to a case where we've sort of averaged
[00:18:20] sort of averaged over these rewards so remember this was
[00:18:23] over these rewards so remember this was just going to be
[00:18:26] just going to be um if you have a particular action and
[00:18:28] um if you have a particular action and then you're just going to index into
[00:18:30] then you're just going to index into what the reward is for that particular
[00:18:31] what the reward is for that particular action so we can see that
[00:18:33] action so we can see that here okay and just raise your hand if
[00:18:36] here okay and just raise your hand if you've seen this before like if you've
[00:18:38] seen this before. Okay, good, so probably at least two-thirds of people. All
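In standard notation (my transcription, not copied from the slide), the Bellman backup for a particular policy described above is:

```latex
V^{\pi}(s) \;=\; \sum_{a} \pi(a \mid s)\Big[\, R(s,a) \;+\; \gamma \sum_{s'} P(s' \mid s, a)\, V^{\pi}(s') \,\Big]
```

and when the policy is deterministic, the sum over actions collapses to the single action a = pi(s), so you just index into the reward and dynamics for that action.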
[00:18:46] right okay so if you want to check your
[00:18:48] answers: if some of this is new for you, then one thing to do is to try to check that you can do this sort of value iteration, or this policy evaluation, for the Mars rover example. We won't go through it in class, but you can check the answers; I'll release them at the end of the slides, just to check that you know how to apply
[00:19:05] check that you know how to apply this all right so of course shortly
[00:19:08] this all right so of course shortly we're going to be interested in not just
[00:19:09] evaluating the value of a single policy, but finding an optimal
[00:19:12] but finding an optimal policy so one question is how many
[00:19:14] policy so one question is how many policies are there and is the optimal
[00:19:17] policies are there and is the optimal policy value
[00:19:20] policy value unique so we'll just take a second you
[00:19:22] unique so we'll just take a second you can go to the polls and enter in your
[00:19:24] can go to the polls and enter in your answer
[00:19:35] okay great so it looks like most people
[00:19:37] um, the vast majority of people got the right answer for the first one, which is it's 2
[00:19:41] answer for the first one which is it's 2 to the 7 in general the number of
[00:19:43] to the 7 in general the number of policies we have is going to be a to the
[00:19:48] policies we have is going to be a to the S because for every single state we
[00:19:50] S because for every single state we could choose any of the
[00:19:52] could choose any of the actions and also most people got the
[00:19:54] actions and also most people got the next one right which is great which is
[00:19:56] the optimal policy, the one with the star, is not always unique. It can be unique, it depends on
[00:20:02] unique it can be unique it depends on the problem um but it's not going to be
[00:20:04] the problem um but it's not going to be unique whenever more than one action has
[00:20:07] unique whenever more than one action has the same identical value so when you
[00:20:10] the same identical value so when you have ties
[00:20:11] have ties yeah um how do we generally deal with
[00:20:14] yeah um how do we generally deal with invalid actions because like for example
[00:20:18] invalid actions because like for example if we're in S1 and we choose left I
[00:20:21] if we're in S1 and we choose left I would imagine to me that's an invalid
[00:20:23] action; not sure what we really do with that.
[00:20:27] action not sure we really do with that yeah so the question was um if we have
[00:20:28] invalid actions. So in general you can have a different action space be possible in every state;
[00:20:34] action space be possible in every state that's also very common in like
[00:20:35] that's also very common in like recommendation engines that you know
[00:20:37] recommendation engines that you know you'd only it's only a subset of
[00:20:38] you'd only it's only a subset of Articles you might show to some people
[00:20:40] Articles you might show to some people based on their state um in this
[00:20:42] based on their state um in this particular example we're going to assume
[00:20:43] that the action is not actually 'go left', it's 'try left', and so if you try to go left and there's nothing there, you just fail and you stay in the same place. But in general, most of the
[00:20:52] same place but in general most of the time in the class we're going to assume
[00:20:53] time in the class we're going to assume the action space is the same for all
[00:20:54] the action space is the same for all states but in some cases it might be
[00:20:56] states but in some cases it might be different good question
[00:21:00] right okay so in mdp control we're going
[00:21:03] to want to not just evaluate a particular policy, but to compute the optimal policy. So
[00:21:09] going to compute the optimal policy so we want to take the argmax over the
[00:21:11] we want to take the argmax over the policy space which in general is that a
[00:21:13] policy space which in general is that a to the s
[00:21:15] to the s space and there is going to exist a
[00:21:17] space and there is going to exist a unique optimal value
[00:21:19] unique optimal value function and the optimal policy inside
[00:21:22] of a tabular MDP in an infinite horizon problem is stationary and deterministic. So those are two properties that are
[00:21:30] um those are two properties that are good to be familiar
[00:21:32] good to be familiar with so now we're going to think about
[00:21:34] with so now we're going to think about how do we actually compute this and what
[00:21:36] how do we actually compute this and what its other properties are so one is that
[00:21:38] its other properties are so one is that it's stationary um what I mean by that
[00:21:41] it's stationary um what I mean by that here is that in infinite Horizon problem
[00:21:44] here is that in infinite Horizon problem you always have an infinite number of
[00:21:46] you always have an infinite number of additional time steps and so the optimal
[00:21:49] additional time steps and so the optimal thing to do just depends on your state
[00:21:52] thing to do just depends on your state it doesn't depend on the time step we'll
[00:21:55] it doesn't depend on the time step we'll think more about what happens when you
[00:21:56] only have a finite number, where your horizon H is finite, and what might happen there. But
[00:22:01] finite and what might happen there but for most of today we're just going to
[00:22:02] for most of today we're just going to focus on the infinite Horizon problem
[00:22:05] focus on the infinite Horizon problem and as I said and most of you guys
[00:22:07] already knew, the optimal policy in general is not
[00:22:11] unique okay so one option is policy
[00:22:14] unique okay so one option is policy search and this is where we are going to
[00:22:17] get into... oh yeah, and remind me your name? Is the optimality conditional on the initial state? What do you mean by that? Uh, oh yes, the optimality, yes, it'll be
[00:22:33] uh oh yes the optimality yes it'll be per state yeah so you the optimal policy
[00:22:36] per state yeah so you the optimal policy will be defined per state the idea is
[00:22:38] will be defined per state the idea is that you can take a different action in
[00:22:39] that you can take a different action in every state and you want to know what
[00:22:41] every state and you want to know what the optimal thing is to do to maximize
[00:22:42] the optimal thing is to do to maximize your expected discounted summer rewards
[00:22:44] your expected discounted summer rewards from every state individually like Point
[00:22:46] from every state individually like Point wise good question yeah yeah two
[00:22:48] wise good question yeah yeah two interconnected questions um why is there
[00:22:51] interconnected questions um why is there a unique optimal value function and
[00:22:53] a unique optimal value function and second is um can you remind me again of
[00:22:57] second is um can you remind me again of what was the reason why
[00:22:59] what was the reason why may not necessarily be unique you
[00:23:00] may not necessarily be unique you mentioned a specific
[00:23:02] mentioned a specific case ah so the policy optimal policy is
[00:23:04] case ah so the policy optimal policy is not necessarily unique because there
[00:23:06] not necessarily unique because there could be more than one action with the
[00:23:07] could be more than one action with the same value and the optimal value
[00:23:09] same value and the optimal value function is unique for reasons we'll see
[00:23:11] function is unique for reasons we'll see later in this class like later today
[00:23:14] later in this class like later today we'll prove it um okay so the one of the
[00:23:18] we'll prove it um okay so the one of the things and this is going to go back to
[00:23:19] things and this is going to go back to the conceptual question I put at the
[00:23:21] the conceptual question I put at the beginning of class is we would like to
[00:23:23] beginning of class is we would like to ideally have methods and algorithms that
[00:23:26] have monotonic improvement capabilities, and so policy search is
[00:23:30] capabilities and so policy search is going to be one of those so what we're
[00:23:32] going to be one of those so what we're going to do here is we're going to try
[00:23:34] going to do here is we're going to try to search to compute the optimal policy
[00:23:36] to search to compute the optimal policy there's a to the S deterministic
[00:23:38] there's a to the S deterministic policies in general you could imagine
[00:23:40] policies in general you could imagine just you know enumerating all of them
[00:23:41] just you know enumerating all of them and evaluating them all but we can often
[00:23:44] and evaluating them all but we can often do better than that and when I say
[00:23:45] do better than that and when I say better what I mean here is we can reduce
[00:23:48] better what I mean here is we can reduce the computation needed to try to
[00:23:49] the computation needed to try to identify the optimal policy so we
[00:23:51] identify the optimal policy so we shouldn't have to iterate through all a
[00:23:53] shouldn't have to iterate through all a to the S
[00:23:56] policies. So how does policy iteration work? The idea is that we're going to
[00:24:00] work the idea is that we're going to alternate between having a candidate
[00:24:02] alternate between having a candidate decision policy that might be optimal
[00:24:04] decision policy that might be optimal we're going to evaluate it and then
[00:24:05] we're going to evaluate it and then we're going to see if we can improve it
[00:24:08] we're going to see if we can improve it and then if we can improve it we will
[00:24:09] and then if we can improve it we will and otherwise we're going to
[00:24:11] and otherwise we're going to Halt so what we do how we do this is
[00:24:13] Halt so what we do how we do this is we're just going to initialize it
[00:24:14] we're just going to initialize it randomly which just means we're going to
[00:24:16] randomly which just means we're going to start off and we're going to say for
[00:24:17] start off and we're going to say for every single state we're going to pick
[00:24:18] every single state we're going to pick an
[00:24:19] an action and then while our policy is
[00:24:22] action and then while our policy is still changing so this is the L1 Norm um
[00:24:25] still changing so this is the L1 Norm um it measures if the policy changed for
[00:24:26] it measures if the policy changed for any state just as a refresher
[00:24:29] any state just as a refresher what we're going to first do is we're
[00:24:30] what we're going to first do is we're going to um evaluate the policy and then
[00:24:32] going to um evaluate the policy and then we're going to try to improve it
[00:24:38] so so in order to do it to do that sort
[00:24:40] so so in order to do it to do that sort of policy Improvement step it's going to
[00:24:42] of policy Improvement step it's going to be helpful to define the Q function
[00:24:44] be helpful to define the Q function again I know for many of you this is
[00:24:45] again I know for many of you this is probably a review the Q function of a
[00:24:47] probably a review the Q function of a particular policy is just what is the
[00:24:49] reward of the immediate state and action,
[00:24:52] reward of the immediate state in action plus the discounted sum of future
[00:24:54] plus the discounted sum of future rewards if we were to after that action
[00:24:57] rewards if we were to after that action act according to the policy
[00:24:59] act according to the policy so it's sort of like saying okay first
[00:25:01] so it's sort of like saying okay first when you're in this state you're going
[00:25:02] when you're in this state you're going to take this action and then from then
[00:25:03] to take this action and then from then on you're going to follow whatever your
[00:25:04] on you're going to follow whatever your policy tells you to
[00:25:05] policy tells you to do so and for any of you who've seen Q
[00:25:09] do so and for any of you who've seen Q learning you you've seen this sort of
[00:25:11] learning you you've seen this sort of idea a lot okay so what we're going to
[00:25:14] idea a lot okay so what we're going to try to do in this case why would we want
[00:25:16] try to do in this case why would we want a q function it turns out it's going to
[00:25:17] a q function it turns out it's going to make the policy Improvement step really
[00:25:19] make the policy Improvement step really easy so what we're going to first do is
[00:25:21] easy so what we're going to first do is we're going to say I'm going to take my
[00:25:22] we're going to say I'm going to take my particular policy I'm going to compute
[00:25:24] particular policy I'm going to compute the Q value for that particular policy
[00:25:24] Pi i, because we're going to be iterating, and then after that we're going to compute a new policy, Pi i plus 1, by just taking the argmax of Q. So for our Q function, we're just going
[00:25:37] Q so for our Q function we're just going to say according to this Q function
[00:25:39] to say according to this Q function which says what are the is the expected
[00:25:41] which says what are the is the expected discounted sum of rewards if I start in
[00:25:43] discounted sum of rewards if I start in this state take this action and follow
[00:25:45] pi, which of those actions is the best, and we can define that per
[00:25:52] best and we can Define that per state yeah is there any relationship
[00:25:54] state yeah is there any relationship between the Q function and the value
[00:25:56] between the Q function and the value function because it kind of looks
[00:25:57] function because it kind of looks similar
[00:25:58] similar? Yeah, so we often call the Q function the state-action value function. Okay, all right, so this is sort of just what we do:
[00:26:10] so this is sort of just what we do now now we're going to have this Q function
[00:26:11] now we're going to have this Q function we're generally going to do this by
[00:26:13] we're generally going to do this by having this q and then we will do PI I +
[00:26:16] having this q and then we will do PI I + 1 of s is equal to
[00:26:20] argmax over a of Q of s, a, per state, and then we just repeat this over and over
[00:26:28] then we just repeat this over and over again
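Putting the pieces together, the loop just described — evaluate pi_i, compute Q for pi_i, set pi_{i+1}(s) to the argmax over actions, repeat until the policy stops changing — might look like the sketch below. This is an illustrative tabular implementation using my own array conventions, not code from the lecture:

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Tabular policy iteration.

    P: (n_actions, n_states, n_states) dynamics; R: (n_states, n_actions) rewards.
    Returns a deterministic policy (one action index per state) and its value.
    """
    n_actions, n_states, _ = P.shape
    pi = np.zeros(n_states, dtype=int)       # arbitrary initial policy
    while True:
        # Policy evaluation: the MDP plus pi is an MRP, solved exactly here.
        R_pi = R[np.arange(n_states), pi]
        P_pi = P[pi, np.arange(n_states), :]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: Q^pi(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s').
        Q = R + gamma * np.einsum('ast,t->sa', P, V)
        pi_new = np.argmax(Q, axis=1)
        if np.array_equal(pi_new, pi):       # policy stopped changing: halt
            return pi, V
        pi = pi_new

# Tiny made-up example: action 1 leads to state 1, which pays reward 1.
P = np.array([[[1.0, 0.0], [1.0, 0.0]],
              [[0.0, 1.0], [0.0, 1.0]]])
R = np.array([[0.0, 0.0], [1.0, 1.0]])
pi, V = policy_iteration(P, R, gamma=0.9)   # learns to take action 1 everywhere
```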
[00:26:30] again okay so there's um a number of questions
[00:26:32] okay so there's um a number of questions you might have about this you might say
[00:26:33] you might have about this you might say okay this seems like a vaguely
[00:26:34] okay this seems like a vaguely reasonable thing to do but does it have
[00:26:36] reasonable thing to do but does it have any formal properties are we guaranteed
[00:26:38] any formal properties are we guaranteed to improve you know what can we say
[00:26:40] to improve you know what can we say about
[00:26:40] about this so to do that I think it's useful
[00:26:43] this so to do that I think it's useful to delve into what the policy
[00:26:45] to delve into what the policy Improvement step is actually doing okay
[00:26:48] Improvement step is actually doing okay so what the policy Improvement uh what
[00:26:51] so what the policy Improvement uh what when we compute the Q function this is
[00:26:53] when we compute the Q function this is the equation for the Q function so we s
[00:26:56] the equation for the Q function so we s of take our old policy pii okay and then
[00:26:58] of take our old policy pii okay and then we compute the Q function of this and we
[00:27:01] we compute the Q function of this and we can do this
[00:27:04] can do this iteratively and now what we want to do
[00:27:06] iteratively and now what we want to do in this case is think about what is the
[00:27:09] in this case is think about what is the performance going to be of the new
[00:27:10] performance going to be of the new policy we extract
[00:27:13] policy we extract okay all right so what the Q function
[00:27:18] okay all right so what the Q function says is we're going to be able to show
[00:27:20] says is we're going to be able to show that the Q function the the best thing
[00:27:23] that the Q function the the best thing of the Q function is better than the
[00:27:25] of the Q function is better than the value of the old policy okay
[00:27:28] value of the old policy okay so what does this say so the first thing
[00:27:30] so what does this say so the first thing is just how we've computed this is just
[00:27:32] is just how we've computed this is just the policy evaluation
[00:27:37] step and we know that if we have a Q
[00:27:41] step and we know that if we have a Q function over s and a for a particular s
[00:27:44] function over s and a for a particular s Max over a of Q Pi of sa has to be at
[00:27:47] Max over a of Q Pi of sa has to be at least as good as the Q function for any
[00:27:51] least as good as the Q function for any of this any of the actions okay so we
[00:27:53] of this any of the actions okay so we know that this has to be this thing is
[00:27:56] know that this has to be this thing is always greater than equal to Q Pi I
[00:27:59] always greater than equal to Q Pi I of sa a for all
[00:28:03] a and then this is just that equation
[00:28:06] a and then this is just that equation this is just whatever what exactly is Q
[00:28:08] this is just whatever what exactly is Q pii of
[00:28:09] pii of sa is the
[00:28:12] sa is the definition almost except for it's
[00:28:15] definition almost except for it's particularly for the actions this is
[00:28:17] particularly for the actions this is first
[00:28:23] specifically if we were to follow the
[00:28:25] specifically if we were to follow the previous
[00:28:26] previous policy so remember this is the equation
[00:28:29] policy so remember this is the equation for Q Pi I of sa think about one of
[00:28:32] for Q Pi I of sa think about one of those actions that you could have done
[00:28:34] those actions that you could have done is exactly what the old policy would
[00:28:35] is exactly what the old policy would have told you to do that is what this
[00:28:38] have told you to do that is what this equation is if you just take a here and
[00:28:40] equation is if you just take a here and you plug in pi I of s
[00:28:51] a so that's just exactly what this is
[00:28:54] a so that's just exactly what this is and that is just the definition of v Pi
[00:28:56] and that is just the definition of v Pi I of s
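Written out, using R(s, a) and P(s' | s, a) for the reward and transition model (standard notation; the slide itself isn't reproduced in the transcript), the chain being pointed at is:

```latex
Q^{\pi_i}(s,a) = R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{\pi_i}(s'),
```

and plugging the old policy's action $a = \pi_i(s)$ into this recovers the value function, so for every state

```latex
\max_a Q^{\pi_i}(s,a) \;\ge\; Q^{\pi_i}(s, \pi_i(s)) \;=\; V^{\pi_i}(s).
```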
[00:28:59] Okay, so what is this saying? If you were to take your new policy π_{i+1} (remember, π_{i+1} is defined as the argmax of this Q function, whatever maximizes the old policy's Q function), what this says is: if you take π_{i+1}(s) for one action and then follow π_i forever, which is what the Q function represents, then our expected sum of rewards is at least as good as if we'd always followed π_i. That's what this equation is telling us. It's like: if I get to make one decision differently, and from then on I follow my old policy, the value I can expect is at least as good as the value I could expect if I had always followed the old policy.
[00:29:44] Okay, does anybody have any questions about that? Because the next step is going to build on it.
[00:29:52] [Student] Okay, can you go back to that? Sure, to the policy improvement, yeah. [Student] So the step we're talking about is this one, right, the policy improvement? Yeah. We're trying to see, when we do the policy improvement step and we extract the new policy (with argmax here instead of max), how the value of that relates to the value of the thing you could have done before in that state. And so this is just trying to say what Q^{π_i}(s, a) really is: it is the value you get if you first take a and then follow π_i from then onwards. So it's saying that if you were to do that, then this new action you've computed, this argmax policy, is actually better than what you would have gotten before, or at least as good.
[00:30:44] But the thing that should seem slightly strange to you is that I am not creating this sort of hybrid policy where I take one new action and then follow π_i forever. I'm creating an entirely new policy, where I'm not just going to follow π_{i+1} for one step, I'm going to follow it for all remaining steps. So this should not yet convince you that doing that is actually any better than my old policy. This would only say that if you take one new action and then follow your old policy, it's going to be at least as good as your old policy. That's why we have to do additional work to show that we're actually going to get a monotonic improvement if we follow the new policy always.
[00:31:25] policy always okay all right so let's go through that all right so what we're
[00:31:27] through that all right so what we're going to prove is we're going to say
[00:31:29] going to prove is we're going to say actually that's true the new policy we
[00:31:31] actually that's true the new policy we construct through this policy
[00:31:33] construct through this policy Improvement step is somewhat remarkably
[00:31:35] Improvement step is somewhat remarkably going to be strictly a monotonic
[00:31:38] going to be strictly a monotonic Improvement compared to the old policy
[00:31:40] Improvement compared to the old policy unless it's
[00:31:42] unless it's identical that means that every step of
[00:31:44] identical that means that every step of policy Improvement we're going to get a
[00:31:46] policy Improvement we're going to get a better and better policy for every
[00:31:50] better and better policy for every state okay so um and the only time we
[00:31:54] state okay so um and the only time we not is if if we've already converged
[00:31:56] not is if if we've already converged okay so let's go through that
[00:31:59] okay so let's go through that okay so this is going to prove to us
[00:32:01] okay so this is going to prove to us that the value of the old policy is less
[00:32:04] that the value of the old policy is less than or equal to the value of the new
[00:32:05] than or equal to the value of the new policy meaning we're going to get like
[00:32:07] policy meaning we're going to get like this monotonic
[00:32:09] this monotonic Improvement so what we're going to do in
[00:32:11] Improvement so what we're going to do in this case is we are going to first write
[00:32:14] this case is we are going to first write out so this is just the definition this
[00:32:16] out so this is just the definition this is the
[00:32:20] definition of Max over a of our Q pii
[00:32:25] definition of Max over a of our Q pii okay all right so let's just write out
[00:32:27] okay all right so let's just write out what this is okay this is going to be
[00:32:29] what this is okay this is going to be equal
[00:32:30] equal to and it'll be written out more neatly
[00:32:33] to and it'll be written out more neatly on the next page
[00:32:50] too okay so what did I do here I noticed
[00:32:54] too okay so what did I do here I noticed that the definition of pi I + 1 is
[00:32:58] that the definition of pi I + 1 is exactly the ARG Max of this expression
[00:33:00] exactly the ARG Max of this expression instead of Max so when we did the policy
[00:33:03] instead of Max so when we did the policy Improvement the way we did the policy
[00:33:04] Improvement the way we did the policy Improvement was we took the argmax of
[00:33:06] Improvement was we took the argmax of the Q function so instead of having this
[00:33:09] the Q function so instead of having this max out here I'm just going to plug in
[00:33:11] max out here I'm just going to plug in pi I + 1 because that's going to give me
[00:33:13] pi I + 1 because that's going to give me something that's exactly equal to the
[00:33:15] something that's exactly equal to the max a for that whole
[00:33:17] max a for that whole expression okay right and so this is
[00:33:21] expression okay right and so this is exactly equal to that but what we can
[00:33:23] What we can do next is substitute in the same terms and notice that this is less than or equal to Q^{π_i}(s', a'), because the value of π_i at s', that is, following that particular policy, always has to be less than or equal to taking the max over the Q function for that policy. Why is that true? Because either the max is the same as the π_i action, or there's a better action.
[00:34:01] Okay, all right, so that's the less-than-or-equals. And then we can just expand this expression out, and this is going to start to get a little bit messy, which is why it'll be nice to have it on the next slide too, but you can see how the expansion works. And why is this important? Because it's going to allow us to think about taking this new action not just on the first step but on all future steps. So what we had here was max_{a'} Q^{π_i}(s', a'); we're going to expand out what that expression is, because notice this thing here is exactly equal to this thing which we know is here. So we're just going to substitute it in. So this is R(s', π_{i+1}(s')) plus gamma times the sum over s'', where s'' I'm just using to mean two time steps away.
[00:35:14] Okay, why was that useful? Well, what we've just said is that the value of π_i at s is less than or equal to taking this new, better action for one time step and then following the old policy. I've now done that recursively: that quantity is also less than or equal to taking the new action once, then taking it again, and then following the old policy. And then you just repeat this; you just keep nesting it. What you can see is that these less-than-or-equals happen whenever, instead of plugging in the value of the old policy, you allow yourself to take a max over that Q function. And if you do this all the way out, it becomes exactly equal to V^{π_{i+1}}(s), hence the dot dot dot.
[00:35:59] Okay, so I have that here. What this has shown is that the value of the state under the old policy is less than or equal to the value of that state under the new policy. So this proves the monotonic improvement, which is super cool. It now says that if we do policy iteration, where you just keep computing the Q function and taking a max, you will always monotonically improve unless you stay the same.
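One way to see the theorem concretely is to record V^{π_i} at every iteration of policy iteration on a randomly generated MDP and check that each iterate dominates the previous one state by state. This is a hypothetical sketch consistent with the claim just proved; the function names and the random toy MDP are mine, not the lecture's.

```python
import numpy as np

def evaluate(P, R, pi, gamma):
    """Exact policy evaluation: V = (I - gamma * P_pi)^{-1} R_pi."""
    S = P.shape[0]
    idx = np.arange(S)
    return np.linalg.solve(np.eye(S) - gamma * P[idx, pi], R[idx, pi])

def improvement_trace(P, R, gamma=0.9, iters=50):
    """Run policy iteration, returning V^{pi_i} for each iterate pi_i."""
    S, A, _ = P.shape
    pi, trace = np.zeros(S, dtype=int), []
    for _ in range(iters):
        V = evaluate(P, R, pi, gamma)
        trace.append(V)
        new_pi = (R + gamma * P @ V).argmax(axis=1)   # greedy improvement step
        if np.array_equal(new_pi, pi):                # converged
            break
        pi = new_pi
    return trace

# Random MDP: every iterate should dominate the previous one, state by state
rng = np.random.default_rng(0)
S, A = 6, 3
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)   # normalize into transition probabilities
R = rng.random((S, A))
trace = improvement_trace(P, R)
```

Each pair of consecutive entries in `trace` should satisfy the monotonic-improvement inequality elementwise.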
[00:36:30] All right, so now let's do our next check-your-understanding, which is: given everything I've just said, if the policy doesn't change, can it ever change again? And is there a maximum number of iterations of policy iteration? [Student] On the previous slide, is the dot dot dot supposed to represent just algebraic manipulation? Yeah, you just keep expanding this all the way. Good question.
[00:37:01] All right, let's take a second and do the...
[00:38:09] Yeah, what's your name? [Student] Um, at what point did we show that this is actually leading to an improvement? Like, can we just stay at the same value level? Because the inequality was greater than or equal to, so is it possible that you're always equal to where you started? Yeah, it's a great question, and right, I've just shown less than or equal to. Well, I guess we can discuss this in a second, but it will be a monotonic improvement unless you're already the optimal policy. So if there's any state at which you can improve, you will, and if you stay the same... well, actually, we'll talk about this now, because it's nicely split between the answers for both of these questions. So maybe everybody turn to somebody nearby you and discuss whether you think the policy can ever change if it didn't change initially, and whether there is a maximum number of iterations, because it's pretty evenly split amongst the people who voted.
[00:40:55] I want to make sure to clarify something, because it came up in a good conversation. Let's assume for the moment there are no ties. I know I said that in general optimal policies can have ties, and that's true, but for the purposes of this question it is easiest to think about the case where there is only a single unique optimal policy. So why don't we do that. Again, none of these are for assessment, they're only for your own learning, but in terms of what you're thinking through, my intention was the simpler case where there is a single optimal policy. Under that case, can the policy ever change once it hasn't changed once? What I mean by "the policy doesn't change" is: we had a policy, we did policy improvement, and our new improved policy is the same as the old policy.
[00:41:39] So under the case that I just said, which is that policies are deterministic and there is a single optimal policy, raise your hand if you said that once the policy doesn't change, it can never change again. That's the correct answer. Okay, does somebody want to explain why?
[00:42:01] You're all correct. Yeah, remind me your name? [Student] Um, it kind of intuitively made sense, in the sense that you're doing the expected value, you're summing over all the actions, so even if there's stochasticity in the system you're still taking the average value, so if it didn't change before it won't change again. Yeah, you are taking those, so definitely along those lines. So if we look at the definition of the policy improvement step, let me just go a couple of slides back. Okay, so what we said is: we computed the Q function and then we extracted a new policy.
[00:42:40] If π_{i+1} is the same as π_i, is Q^{π_{i+1}} equal to Q^{π_i}? (I probably said that wrong, too many i's.) So the question is: if π_i is equal to π_{i+1}, is Q^{π_i} equal to Q^{π_{i+1}}? If it's the same policy, do they have the same Q function? Yes. Okay, so if your policy hasn't changed, meaning your old policy is the same as your new policy, then Q^{π_i} is equal to Q^{π_{i+1}}, which means that when you do this for Q^{π_{i+1}} and then try to extract a policy, it'll be exactly the same.
[00:43:27] So once you're stuck there, you'll be stuck forever. Now, if you have ties it's more complicated. If you have multiple actions that achieve the same Q value, it depends how you break them: if you break ties deterministically, you'll stay in the same place; if not, you may oscillate between all the policies which are optimal, otherwise known as all the policies that have the same Q^{π_i}. But in the simpler case that I mentioned, once you've gotten to that single policy, you won't ever change.
[00:43:57] policy you won't ever change and what that means is given that we
[00:44:00] and what that means is given that we also only have a finite number of
[00:44:01] also only have a finite number of policies if it's deterministic so
[00:44:04] policies if it's deterministic so assuming if we stick to determinance so
[00:44:06] assuming if we stick to determinance so I'll just say
[00:44:08] I'll just say no if Pi star is unique okay for all s
[00:44:15] no if Pi star is unique okay for all s so that means for every state there's a
[00:44:16] so that means for every state there's a unique optimal action um is there
[00:44:19] unique optimal action um is there maximum number of iterations for policy
[00:44:20] maximum number of iterations for policy iteration if you have deterministic
[00:44:22] iteration if you have deterministic policies there's only a of the S
[00:44:24] policies there's only a of the S policies as everyone was saying before
[00:44:26] policies as everyone was saying before which is great um and so since the
[00:44:28] which is great um and so since the policy Improvement step either improves
[00:44:31] policy Improvement step either improves the value of your policy or halts that
[00:44:33] the value of your policy or halts that means you only go through each policy
[00:44:36] means you only go through each policy once and at most once right like there
[00:44:39] once and at most once right like there might be some you never bother to go
[00:44:41] might be some you never bother to go through and so that means that policy
[00:44:43] through and so that means that policy iteration will Halt and it will take at
[00:44:45] iteration will Halt and it will take at most a to the S policies if it takes a
[00:44:47] most a to the S policies if it takes a to the s that means that you evaluated
[00:44:49] to the s that means that you evaluated every single policy in generally you
[00:44:51] every single policy in generally you won't
[00:44:53] won't okay so this is um what shows that we
[00:44:56] okay so this is um what shows that we actually they do get this monotonic
[00:44:57] actually they do get this monotonic Improvement this is really nice with
[00:44:59] Improvement this is really nice with every single because you could imagine
[00:45:00] every single because you could imagine in cases where like there there's oh
[00:45:03] in cases where like there there's oh question yeah
[00:45:06] sure yet that we're going to do better
[00:45:09] sure yet that we're going to do better than random right like there's no we
[00:45:11] than random right like there's no we haven't guaranteed that whatever we
[00:45:13] haven't guaranteed that whatever we converge to is better than
[00:45:15] converge to is better than random oh we we've show we've proven
[00:45:17] random oh we we've show we've proven that we're going to get get to the
[00:45:19] that we're going to get get to the optimal policy and the optimal policy
[00:45:22] optimal policy and the optimal policy may be just random right because
[00:45:24] may be just random right because depending on the the environment you
[00:45:27] depending on the the environment you might just like there there is no you
[00:45:29] might just like there there is no you can't do better than random is that you
[00:45:31] can't do better than random is that you mean in terms of like how you design
[00:45:32] mean in terms of like how you design actions yeah so for example if it is the
[00:45:34] actions yeah so for example if it is the case that all of your actions have
[00:45:36] case that all of your actions have exactly the same reward doesn't matter
[00:45:38] exactly the same reward doesn't matter whether you act randomly or you follow a
[00:45:40] whether you act randomly or you follow a policy the value you get would be
[00:45:42] policy the value you get would be exactly the same as random whether or
[00:45:44] exactly the same as random whether or not you can do better than the random
[00:45:45] not you can do better than the random will depend on the domain the hope is in
[00:45:48] will depend on the domain the hope is in general we can do a lot
[00:45:51] Okay, so we've shown an algorithm where, as we do more and more computation, we get better and better policies. And this is great, because you may not actually want to run until your policy entirely stops changing, particularly if the state space is very large. So if you have a hard time requirement, you can still guarantee that you're getting better and better, and maybe you stop after a hundred iterations or a thousand iterations and just use that policy. So this is one approach which has that nice monotonicity guarantee. [Student] Can you say what that is?
[00:46:25] guarante can you say what that is oh sure yes and what's your name uh
[00:46:28] sure yes and what's your name uh yeah so a here is the number of actions
[00:46:30] yeah so a here is the number of actions and S here is the number of states so
[00:46:32] and S here is the number of states so the decision policy space is for every
[00:46:34] the decision policy space is for every state you could to pick one of the
[00:46:36] state you could to pick one of the actions so you multiply all of
[00:46:41] those
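To make the |A|^|S| counting argument and the monotone-improvement loop concrete, here is a sketch of policy iteration on a tiny made-up two-state, two-action MDP (all transition and reward numbers are illustrative, not from the lecture):

```python
import numpy as np

# A tiny illustrative MDP (numbers are made up): 2 states, 2 actions.
# P[a][s][s'] = transition probability, R[s][a] = reward, gamma = discount.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # action 0
              [[0.5, 0.5], [0.0, 1.0]]])  # action 1
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9
n_states, n_actions = 2, 2

# Only |A|^|S| deterministic policies exist, so policy iteration,
# which never repeats a policy, must terminate.
print("number of deterministic policies:", n_actions ** n_states)  # 4

def evaluate(pi):
    """Exact policy evaluation: solve (I - gamma * P_pi) V = R_pi."""
    P_pi = np.array([P[pi[s], s] for s in range(n_states)])
    R_pi = np.array([R[s, pi[s]] for s in range(n_states)])
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

pi = np.zeros(n_states, dtype=int)  # arbitrary initial policy
while True:
    V = evaluate(pi)
    # Greedy improvement: argmax_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
    Q = R + gamma * np.einsum("ast,t->sa", P, V)
    new_pi = Q.argmax(axis=1)
    if np.array_equal(new_pi, pi):  # policy stopped changing -> never changes again
        break
    pi = new_pi
print("optimal policy:", pi, "values:", evaluate(pi))
```

Because each improvement step is greedy with respect to an exactly evaluated policy, the loop stops as soon as the policy repeats, matching the early-stopping point made above: you could also break out after a fixed iteration budget and still have a monotonically improved policy.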
[00:46:42] Okay, so this also shows, exactly as I said on the previous slide, that if your policy doesn't change, it'll never change again.
[00:46:53] [Partially inaudible student exchange] ... V star is unique ...
[00:47:03] Okay, so that's one way to go: policy iteration. The interesting thing about policy iteration is that at every time point you have an explicit policy, and that policy tells you how to act forever. When you compute the Q value under that policy, it says how much reward you will get if you take this action in this state and then follow that policy forever. Again, remember that today we're in the infinite-horizon case unless I specify otherwise.
[00:47:32] Along the way, though, a lot of the actions and decisions we make may not be very good, so your early policies might be pretty bad; we know we're monotonically improving, but the early policies might be bad. Value iteration is different. The idea is that at every iteration we maintain the optimal value of starting in a state, but as if we only get to make a finite number of decisions.
[00:47:55] So remember: in policy iteration we always have a policy, and we have the value of acting under it forever; it just might not be very good. Value iteration asks: what is the optimal thing for me to do if I can make just one decision — take one step? I figure out the optimal thing to do for one step. Then I imagine I get to take two steps, and I build on what I know I can do for one step to construct the optimal thing to do for two steps. So the interesting thing with value iteration is that you always have an optimal value, but for the wrong horizon.
[00:48:28] So one has a value for the full, infinite horizon, and it might be a bad policy; the other has the optimal value and the optimal thing to do, but for the wrong horizon.
[00:48:38] The idea in value iteration is that you keep going to longer and longer horizons — thinking of getting to take H + 1 steps, then H + 2 steps — and you build upon your previous solutions using dynamic programming. So let's see how to do that.
[00:48:52] dynamic programming so let's see how to do that okay so this is where we get
[00:48:55] do that okay so this is where we get into the Bellman equation is is the sort
[00:48:57] into the Bellman equation is is the sort of the seminal work of um Richard belman
[00:49:00] of the seminal work of um Richard belman and the idea here is as we've said is
[00:49:02] and the idea here is as we've said is that for a particular policy we satisfy
[00:49:04] that for a particular policy we satisfy the Bellman
[00:49:05] the Bellman equation and we can for turn that into
[00:49:08] equation and we can for turn that into an
[00:49:10] an algorithm so in particular there's a
[00:49:12] algorithm so in particular there's a thing called the Bellman backup operator
[00:49:14] thing called the Bellman backup operator and what it says is if you give me a
[00:49:16] and what it says is if you give me a value function which right now we can
[00:49:18] value function which right now we can think of it just being a vector later
[00:49:19] think of it just being a vector later we'll get into function
[00:49:21] we'll get into function approximation and we do a Bellman backup
[00:49:23] approximation and we do a Bellman backup essentially it's like saying I knew you
[00:49:25] essentially it's like saying I knew you know I had a value fun function and I
[00:49:27] know I had a value fun function and I want to think about what should I do if
[00:49:28] want to think about what should I do if I get to do the best thing that
[00:49:29] I get to do the best thing that maximizes my immediate reward plus my
[00:49:32] maximizes my immediate reward plus my expected future reward given that value
[00:49:34] expected future reward given that value function okay so it says I'm going to
[00:49:37] function okay so it says I'm going to figure out if I take a Max over all the
[00:49:39] figure out if I take a Max over all the actions what's the reward at that State
[00:49:41] actions what's the reward at that State in the action plus the discounted sum of
[00:49:43] in the action plus the discounted sum of rewards using the value function you've
[00:49:45] rewards using the value function you've given
[00:49:47] given me and what that does is it need yields
[00:49:50] me and what that does is it need yields a new Vector of of values over all your
[00:49:53] a new Vector of of values over all your States so this is being done person
[00:49:56] States so this is being done person State okay and this is called the
[00:49:58] State okay and this is called the Bellman operator comes up all the time
[00:50:02] Bellman operator comes up all the time okay all right so how do we do value
[00:50:05] okay all right so how do we do value iteration well we're just going to do
[00:50:07] iteration well we're just going to do this
[00:50:09] this recursively so we're just going to Loop
[00:50:11] recursively so we're just going to Loop until we hit convergence um uh just as a
[00:50:14] until we hit convergence um uh just as a refresher this is the um L Infinity norm
[00:50:18] refresher this is the um L Infinity norm and what that means is that it is equal
[00:50:20] and what that means is that it is equal to the max over S V of
[00:50:25] to the max over S V of s right out more
[00:50:31] carefully equal to Max over s v k + 1 of
[00:50:37] carefully equal to Max over s v k + 1 of s minus VK of
[00:50:40] s minus VK of s what that just means is that if you
[00:50:42] s what that just means is that if you have two vectors you look for every
[00:50:44] have two vectors you look for every single entry and you find the entry in
[00:50:46] single entry and you find the entry in which those two vectors are most
[00:50:48] which those two vectors are most different and that's the L Infinity Norm
[00:50:50] different and that's the L Infinity Norm just as a refresher for some of you who
[00:50:52] just as a refresher for some of you who might not have seen it or not seen it
[00:50:53] might not have seen it or not seen it recently so what value iteration does is
[00:50:56] recently so what value iteration does is is we're just going to have a loop it's
[00:50:59] is we're just going to have a loop it's going to look very similar to what we
[00:51:00] going to look very similar to what we saw for the markof reward process we're
[00:51:02] saw for the markof reward process we're going to initialize our value function
[00:51:04] going to initialize our value function and then for each date we're just going
[00:51:05] and then for each date we're just going to do this Bellman
[00:51:07] to do this Bellman backup and so it's like we took our
[00:51:09] backup and so it's like we took our previous value function we do our
[00:51:10] previous value function we do our Bellman backup and we get a new value
[00:51:13] Bellman backup and we get a new value function we do this over and over and
[00:51:15] function we do this over and over and over again until our value function
[00:51:16] over again until our value function stops
[00:51:18] stops changing so for policy iteration we kept
[00:51:20] changing so for policy iteration we kept going until our policy stops changing
[00:51:22] going until our policy stops changing here we keep going until our value
[00:51:24] here we keep going until our value function stops changing
[00:51:26] function stops changing and what that that condition means is it
[00:51:29] and what that that condition means is it says I keep going until the difference
[00:51:31] says I keep going until the difference between my old value of estate and my
[00:51:33] between my old value of estate and my new value of estate is really small for
[00:51:34] new value of estate is really small for all the
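A minimal sketch of that loop, assuming tabular arrays R[s, a] and P[a, s, s'] (my shape conventions, not the lecture's):

```python
import numpy as np

def value_iteration(R, P, gamma, eps=1e-8):
    """Repeat the Bellman backup until ||V_{k+1} - V_k||_inf < eps."""
    n_states = R.shape[0]
    V = np.zeros(n_states)  # V_0 = 0: the "no decisions left" value
    while True:
        # Bellman backup: (BV)(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
        V_new = (R + gamma * np.einsum("ast,t->sa", P, V)).max(axis=1)
        if np.max(np.abs(V_new - V)) < eps:  # L-infinity norm of the change
            return V_new
        V = V_new
```

Starting from V = 0, the first iterate is max_a R(s, a), matching the "one decision" interpretation the lecture gives below the algorithm.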
[00:51:36] Student: [name exchange] I have a question about value iteration: you just said it works with a finite horizon, while policy iteration works with [the infinite horizon] — why?
[00:51:49] Good question. Think of it this way: at the beginning, if you don't get to make any decisions, the expected discounted sum of rewards you get from any state is zero — you didn't get to make any decisions, so you never got any reward. The first round of this is like saying, "I get zero reward if I don't make any decisions." For the next round, k would be 1 here: if I get to make one decision, I take a max over a of my reward, plus the discount factor times zero. So it's now asking what the best thing is that I can do if I get to make one decision. Concretely, on the first round, if V is equal to zero for all s, then when we do this backup we get

    V_{k+1}(s) = max_a R(s, a)

because the future term will be zero if your value is zero. So before, I got no reward because I made no decisions; now, what's the best thing I should do if I get to make one decision? The next round I'll say, "What if I get to make one decision now?" — and then this will get plugged in as your future value.
[00:53:13] Student: The expression we're plugging into the max function — is that the same as Q(s, a)?
Good question. In general, that's going to be max_a Q(s, a), because here we're requiring ourselves to take the max over the actions we take. Great question.
[00:53:34] Student: In policy iteration, we were also initializing the values...
In policy iteration we were just randomly initializing our policy — saying, in this state you go left, in this state you go right, etc. — and then we were evaluating the value of that policy. When we do that evaluation, yes, in that part we were setting V equal to zero and then doing this iteratively.
[00:54:06] Student: [partially inaudible] If we had done this ... comparing successive policies ...
Well, we can do that as part of the policy evaluation for policy iteration — but what do you mean, we would be able to detect...? [Student clarifies, partially inaudible.] That's right, good point: inside policy iteration, instead of halting when your policy has stopped changing, you could also halt when your value function stops changing.
[00:54:38] Student: [partially inaudible question] And what's your name? ...
[00:54:51] Yeah, it's a great question. I don't know — I'll look that up for next week. To my knowledge there isn't an ordering between them that would not be instance dependent. In practice, policy search is very, very popular; I think part of it is probably that it also often has this nice monotonic improvement. Value iteration does not necessarily have the monotonic improvement property: it is always the optimal thing to do, but for the wrong horizon, whereas the other one may not be optimal for ages but will always be monotonically improving.
[00:55:41] Great questions. Okay, let's see what the properties are for value iteration, because these are really useful, and we'll see why this whole thing ends up working. I just want to highlight here that you can think of policy iteration also in terms of Bellman operations — I think this speaks to the question just asked. The Bellman backup operator for a particular policy pi is defined as follows; you don't see the max anymore, you are just committing to a particular policy:

    B^pi V(s) = R(s, pi(s)) + gamma * sum_{s'} P(s'|s, pi(s)) V(s')

Policy evaluation then amounts to computing the fixed point, and I'll define that more formally in a second. To do policy evaluation, you just repeatedly apply this operator until V stops changing; that was the iterative algorithm we saw before, just with different notation.
[00:56:23] All right — we'll talk in a second about fixed points. Another way to do policy improvement is to explicitly do another backup, but take the argmax instead of the max. That's the only difference, and it's the same thing you're doing with the Q function: this is argmax_a Q^{pi_k}(s, a). I'm just showing you different notations for the same thing; this also helps when we talk about the Bellman backup for a particular policy. But normally, when people say Bellman backup, they mean the one for the optimal value.
[00:57:00] All right, let's go back to value iteration, because while I've told you how to compute a value function, I haven't told you how to get a policy out of it. The standard way to do this is: you go through this process, and then you do it, say, one more time and extract the argmax instead of the max to actually get your policy. Normally in this case you don't bother to compute a policy along the way; you just do value iteration a bunch of times, and then at some point you extract a policy.
[00:57:26] All right, let's look at some properties of this and why it's a good thing. We've already seen that policy iteration is guaranteed to converge: there's only a finite number of policies, you never repeat a policy, and so either you're at the optimal policy already or you keep improving. For value iteration, it may not be clear yet that this should converge.
[00:57:45] be clear yet that this should converge so I'm first going to just
[00:57:47] converge so I'm first going to just Define a contraction operator so let's
[00:57:49] Define a contraction operator so let's let OB be an operator like the Bellman
[00:57:51] let OB be an operator like the Bellman backup so if you can just think of as
[00:57:53] backup so if you can just think of as like an algebraic equation if this is
[00:57:54] like an algebraic equation if this is something you know if you haven't seen
[00:57:56] something you know if you haven't seen before which is totally fine um and then
[00:57:58] before which is totally fine um and then this is just going to denote any Norm so
[00:58:00] this is just going to denote any Norm so like the L Infinity Norm or
[00:58:03] like the L Infinity Norm or others if when you apply the operator to
[00:58:06] others if when you apply the operator to two different say value functions we can
[00:58:09] two different say value functions we can just think of these here as
[00:58:11] just think of these here as vectors and that distance after you
[00:58:13] vectors and that distance after you apply the operator is smaller than the
[00:58:15] apply the operator is smaller than the distance before then it's a contraction
[00:58:18] distance before then it's a contraction operator so just give it a bit of
[00:58:19] operator so just give it a bit of intuition for this in case contraction
[00:58:21] intuition for this in case contraction operators isn't something you've seen
[00:58:22] operators isn't something you've seen before if you think about having two
[00:58:25] before if you think about having two value functions and then and there is
[00:58:26] value functions and then and there is some states on which they really differ
[00:58:28] some states on which they really differ in their value what this says is that if
[00:58:30] in their value what this says is that if you then apply an operator to them and
[00:58:32] you then apply an operator to them and we're going to prove that the Bellman
[00:58:33] we're going to prove that the Bellman operator is one of them that afterwards
[00:58:35] operator is one of them that afterwards they get closer together so that Max
[00:58:37] they get closer together so that Max difference between the states is smaller
[00:58:39] difference between the states is smaller afterwards yeah is
[00:58:41] Student: Is this an "if" or an "if and only if" — can there be contraction operators that don't satisfy this?
Oh, you mean are there... I'd have to check; I'm not an expert in all of contraction operators. What I will show is that the Bellman operator satisfies this statement, and therefore we can show that we're going to converge to a fixed point.
[00:59:10] All right. In particular, under a couple of minor assumptions — your discount factor is less than one, or you end up in a terminal state with probability one; essentially, both of these make sure that your expected sum of discounted rewards is bounded — the Bellman backup where you do this max is a contraction. That means the distance between two value functions is going to shrink and shrink: your V_{k+1} versus V_k, that distance in terms of the max difference at any of the states, is going to get smaller. We'll go through that now.
[00:59:46] smaller and we we'll go through that now okay so this is proving that we end up
[00:59:50] okay so this is proving that we end up getting the um this is a contraction
[00:59:53] getting the um this is a contraction operator so the bell bone backup is a
[00:59:54] operator so the bell bone backup is a contraction operator on V for gamma less
[00:59:56] contraction operator on V for gamma less than one let me just make this large
[00:59:59] than one let me just make this large okay so we're going to use the infinity
[01:00:01] okay so we're going to use the infinity Norm which again is just same for where
[01:00:03] Norm which again is just same for where is the max difference in the values for
[01:00:05] is the max difference in the values for any two states and what I'm defining
[01:00:07] any two states and what I'm defining here is two different value functions so
[01:00:10] here is two different value functions so this could be anything what I'm going to
[01:00:12] this could be anything what I'm going to try to show is that after you do the
[01:00:13] try to show is that after you do the Bellman operator that can be no larger
[01:00:15] Bellman operator that can be no larger than the max difference before you did
[01:00:17] than the max difference before you did the Bellman backup
[01:00:19] the Bellman backup okay all right so what we have here this
[01:00:23] okay all right so what we have here this is the first inequality so this is
[01:00:24] is the first inequality so this is important what I'm going to say is right
[01:00:26] important what I'm going to say is right now so this is just the definition of
[01:00:28] now so this is just the definition of the Bellman backup operator what you can
[01:00:31] the Bellman backup operator what you can see here is I have two different
[01:00:32] see here is I have two different Maxes because I'm going to do the max
[01:00:35] Maxes because I'm going to do the max over a for the first value function and
[01:00:37] over a for the first value function and a Max over a prime for the other value
[01:00:40] a Max over a prime for the other value function what I'm going to say now is if
[01:00:42] function what I'm going to say now is if you do that instead this has to be less
[01:00:45] you do that instead this has to be less than or equal to if you pulled the max a
[01:00:48] than or equal to if you pulled the max a out and you required both of them to use
[01:00:52] out and you required both of them to use the same action
[01:01:00] and why is this true because essentially
[01:01:02] and why is this true because essentially what we're allowing here is we're
[01:01:04] what we're allowing here is we're allowing us before we could pick
[01:01:07] allowing us before we could pick different actions for both bman backups
[01:01:12] different actions for both bman backups and now we can pick
[01:01:21] one okay so that means that instead of
[01:01:24] one okay so that means that instead of getting to maximize the second thing
[01:01:26] getting to maximize the second thing separately we're just going to try to
[01:01:26] separately we're just going to try to maximize the difference so that's the
[01:01:28] maximize the difference so that's the first place this less than or equal to
[01:01:29] first place this less than or equal to is going to come in okay once we do that
[01:01:32] is going to come in okay once we do that now everything's taking the same action
[01:01:33] now everything's taking the same action on the same time step so we can um get
[01:01:35] on the same time step so we can um get rid of these because they're
[01:01:39] rid of these because they're identical
[01:01:41] identical okay so we can just say this is just
[01:01:43] okay so we can just say this is just exactly equal to Max a I'm going to pull
[01:01:46] exactly equal to Max a I'm going to pull out the discount factor of sum over S
[01:01:49] out the discount factor of sum over S Prime probability of S Prime given
[01:01:52] Prime probability of S Prime given s,a of VK of S Prime minus VJ of S
[01:01:59] s,a of VK of S Prime minus VJ of S Prime okay now again what I'm going to do is
[01:02:02] Prime okay now again what I'm going to do is I'm going to bound this and I'm going to
[01:02:04] I'm going to bound this and I'm going to say the difference between the two value
[01:02:06] say the difference between the two value functions that any state is always less
[01:02:08] functions that any state is always less than or equal to their Max difference
[01:02:11] than or equal to their Max difference across all the states okay so this is
[01:02:13] across all the states okay so this is less than or equal to Max
[01:02:15] less than or equal to Max a gamma sum over S Prime probability of S
[01:02:20] a gamma sum over S Prime probability of S Prime given an
[01:02:22] Prime given an s,a of VK minus VJ
[01:02:27] okay because the difference between any
[01:02:30] okay because the difference between any between particular States is always less
[01:02:31] between particular States is always less than the max difference between any of
[01:02:33] than the max difference between any of the states so I upper bounded it using
[01:02:36] the states so I upper bounded it using this expression okay but now this term
[01:02:39] this expression okay but now this term Now does not depend on States and so I
[01:02:43] Now does not depend on States and so I can take it out of the sum okay this is
[01:02:45] can take it out of the sum okay this is just some constant it's like seven so
[01:02:47] just some constant it's like seven so this is equal to Max over a gamma
[01:03:03] but this is just a transition model and
[01:03:05] but this is just a transition model and the probability that we go to some state
[01:03:07] the probability that we go to some state has to sum up to one if we sum over all
[01:03:10] has to sum up to one if we sum over all next states because for any state and
[01:03:12] next states because for any state and action you're in you always have to go
[01:03:13] action you're in you always have to go to some next state so this is just equal
[01:03:15] to some next state so this is just equal to
[01:03:16] to one so we get that this is just equal to
[01:03:20] one so we get that this is just equal to Max over
[01:03:21] Max over a and now there's no more dependence on
[01:03:24] a and now there's no more dependence on a so it just disappears
[01:03:28] okay so what that said is that the
[01:03:30] okay so what that said is that the distance the max difference between the
[01:03:33] distance the max difference between the Bellman backed
[01:03:35] Bellman backed up value functions we get from starting
[01:03:38] up value functions we get from starting with two different value functions has
[01:03:39] with two different value functions has to be no
[01:03:41] to be no larger than the max difference between
[01:03:43] larger than the max difference between the value functions before times
[01:03:49] gamma and if your gamma's less than one
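For reference, the chain of steps just described can be written out compactly (a recap of the spoken derivation, with ‖·‖∞ the max-norm and B the Bellman backup operator; the last step uses that transition probabilities sum to one):

```latex
\begin{aligned}
\|BV_k - BV_j\|_{\infty}
&= \max_{s}\Big|\max_{a}\big[R(s,a)+\gamma\textstyle\sum_{s'}p(s'\mid s,a)\,V_k(s')\big]
   -\max_{a'}\big[R(s,a')+\gamma\textstyle\sum_{s'}p(s'\mid s,a')\,V_j(s')\big]\Big|\\
&\le \max_{s,a}\Big|\gamma\textstyle\sum_{s'}p(s'\mid s,a)\,\big(V_k(s')-V_j(s')\big)\Big|
 \le \gamma\,\max_{s,a}\textstyle\sum_{s'}p(s'\mid s,a)\,\|V_k-V_j\|_{\infty}
 = \gamma\,\|V_k-V_j\|_{\infty}.
\end{aligned}
```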
[01:03:51] gamma and if your gamma's less than one that means you're strictly
[01:03:53] that means you're strictly Contracting because it means that that
[01:03:55] Contracting because it means that that Max difference has to be smaller than it
[01:03:57] Max difference has to be smaller than it was before it would be like 0.9 times
[01:03:59] was before it would be like 0.9 times at most 0.9 times whatever
[01:04:01] at most 0.9 times whatever the distance was
[01:04:03] the distance was before okay so that's really really cool
[01:04:06] before okay so that's really really cool because that means now that if we apply
[01:04:08] because that means now that if we apply value iteration we're repeatedly doing
[01:04:09] value iteration we're repeatedly doing the Bellman backup we're shrinking the
[01:04:12] the Bellman backup we're shrinking the distance so if you think of having a
[01:04:13] distance so if you think of having a series of value functions so you've got
[01:04:15] series of value functions so you've got like v0 and V1 and V2 and V3 V4 dot dot
[01:04:20] like v0 and V1 and V2 and V3 V4 dot dot dot can think of what this distance
[01:04:23] dot can think of what this distance is okay and what this is saying saying
[01:04:26] is okay and what this is saying saying is that these
[01:04:27] is that these distances are going to be shrinking over
[01:04:31] distances are going to be shrinking over time and I've told you before that the
[01:04:34] time and I've told you before that the value function is unique so that means
[01:04:36] value function is unique so that means as you shrink and shrink and shrink and
[01:04:37] as you shrink and shrink and shrink and Shrink this is eventually going to
[01:04:40] Shrink this is eventually going to become a unique value
[01:04:43] become a unique value function because if there were you can
[01:04:46] function because if there were you can think about it too if there were two
[01:04:47] think about it too if there were two different value functions you can think
[01:04:48] different value functions you can think about what would happen after you do a
[01:04:50] about what would happen after you do a Bellman backup
[01:04:52] Bellman backup operator would they still be different
[01:04:56] operator would they still be different okay so this proves it's a
[01:04:59] okay so this proves it's a contraction yeah and just to note this
[01:05:01] contraction yeah and just to note this here even if all the inequalities are
[01:05:03] here even if all the inequalities are equalities is still a contraction if
[01:05:04] equalities is still a contraction if gamma is less than one still kind of
[01:05:06] gamma is less than one still kind of making
[01:05:07] making progress all right so here's some
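To see the contraction numerically, here is a small sketch (not course code; the MDP below is a made-up random one) that applies the Bellman backup to two different value functions and checks that their max-norm distance shrinks by at least a factor of gamma each step:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 3, 0.9

# random transition model P[s, a, s'] and reward R[s, a] (illustrative only)
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)  # each (s, a) row sums to one
R = rng.random((n_states, n_actions))

def bellman_backup(V):
    # (BV)(s) = max_a [ R(s, a) + gamma * sum_s' P(s'|s, a) V(s') ]
    return (R + gamma * (P @ V)).max(axis=1)

Vk, Vj = rng.random(n_states), rng.random(n_states)
for _ in range(10):
    dist_before = np.abs(Vk - Vj).max()
    Vk, Vj = bellman_backup(Vk), bellman_backup(Vj)
    dist_after = np.abs(Vk - Vj).max()
    # the proof's guarantee: ||BVk - BVj|| <= gamma * ||Vk - Vj||
    assert dist_after <= gamma * dist_before + 1e-12
```

Running the loop, the distance keeps dropping toward zero, which is exactly why repeated backups converge to the single fixed point.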
[01:05:09] progress all right so here's some thoughts in case you want to think about
[01:05:11] this more to prove the value iteration
[01:05:13] convergence to a unique solution for
[01:05:14] discrete state and action spaces whether
[01:05:16] discret and state action spaces whether initialization matters at all and is the
[01:05:20] initialization matters at all and is the value of the policy extracted from value
[01:05:21] value of the policy extracted from value iteration at each round guaranteed to
[01:05:23] iteration at each round guaranteed to monotonically improve
[01:05:26] monotonically improve so these are all great things to think
[01:05:29] so these are all great things to think about okay so let's go back to sort of
[01:05:31] about okay so let's go back to sort of more practically this is then value
[01:05:34] more practically this is then value iteration for finite well actually I'll
[01:05:35] iteration for finite well actually I'll pause here in case anybody has a
[01:05:37] question about the proof yes remind me your
[01:05:39] name
[01:05:39] name can you go back to yeah um I I
[01:05:44] name can you go back to yeah um I I understand except the first one where we
[01:05:47] understand except the first one where we like take the max out of the norm Max
[01:05:50] like take the max out of the norm Max action why is that greater than having
[01:05:53] action why is that greater than having separate inside yeah good question so um
[01:05:57] separate inside yeah good question so um this uh what this is saying here is you
[01:06:00] this uh what this is saying here is you know we have you could think of this a q
[01:06:01] know we have you could think of this a q function so we get to pick a Max for
[01:06:03] function so we get to pick a Max for that Q function and then we subtract off
[01:06:06] that Q function and then we subtract off the max over another q function and you
[01:06:09] the max over another q function and you could imagine that you could think of
[01:06:11] could imagine that you could think of there being lots of different pairs of
[01:06:13] there being lots of different pairs of actions in this case and either this Max
[01:06:16] actions in this case and either this Max is the same as this one so it's actually
[01:06:19] let's say in particular concretely that
[01:06:21] this max gives you A1 so
[01:06:24] this one is either A1 or
[01:06:27] those one so this one is either A1 or another
[01:06:29] another a if it's another a that's just because
[01:06:32] a if it's another a that's just because this is actually larger than what Qs of
[01:06:36] this is actually larger than what Qs of like so let me just write this out just
[01:06:37] like so let me just write this out just in case so we can think of there as
[01:06:39] either being QJ of s A1
[01:06:43] either being QJ of s A1 under this or QJ of s a where a is not
[01:06:50] under this or QJ of s a where a is not equal to A1 okay so either this Max is
[01:06:54] equal to A1 okay so either this Max is exactly the same as this one in which
[01:06:56] exactly the same as this one in which case this is equal or this is different
[01:06:59] case this is equal or this is different and the only time it would be different
[01:07:00] and the only time it would be different is if that value was actually larger
[01:07:02] than the value of QJ of s A1 and if this was
[01:07:05] than the value of Q sa1 and if this was larger this difference would be smaller
[01:07:08] larger this difference would be smaller because you'd be subtracting off a
[01:07:09] because you'd be subtracting off a larger
[01:07:12] value so that's why we can turn this
[01:07:15] value so that's why we can turn this into an inequality it's either the same
[01:07:17] into an inequality it's either the same if they happened to have both picked the
[01:07:18] if they happened to have both picked the same action or it would have picked
[01:07:19] same action or it would have picked another action for which that whole
[01:07:21] another action for which that whole difference would have been
[01:07:22] difference would have been smaller good question
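The step being discussed is the generic fact that max over a of Qk(s,a) minus max over a' of Qj(s,a') is at most max over a of (Qk(s,a) - Qj(s,a)). A quick numerical sketch to convince yourself (random numbers standing in for the two Q functions, purely for illustration):

```python
import random

random.seed(0)
for _ in range(1000):
    # two arbitrary "Q rows" over 5 actions
    qk = [random.gauss(0, 1) for _ in range(5)]
    qj = [random.gauss(0, 1) for _ in range(5)]
    lhs = max(qk) - max(qj)                   # separate maxes, as in the backup
    rhs = max(k - j for k, j in zip(qk, qj))  # both forced to share one action
    # forcing a shared action can only make the bound larger
    assert lhs <= rhs + 1e-12
```

This matches the spoken argument: if the second max picks a different action than the first, it is only because that action makes Qj larger, which makes the overall difference smaller.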
[01:07:27] okay yeah can you go back to the like
[01:07:29] okay yeah can you go back to the like the questions you were posing yep yeah
[01:07:32] the questions you were posing yep yeah um is the value of the third question
[01:07:35] um is the value of the third question out there is that is the value of the
[01:07:37] out there is that is the value of the policy extracted from value
[01:07:39] policy extracted from value iteration aren't we isn't isn't that
[01:07:44] iteration aren't we isn't isn't that like implicit within value iteration
[01:07:45] like implicit within value iteration that it like with each uh like each new
[01:07:49] that it like with each uh like each new value function is better than the
[01:07:50] value function is better than the previous one and therefore the policy
[01:07:52] previous one and therefore the policy will also be better or that's a question
[01:07:54] will also be better or that's a question yeah
[01:07:56] yeah the question whether it is guaranteeing
[01:07:57] the question whether it is guaranteeing that we have not proven anything about
[01:07:59] that yet we proved that for policy
[01:08:01] iteration but this is just to think about
[01:08:03] duration but this is just to think about it in this
[01:08:05] it in this case Okay so let's go now anybody else
[01:08:07] case Okay so let's go now anybody else have a question on the
[01:08:11] proof okay all right so one thing I just
[01:08:13] proof okay all right so one thing I just want to mention briefly and it'll come
[01:08:15] want to mention briefly and it'll come up on the homework is thinking about
[01:08:16] up on the homework is thinking about this for finite Horizons so most of
[01:08:18] this for finite Horizons so most of today almost all of today we've assumed
[01:08:20] today almost all of today we've assumed that we get to act forever so we have an
[01:08:22] that we get to act forever so we have an infinite Horizon but there will
[01:08:23] infinite Horizon but there will certainly be cases where there's a
[01:08:25] certainly be cases where there's a finite
[01:08:26] finite Horizon and in this case this goes back
[01:08:29] Horizon and in this case this goes back to the sort of thinking of value
[01:08:30] to the sort of thinking of value iteration is just Computing the optimal
[01:08:32] iteration is just Computing the optimal value for each
[01:08:35] value for each Horizon so if we have K 1 to H H being
[01:08:39] Horizon so if we have K 1 to H H being like the max Horizon you want to compute
[01:08:40] like the max Horizon you want to compute for then for each date at each round you
[01:08:43] for then for each date at each round you would have a a value function k+ 1 which
[01:08:46] would have a a value function k+ 1 which tells you sort of how many decisions you
[01:08:47] tells you sort of how many decisions you make what your horizon is um and we
[01:08:51] make what your horizon is um and we would just do this back up so this looks
[01:08:52] would just do this back up so this looks exactly the same as what we saw before
[01:08:54] exactly the same as what we saw before but now you could also get a policy here
[01:08:57] but now you could also get a policy here so this would be the policy associated
[01:08:58] so this would be the policy associated with that value function of what is the
[01:09:01] with that value function of what is the argmax
[01:09:02] action so it computes a series of
[01:09:06] action so it compute a series of policies one for each
[01:09:08] policies one for each step
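A sketch of that finite-horizon backup (the MDP arrays here are made-up for illustration; V holds the optimal value with k steps to go, and each round also records the argmax policy):

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, H = 4, 3, 5
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)   # transition model
R = rng.random((n_states, n_actions))

V = np.zeros(n_states)   # value with 0 decisions left
policies = []            # one policy per horizon length
for k in range(H):
    Q = R + P @ V                      # one-step lookahead (no discount needed)
    policies.append(Q.argmax(axis=1))  # policy with k+1 steps to go
    V = Q.max(axis=1)                  # V_{k+1}
```

Note the result is a sequence of H policies, one per remaining horizon, rather than a single policy.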
[01:09:12] okay one other thing I want to mention
[01:09:14] okay one other thing I want to mention here and there'll be um we'll talk about
[01:09:16] here and there'll be um we'll talk about this on the homework too um one other
[01:09:19] this on the homework too um one other thing I I want to mention here is that
[01:09:21] thing I I want to mention here is that you can also just simulate the value um
[01:09:24] you can also just simulate the value um and then uh of a particular policy so
[01:09:27] and then uh of a particular policy so this is also really popular once we
[01:09:28] this is also really popular once we start to get into really big state
[01:09:30] start to get into really big state spaces so if we think of the fact that
[01:09:32] spaces so if we think of the fact that in a lot of these algorithms we're doing
[01:09:33] in a lot of these algorithms we're doing some sort of policy evaluation one thing
[01:09:36] some sort of policy evaluation one thing you could do is you just take your
[01:09:38] you could do is you just take your policy and you know what your Dynamics
[01:09:39] policy and you know what your Dynamics model is you know what your reward is
[01:09:41] model is you know what your reward is and you just roll it out so if you're in
[01:09:43] and you just roll it out so if you're in a state you simulate what the next state
[01:09:45] a state you simulate what the next state might be then you get a reward um so you
[01:09:49] might be then you get a reward um so you can just generate a really large number
[01:09:50] can just generate a really large number of episodes and then just average them
[01:09:53] of episodes and then just average them so like I'm just like oh how you know
[01:09:54] so like I'm just like oh how you know how good is this policy if your boss
[01:09:56] how good is this policy if your boss asking you you just like run it on 100
[01:09:57] asking you you just like run it on 100 people you average their rewards and
[01:09:59] people you average their rewards and then you're
[01:10:00] then you're done and this is something that becomes
[01:10:02] done and this is something that becomes really popular when it starts to be say
[01:10:04] really popular when it starts to be say hard to write down what that Dynamics
[01:10:06] hard to write down what that Dynamics model is explicitly or do that kind of
[01:10:08] model is explicitly or do that kind of sum over S Prime but it's really easy to
[01:10:11] sum over S Prime but it's really easy to sample and I'll note for that that um
[01:10:13] sample and I'll note for that that um you could use concentration inequalities
[01:10:15] like Hoeffding's inequality um and uh
[01:10:17] like hting inequality um and uh Bernstein for those of you familiar with
[01:10:19] Bernstein for those of you familiar with them to bound how many how much data do
[01:10:22] them to bound how many how much data do you need for this estimate to be close
[01:10:24] you need for this estimate to be close to the true
[01:10:26] to the true one and the great thing is that it's not
[01:10:28] one and the great thing is that it's not that many so like if you have an
[01:10:30] that many so like if you have an enormous State space like I don't know
[01:10:32] enormous State space like I don't know your Amazon or something like that or
[01:10:33] your Amazon or something like that or you've got patient data and it's
[01:10:35] you've got patient data and it's incredibly High dimensional um you don't
[01:10:37] incredibly High dimensional um you don't have to do that huge sum over S Prime
[01:10:39] you can just sample and your
[01:10:41] accuracy generally improves by one over
[01:10:43] accuracy generally improves by one over square root n the number of samples
[01:10:45] square root n the number of samples you're doing and the nice thing is that
[01:10:47] you're doing and the nice thing is that this also requires no assumption about
[01:10:49] the Markov structure so you might have
[01:10:51] the Markoff structure so you might have a partially observable scenario which
[01:10:52] a partially observable scenario which also comes up a lot in things like
[01:10:54] also comes up a lot in things like healthcare and then you can just roll
[01:10:55] healthcare and then you can just roll out your policy and just see how well it
[01:10:58] out your policy and just see how well it works okay well in healthcare you
[01:10:59] works okay well in healthcare you probably wouldn't just randomly roll out
[01:11:01] any policy but um you probably see what I
[01:11:03] mean yeah remember your
[01:11:15] name that's right so here this is just
[01:11:17] name that's right so here this is just this is just the policy evaluation stage
[01:11:19] this is just the policy evaluation stage exactly and you could either do it to
[01:11:21] exactly and you could either do it to compute the value of a policy or as you
[01:11:22] compute the value of a policy or as you just just suggested do it for the Q
[01:11:25] just just suggested do it for the Q value so you like start off in a state
[01:11:27] value so you like start off in a state for each of the different actions then
[01:11:28] for each of the different actions then you roll out the policy till the end and
[01:11:31] you roll out the policy till the end and this is just a really popular other
[01:11:33] this is just a really popular other technique and it'll come up in other
[01:11:34] technique and it'll come up in other places so I wanted to start saying we
[01:11:36] places so I wanted to start saying we can do that here
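A sketch of that rollout idea on a tiny made-up two-state MDP (everything below is an illustrative assumption, not course code): simulate episodes under the policy, average the discounted returns, and by the Hoeffding-style bounds just mentioned the error shrinks roughly as one over the square root of the number of rollouts.

```python
import random

random.seed(0)
gamma = 0.9

# toy MDP: P[s][a] = list of (prob, next_state), R[s][a] = reward
P = {0: {0: [(0.8, 0), (0.2, 1)]}, 1: {0: [(1.0, 1)]}}
R = {0: {0: 1.0}, 1: {0: 0.0}}
policy = {0: 0, 1: 0}

def rollout(s, horizon=200):
    # one simulated episode: accumulate discounted reward under the policy
    ret, disc = 0.0, 1.0
    for _ in range(horizon):
        a = policy[s]
        ret += disc * R[s][a]
        disc *= gamma
        probs, nexts = zip(*P[s][a])
        s = random.choices(nexts, weights=probs)[0]
    return ret

n = 2000
estimate = sum(rollout(0) for _ in range(n)) / n  # Monte Carlo policy evaluation

# returns live in [0, 1/(1-gamma)], so the estimate concentrates around the
# true value at a rate on the order of vmax / sqrt(n)
vmax = 1 / (1 - gamma)
```

Notice the loop never touches the full sum over next states, only samples from the transition model, which is the point being made about very large state spaces.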
[01:11:37] can do that here too okay so you can also think about
[01:11:41] too okay so you can also think about doing all of these in the case of the
[01:11:42] Mars Rover and I won't go through
[01:11:44] it now but you can use these as
[01:11:46] it now but I'll you can use these as sort of examples to step through these
[01:11:48] sort of examples to step through these different algorithms okay and kind of
[01:11:50] different algorithms okay and kind of think about how you would compute these
[01:11:52] think about how you would compute these type of
[01:11:53] type of policies all right
[01:11:57] policies all right so I will maybe I'll I'll get to the end
[01:12:00] so I will maybe I'll I'll get to the end of this but I'll leave you with sort of
[01:12:01] of this but I'll leave you with sort of two things one is um sort of a thought
[01:12:04] two things one is um sort of a thought question which is is the optimal policy
[01:12:05] question which is is the optimal policy stationary what that means is
[01:12:07] stationary what that means is independent of time steps and the finite
[01:12:09] independent of time steps and the finite Horizon tasks and we'll we'll explore
[01:12:11] Horizon tasks and we'll we'll explore this issue to on the
[01:12:14] this issue to on the homework
[01:12:15] homework and I also just want to S of refresh
[01:12:18] and I also just want to S of refresh some terminology so in the context of
[01:12:20] Markov decision processes and
[01:12:21] markof decision processes and reinforcement learning when we say
[01:12:23] reinforcement learning when we say models what we normally mean is a
[01:12:24] models what we normally mean is a mathematical model of the Dynamics and
[01:12:26] mathematical model of the Dynamics and reward policy is a function mapping from
[01:12:29] reward policy is a function mapping from states to actions that can be
[01:12:30] states to actions that can be deterministic or stochastic and the
[01:12:32] deterministic or stochastic and the value function is this expected
[01:12:33] value function is this expected discounted sum of rewards from starting
[01:12:35] discounted sum of rewards from starting in a state and following a
[01:12:38] in a state and following a policy the things that should be clear
[01:12:40] to you is you should be able to understand
[01:12:41] what a Markov process is a Markov
[01:12:43] reward process is an MDP the Bellman
[01:12:46] reward process is an mdp the Bellman operator contraction model Q value and
[01:12:48] operator contraction model Q value and policy and you should be able to
[01:12:49] policy and you should be able to implement both value iteration and
[01:12:51] implement both value iteration and policy iteration and you'll have
[01:12:52] policy iteration and you'll have practice doing that on the homework you
[01:12:54] practice doing that on the homework you should also understand some of this
[01:12:55] should also understand some of this strengths and weaknesses in terms of
[01:12:56] strengths and weaknesses in terms of what we've discussed in terms of the
[01:12:57] what we've discussed in terms of the computational complexity of some of
[01:12:59] computational complexity of some of these different operations and be able
[01:13:01] these different operations and be able to prove contraction properties about
[01:13:03] to prove contraction properties about these as well as sort of understand like
[01:13:06] these as well as sort of understand like which of these are really leveraging the
[01:13:07] Markov assumption versus which of them
[01:13:09] Markoff assumption versus which of them don't require that so next week we'll
[01:13:12] don't require that so next week we'll continue to talk about this and we'll
[01:13:14] continue to talk about this and we'll start to talk about uh function
[01:13:15] start to talk about uh function approximation and how we can also learn
[01:13:17] approximation and how we can also learn when we don't know what these models are
[01:13:19] when we don't know what these models are I'll see you then thanks
Lecture 003
Stanford CS234 Reinforcement Learning I Policy Evaluation I 2024 I Lecture 3
Source: https://www.youtube.com/watch?v=jjq51TRNVvk
---
Transcript
[00:00:05] hey everybody welcome back we're going
[00:00:07] hey everybody welcome back we're going to get started with a refresh your
[00:00:08] to get started with a refresh your understanding poll you can go to Ed and
[00:00:11] understanding poll you can go to Ed and to see all the polls for today um just
[00:00:14] to see all the polls for today um just remember to log in first so that we can
[00:00:15] remember to log in first so that we can log it for participation points the two
[00:00:18] log it for participation points the two questions ask you to think about what we
[00:00:20] questions ask you to think about what we talked about last time in terms of
[00:00:21] Markov decision processes um and what
[00:00:24] Markoff decision processes um and what sort of guarantees or type of properties
[00:00:26] sort of guarantees or type of properties they have
[00:00:47] yeah you you what
[00:00:51] yeah you you what the again yeah great question tabular
[00:00:54] the again yeah great question tabular mdp is where we can write down what the
[00:00:56] mdp is where we can write down what the value is of a state as a table so like
[00:00:58] you can just have one entry for what the
[00:01:00] you can just have one end for what the value is for each
[00:01:02] state this is in contrast to like neural
[00:01:04] state this is in contract to like neural networks or things like that where you
[00:01:06] networks or things like that where you don't have one parameter per state
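To make "tabular" concrete, a trivial sketch (sizes made up): the value function is literally stored as one entry per state, so reading or writing a state's value is just an index lookup.

```python
import numpy as np

n_states = 5
V = np.zeros(n_states)  # tabular value function: one stored number per state
V[2] = 1.7              # updating one state touches only that entry
```

With a neural network, by contrast, all states' values are tied together through shared parameters, so there is no one-entry-per-state table.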
[00:01:43] okay we'll take like another one or two
[00:01:45] okay we'll take like another one or two minutes we have a good amount of
[00:01:46] minutes we have a good amount of controversy on these questions so I'll
[00:01:49] controversy on these questions so I'll see how this converges
[00:02:24] I remember seeing this a to the power s
[00:02:26] number in a previous context was it
[00:02:28] number in in a previous context was it in the context of policy
[00:02:32] yeah does someone want to remember um
[00:02:34] yeah does someone want to remember um why is a to the S important I remember
[00:02:37] why is a to the S important I remember why it is yeah remember your
[00:02:39] why it is yeah remember your name is it like the number of total
[00:02:42] name is it like the number of total possible policies exactly right exactly
[00:02:45] possible policies exactly right exactly right so there's at most a to the S
[00:02:46] right so there's at most a to the S potential um uh deterministic policies
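That A-to-the-S count is easy to check by enumeration (a sketch with made-up sizes): a deterministic policy is one choice of action per state, so the policies are exactly the tuples in the |A|^|S|-element product.

```python
from itertools import product

n_states, n_actions = 3, 2
# each deterministic policy is a tuple (action for state 0, state 1, state 2)
policies = list(product(range(n_actions), repeat=n_states))
assert len(policies) == n_actions ** n_states  # 2^3 = 8 policies
```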
[00:03:00] all right so most people selected the
[00:03:02] all right so most people selected the correct answer for the first one um
[00:03:04] correct answer for the first one um which is this is true so asymptotically
[00:03:07] which is this is true so asymptotically this
[00:03:08] this will value iteration and policy
[00:03:10] will value iteration and policy iteration are correct in the tabular
[00:03:14] iteration are correct in the tabular discrete Markov decision process case
[00:03:17] discrete Markov decision process case and they will asymptotically both
[00:03:18] and they will asymptotically both converge and compute the right value
[00:03:20] converge and compute the right value function the second one looks like it's
[00:03:22] function the second one looks like it's pretty evenly split so I'd like you to
[00:03:25] pretty evenly split so I'd like you to turn to someone near you and um and
[00:03:27] turn to someone near you and um and argue for what you said for the second
[00:03:29] argue for what you said for the second one
[00:05:16] all right so the answer is true um and about
[00:05:19] all so the answer is true um and about half of you said that do someone get
[00:05:20] half of you said that do someone get want to tell me why this is true yeah is
[00:05:22] want to tell me why this is true yeah is it yeah um I'm not sure if this is
[00:05:25] it yeah um I'm not sure if this is correct but I just I think the value
[00:05:27] correct but I just I think the value iteration might not be um um guaranteed
[00:05:30] iteration might not be um um guaranteed to converge so that's if it's not
[00:05:32] to converge so that's if it's not guaranteed to converge it could just be
[00:05:33] guaranteed to converge it could just be unbounded so that certainly would be the
[00:05:35] unbounded so that certainly would be the case but fortunately it is guaranteed to
[00:05:37] case but fortunately it is guaranteed to converge if gamma is less than one um so
[00:05:40] converge if gamma is less than one um so but you're correct that it that it um
[00:05:42] but you're correct that it that it um Can require more iterations does anybody
[00:05:44] Can require more iterations does anybody have an and remind me your
[00:05:46] have an and remind me your name so my thinking was that uh in uh
[00:05:51] like in policy iteration we know that in
[00:05:52] like in policy itation we know that in each step we're going to improve to a
[00:05:54] each step we're going to improve to a new policy but um in each value step you
[00:05:57] new policy but um in each value step you might not reach a new policy you might
[00:05:59] take multiple steps to reach a new
[00:06:02] policy that's right so um in policy
[00:06:05] policy that's right so um in policy iteration um and I talked to some people
[00:06:07] iteration um and I talked to some people about this too there can only be a to
[00:06:09] about this too there can only be a to the S um so in policy iteration there's
[00:06:11] the S um so in policy iteration there's at most a to the S because you only go
[00:06:13] at most a to the S because you only go through each policy once but for Value
[00:06:15] through each policy once but for Value iteration it can be more and I'll give
[00:06:17] iteration it can be more and I'll give you an example um so and one way I would
[00:06:21] you an example um so and one way I would think in general if you see this sort of
[00:06:22] think in general if you see this sort of question is to think about well can I
[00:06:23] question is to think about well can I can come up with a counter example to
[00:06:25] can come up with a counter example to say where this would be different so
[00:06:28] consider a really silly Markov
[00:06:30] decision process where there's just one
[00:06:31] decision process where there's just one state in one action so if there's one
[00:06:33] state in one action so if there's one state and one action there is literally
[00:06:35] state and one action there is literally one policy you can only do one thing and
[00:06:37] one policy you can only do one thing and there's only one state to do it in um so
[00:06:39] there's only one state to do it in um so that means that policy iteration is
[00:06:40] that means that policy iteration is going to take one round but for what
[00:06:43] going to take one round but for what value um iteration is going to do is
[00:06:45] value um iteration is going to do is it's going to keep going until the value
[00:06:46] it's going to keep going until the value function stops changing or stops
[00:06:48] function stops changing or stops changing within like a very small amount
[00:06:52] And so what would happen in value iteration, and feel free to go back to your notes from last time, is we would start off, and let's say the reward is 1 and gamma is 0.9, and we initialize the value function to zero. Then if you use a geometric series, and if you haven't seen that before just come chat with me about it, I'm happy to explain, the value of this state is 1 / (1 - gamma), because you get 1 + gamma*1 + gamma^2*1 + ..., since you're always staying in that state and always taking that action, day in and day out, forever. So the actual value function you eventually get is 1 / (1 - gamma), or about 10. But after the first iteration of value iteration, V_1(s) is just 1, and 1 is not that close to 10, so we haven't converged yet, and we'll have to continue to do a bunch of iterations of value iteration, whereas at that point policy iteration would stop, because it would just evaluate the value of the one policy we have, which is to take that one action, and you'd be done.

[00:07:58] And I bring this up just to illustrate that even though both of those algorithms are guaranteed to converge to the right thing eventually, they can have quite different behavior in the short term. Did you have a question? I was just going to ask, is it only converging in the limit, or are all of them only converging in the limit? Yeah, all of them are converging asymptotically, as you do this over and over again. Good question.
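The one-state, one-action example can be checked numerically. This sketch is illustrative rather than course code, and the 1e-4 tolerance is an arbitrary stand-in for "stops changing within a very small amount":

```python
# One state, one action: reward 1, gamma = 0.9, true value 1/(1-0.9) = 10.
gamma, r, eps = 0.9, 1.0, 1e-4

# Value iteration: V <- r + gamma * V, starting from V = 0.
V, iters = 0.0, 0
while True:
    V_new = r + gamma * V
    iters += 1
    if abs(V_new - V) < eps:
        V = V_new
        break
    V = V_new

# Policy iteration would stop after a single round on this MDP: with one
# state and one action there is only one policy, so it can never change.
print(iters, round(V, 2))  # value iteration needs ~90 sweeps to get near 10
```

The gap shrinks by a factor of gamma each sweep, which is why roughly log(eps)/log(gamma) iterations are needed.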
[00:08:24] question great well welcome back um uh
[00:08:26] question great well welcome back um uh if you just came in feel free to go
[00:08:28] if you just came in feel free to go through the questions later what we're
[00:08:29] through the questions later what we're going to be doing today is to kind of
[00:08:31] going to be doing today is to kind of continue in this more simple setting
[00:08:33] continue in this more simple setting where we don't do any function
[00:08:34] where we don't do any function approximation um but we are now going to
[00:08:37] approximation um but we are now going to think about the fact where we don't have
[00:08:38] think about the fact where we don't have models of the world and what I mean by
[00:08:41] models of the world and what I mean by that is that we're not given a Dynamics
[00:08:42] that is that we're not given a Dynamics model and we're not given a reward model
[00:08:44] model and we're not given a reward model our agent just has to try things out in
[00:08:46] our agent just has to try things out in the world to learn how good they are and
[00:08:48] the world to learn how good they are and we're going to start with model free
[00:08:50] we're going to start with model free policy evaluation um in the case where
[00:08:53] policy evaluation um in the case where we still have a small enough number of
[00:08:55] we still have a small enough number of contexts or states that we could write
[00:08:57] contexts or states that we could write down a value for every single one of of
[00:08:59] down a value for every single one of of them separately so that's why we call it
[00:09:03] them separately so that's why we call it tabular um I'll just say one thing in
[00:09:05] Office hours are on the website, and we'll try to keep that calendar updated; it's just a Google Calendar, and it'll include the location. If you go to QueueStatus for the 1:1 office hours, we'll make sure the Zoom link is either on that or on the website. I'll have office hours starting this week on Thursdays, and mine are for project and conceptual questions, so feel free to come ask me about anything in class. I won't be going through code, but you can talk to the TAs about that, or you can come in and brainstorm about projects or general questions about reinforcement learning.

[00:09:38] I'm sure some of you are starting to think about projects, and in general some people are asking what's in scope. I put something on Ed about that, but in general it can be a new idea for reinforcement learning, a new application, something you're doing if you're already doing research in reinforcement learning, or replicating part of an existing paper. That last one is really helpful, actually, helpful for the whole community, because there's a lot of work going on, and people are making different choices about hyperparameters and seeds and things like that, so it really is very valuable to see what things we can replicate. Does anybody have any questions about that or other logistics of the class before we get going?
[00:10:21] going all right so let's get into policy
[00:10:24] going all right so let's get into policy evaluation um so as I said what we're
[00:10:27] evaluation um so as I said what we're going to be doing today is to think
[00:10:29] going to be doing today is to think about how do we learn through direct
[00:10:31] about how do we learn through direct experience how good decisions are and
[00:10:33] experience how good decisions are and we're going to assume that we have a
[00:10:34] we're going to assume that we have a fixed policy so again like our boss says
[00:10:37] fixed policy so again like our boss says how good is this way of you know
[00:10:38] how good is this way of you know advertising to customers or U maybe
[00:10:41] advertising to customers or U maybe you're in a setting where you're trying
[00:10:42] you're in a setting where you're trying to see how good is the patient outcomes
[00:10:44] to see how good is the patient outcomes from the current
[00:10:45] from the current protocol and the idea is that we're only
[00:10:47] protocol and the idea is that we're only going to be using data from the
[00:10:49] going to be using data from the environment so um and today this
[00:10:52] environment so um and today this experience is going to come from
[00:10:52] experience is going to come from executing that policy let me just move
[00:10:54] executing that policy let me just move this up so you can see a bit
[00:10:57] this up so you can see a bit better okay today we're we going to
[00:10:59] better okay today we're we going to assume that when we get this data it's
[00:11:01] assume that when we get this data it's from directly executing a particular
[00:11:03] from directly executing a particular policy later we'll think we'll talk
[00:11:05] policy later we'll think we'll talk about other sort of relaxations to
[00:11:08] about other sort of relaxations to this and so I'm going to try to motivate
[00:11:10] this and so I'm going to try to motivate today why this is a useful thing to do
[00:11:13] today why this is a useful thing to do um and what sort of properties we would
[00:11:15] um and what sort of properties we would want to try to compare different
[00:11:17] want to try to compare different algorithms so it will turn out that this
[00:11:19] algorithms so it will turn out that this type of thing comes up all the time it
[00:11:21] type of thing comes up all the time it comes up when we actually want to make
[00:11:22] comes up when we actually want to make decisions and learn better policies um
[00:11:25] decisions and learn better policies um and it's going to be an important part
[00:11:27] and it's going to be an important part of much more complicated algorithm
[00:11:29] of much more complicated algorithm like deep Q learning and policy gradient
[00:11:32] like deep Q learning and policy gradient and other things where we want to sort
[00:11:33] and other things where we want to sort of be able to see how good are current
[00:11:35] of be able to see how good are current things so that then we can take gradient
[00:11:37] things so that then we can take gradient steps or improve our
[00:11:41] policy okay so this is what we're going
[00:11:43] policy okay so this is what we're going to try to cover today we're going to
[00:11:45] to try to cover today we're going to cover multicolor policy evaluation
[00:11:47] cover multicolor policy evaluation temporal difference learning certainty
[00:11:49] temporal difference learning certainty equivalence and batch policy evaluation
[00:11:52] equivalence and batch policy evaluation and maybe raise your hand if you've seen
[00:11:53] and maybe raise your hand if you've seen temporal difference learning
[00:11:56] temporal difference learning before okay raise your hand if you've
[00:11:58] before okay raise your hand if you've seen Q learning
[00:12:00] seen Q learning okay great Q learning is the control
[00:12:02] okay great Q learning is the control version basically of temporal difference
[00:12:03] version basically of temporal difference learning so you'll see a lot of
[00:12:05] learning so you'll see a lot of similarities there okay all right before
[00:12:08] All right, before we dive into this, I just want to recall a couple of definitions. We're going to use G for the return, which means: from this particular state, what is the discounted sum of rewards we get for a particular episode? The state value function says, on average, what is that return we get, and the state-action value says: if I start in this state, take this action, and then follow this policy, what is the expected discounted sum of rewards?
[00:12:39] We saw last week that we might want to do dynamic programming for policy evaluation when we do have access to the models. Again, what I mean by that is that someone gives you a function for the reward and a function for the dynamics model. And we saw we could do this sort of Bellman-like backup for a particular policy. This was different than the Bellman equation because there's no max: we're not trying to take a max over different actions, we're just taking whatever action is specified by the policy. The equation here is for a deterministic policy; otherwise we need some additional averaging over all the actions that could be taken by that policy.

[00:13:18] And just to remind ourselves here: before we converge, this V^pi_k is an estimate of the value of the policy. It's not the actual value yet, it's just an estimate, and it's hopefully improving as we do more iterations.

[00:13:36] Another good thing to remind ourselves of is what we're doing in this equation with the expected discounted sum of rewards: we are plugging in this term as an estimate of the expected discounted rewards for the future. We're saying: we've got this estimate of the value function, we're going to plug it in and say my return is my immediate reward plus my discounted sum of future rewards, and this is what I'm using for my discounted sum of future rewards. This is known as bootstrapping, because we're plugging in one estimate in order to help us do another estimate. We'll see a picture of this graphically later.
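To make the bootstrapped backup concrete, here is a minimal sketch of dynamic-programming policy evaluation for a deterministic policy. The 3-state, 1-action model (the P and R arrays) is made up for illustration, not taken from the lecture:

```python
import numpy as np

# Backup: V_{k+1}(s) = R(s, pi(s)) + gamma * sum_{s'} P(s'|s, pi(s)) * V_k(s')
gamma = 0.9
pi = [0, 0, 0]                       # deterministic policy: action per state
R = np.array([[1.0], [0.0], [0.5]])  # R[s, a] (hypothetical rewards)
P = np.array([                       # P[s, a, s'] (hypothetical dynamics)
    [[0.8, 0.2, 0.0]],
    [[0.0, 0.5, 0.5]],
    [[0.1, 0.0, 0.9]],
])

V = np.zeros(3)
for _ in range(1000):
    # Bootstrapping: plug the current estimate V in for the future rewards.
    V_new = np.array([R[s, pi[s]] + gamma * P[s, pi[s]] @ V for s in range(3)])
    if np.max(np.abs(V_new - V)) < 1e-8:  # stop once V barely changes
        V = V_new
        break
    V = V_new
```

Because the backup is a gamma-contraction, the successive estimates converge to the unique fixed point V^pi.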
[00:14:19] Right. Monte Carlo policy evaluation is a really simple idea, but it's very useful and it is commonly done. Essentially, the idea with Monte Carlo policy evaluation is that we are just going to simulate, or act in the real world. This is just saying you've got a policy, which means you know what action to take in every state. Today we'll mostly focus on deterministic policies, just to make it easier, so I'll just say that: for most of today, assume pi is deterministic. All these ideas can easily be extended; it's just easier to write things down without having to do an expectation over actions everywhere.

[00:15:05] Okay, so what's the idea in this case? Well, the value function is just an average over the returns: it's an expectation over the trajectories, or the returns, you could get by following the policy. And therefore the value is just the mean of returns, and we know how to approximate means: we just do things a bunch of times and then we average.
[00:15:31] As an example of this, someone might say: I want to know, if we give a particular set of patient treatments, and maybe those treatments take a year, for example, on average how good is that? Well, what you could do is have a hundred patients go through that particular protocol for a year and then average their outcomes. That would be an example of Monte Carlo policy evaluation: you just execute the policy for many different episodes and then you average.

[00:16:02] Okay, and one thing to note here is that you can have cases where not all the trajectories are the same length. Imagine, in the patient case I just gave, you might have some people drop out of a trial during the year, or maybe they finish their treatment successfully and so then they're also done. So the trajectories may not all be the same length, but essentially you can just think of it as having many, many trajectories: maybe this one has a return of 10, this one had a return of 5, this one has a return of 10. You just average over all of them, and that is your value function.
[00:16:42] Okay, now one property of this, and we'll talk about whether it's a benefit or a drawback, is that when we do this we're making no assumption that the system is a Markov decision process. It's just averaging, so your system might not be Markov. What I mean by that is that in general you will have a finite set of features to describe the state. If we think about that patient example I just had, maybe you have different vitals of the patient, maybe you have static demographic variables, but we do not have all the features that are probably going to describe how someone's going to react to a set of treatments. And because of that, you may or may not think that, in the features you have access to, the system is Markov. But this doesn't require the state to be Markov: it's just averaging. You just roll out your policy many times and you average.
[00:17:35] Now, a really important thing here is that this can only be applied to episodic MDPs. What do I mean by that? I mean your episode has to end in order for you to see what the total return was. So if you have horizon lengths, or episodes, that last for a year, that's okay; it's a little bit slow, but you could do that. But if you want to evaluate how good a policy is when you're just going to act forever and never stop, this wouldn't work, not without some additional approximations.
[00:18:07] Does anybody have any questions about either of these two things, about it not assuming the state is Markov? Yeah? You were talking about medical treatments, so does this only work if the treatment lasts the same amount of time for every patient, like six months? Because if they have different lengths, how can that be episodic? Great question: can it be episodic if the episodes are different lengths? It could be. It could be that you have a fixed policy, and maybe that policy says if someone doesn't respond to this type of treatment we do this additional type of treatment, and in fact that's very common. As long as the episode is guaranteed to end, like you know that treatment can only last a year total, you can still average over all those outcomes; you just sum the return for those different ones.

[00:18:55] Yeah? For each of these trajectories, are we supposed to begin with a different state, or can we actually start with the same state? Great question. If we want to get a value function for all states, we need to see all the states inside of these trajectories, and we'll talk about how we estimate these in a second, so I think this will be answered.
[00:19:16] this will be answered okay so for example how might
[00:19:18] answered okay so for example how might we compute this so we would like to get
[00:19:21] we compute this so we would like to get the value for all the states that are
[00:19:23] the value for all the states that are reachable inside of your policy so what
[00:19:26] reachable inside of your policy so what you could do is we can initialize two
[00:19:28] you could do is we can initialize two different variables One n of s is just
[00:19:30] different variables One n of s is just going to be the counts the number of
[00:19:32] going to be the counts the number of times that we've updated our estimate
[00:19:34] times that we've updated our estimate for State s g ofs here is going to start
[00:19:38] for State s g ofs here is going to start with zero which is we've never seen any
[00:19:40] with zero which is we've never seen any returns from this
[00:19:41] returns from this state so what every visit Monte Carlo
[00:19:44] So what first-visit Monte Carlo does is it samples an episode. This goes up to some time step Ti, which has to be finite but can be different on different episodes. Then we compute the discounted sum of rewards for that episode: starting at time step t, how much reward do we get until the end of the episode? Then, for every time step until the end, we check whether this is the first time that state has been visited in the episode; if so, we update the total number of times we've visited that state, increment its total return, and average.

[00:20:24] So the algorithm just steps along. For a lot of today I've moved the worked examples to the end of the slides, but if you want to go through them later I encourage you to. For example, in the case of the Mars rover, you might imagine you start in a state like s3, and on that particular episode you get a reward of 1, so you would average in a return of 1 for starting in state s3.
[00:21:00] So this is first-visit Monte Carlo evaluation, which means you update a state at most once in each episode. If you had something like s1 went to s2 went to s3 went to s2 went to s3, and so on, you would update s1 once in this trajectory, and you would update s2 once and s3 once, even though you in fact visit those states multiple times: you only update them the first time you visit.
[00:21:32] [Student] Could this have the problem where, if a state is really rare or uncommon, we get a really bad estimate, or we just never visit it at all?

[00:21:42] If you don't ever visit a state under the policy, that's okay in the sense that you just don't have a value for it; it's kind of undefined, because you would never reach it. And as the question says, if there's a state that's really rare for you to reach under your policy, it might take a lot of trajectories to get a good estimate of it. Maybe there's some rare side effect of a treatment plan, and it's going to take a lot of trajectories to observe it. That was one of the challenges with the covid vaccine: the trials involved a finite number of people, a pretty large number but still finite, and some side effects don't show up until you get many, many more. That's true for treatments generally: even if on average they're totally fine, you won't see some of the rare side effects until you have an enormous number of trajectories. Now, for the covid vaccine the benefits certainly far outweighed the side effects, but my point is just to highlight that, depending on how frequently you see the states, it will take you more or fewer total episodes to observe them.
[00:22:43] [Student] So for this algorithm it kind of doesn't matter that I saw s2 again or s3 again in this trajectory?

[00:22:50] No. It probably still affects the reward; it's just that you don't use that data. So an alternative is called every-visit Monte Carlo, where every time you see the state in that trajectory, you update it. As you might imagine, if it's a really long trajectory and you see s2 many times, then you would update for all of those.

[00:23:13] I'm just going to show you three common variants. There's a worked example you can go through later if you want: imagine this is the Mars rover, the rewards are on either side, and for a particular trajectory you can compute the first-visit and every-visit Monte Carlo estimates.
[00:23:31] Both of those are totally reasonable things to do. Perhaps more common is what's known as incremental Monte Carlo, and this does roughly what you would expect: you maintain a running estimate of the value of a particular state under the policy, and you smoothly update it as you get more data. So you keep track of the number of times N(s) you have visited that state, and then you weight your old estimate by (N(s) - 1)/N(s) and add the new return you just observed divided by N(s). You're constantly updating your value function for this state as you get more data. And for those of you who have done machine learning, which is probably most of you, this should look pretty familiar: this is kind of like a learning rate, this is your updated value, and this is your old value.
[00:24:30] And in fact that's what we're going to see here. In general it doesn't have to be 1/N(s); it can be any alpha. Alpha here is just a learning rate, and we're smoothly updating our estimate of the value function for a particular state. We'll see lots and lots of algorithms like that, similar in part to what you saw in machine learning. The key thing here is the estimate we're using, which we often also call the target: it is one sample of the return starting in this state until the end of the episode, and we mix it with our old estimate.
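The incremental update V(s) ← V(s) + α (G_t − V(s)) can be sketched as follows (a simplified version under my own assumptions about the episode format; with α = 1/N(s) it reproduces the running-average form above):

```python
from collections import defaultdict

def incremental_mc(sample_episode, gamma, num_episodes, alpha=None):
    """Incremental Monte Carlo policy evaluation.

    If alpha is None, use the running-average step size 1/N(s);
    otherwise use the given constant learning rate.
    """
    N = defaultdict(int)
    V = defaultdict(float)
    for _ in range(num_episodes):
        episode = sample_episode()  # list of (state, reward) pairs
        G = 0.0
        for state, reward in reversed(episode):
            G = reward + gamma * G  # sampled return from this step onward
            N[state] += 1
            step = (1.0 / N[state]) if alpha is None else alpha
            # Move the estimate toward the sampled return (the "target").
            V[state] += step * (G - V[state])
    return dict(V)
```

With α = 1/N(s) this exactly matches the every-visit average; a constant α instead weights recent episodes more, which is useful when the environment changes.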
[00:25:21] I think it's helpful to think a little bit about what this looks like pictorially, and this also relates a lot to what we'll see when we talk about things like AlphaGo. It will look somewhat familiar if you've seen things like minimax trees or expectimax trees. Who here has seen expectimax trees before? Okay, maybe one, so most people have not; then this might be a useful representation. One way to think about this is that we're trying to figure out the value of starting in a certain state. We know what action we're going to take, because we've got a fixed policy: we start in this state s and take the action a prescribed by our policy, so pi(s) = a. And after we do that, because the world might be stochastic, there are a bunch of next states we could reach, with probability P(s' | s, a).
[00:26:23] What we're trying to do in policy evaluation is take an expectation over all the potential futures we might end up in by following this policy. Maybe in some cases there are really good patient outcomes, hopefully most of the time, and maybe sometimes there are less good patient outcomes, and we want an expectation over all of this. We can think of that as a tree: we start in a state, we take an action, we look at the branching factor over all possible next states, and then we repeat.

[00:26:52] So in this policy evaluation diagram, for each state we know the next action we take, and then we branch again over states, each of these branches being some s'. We can think of this tree of possibilities going out, but there's no branching on the actions, because the actions are fixed by our policy.

[00:27:22] If we go all the way out and then want to figure out the value function, you can think of this as an expensive way to do dynamic programming: you would take an expectation over the states at each level and propagate those values back up to the root.
[00:27:38] If you don't find this a useful conceptual picture, that's fine; I just think it can be helpful for thinking about what these different algorithms are doing in terms of approximations. This tree expectation is what we would like to compute in order to get V^pi(s), but we want to do it in a much more computationally efficient, and also sample efficient, way. So what Monte Carlo policy evaluation does, in the equation here, is approximate each of these full expectations by a sample: it updates the value estimate using a sample of the return to approximate the expectation.

[00:28:29] And we do this many times; we average over many such returns. It's kind of like saying: you have this enormous branching tree, and you could do the expectation over all of it explicitly, from the leaves back up to the root, or you could just sample many times, and that will also approximate the tree. The more samples you get, the better that approximation of the tree will be.
[00:28:48] This type of idea has been used in many different kinds of algorithms. There's some really nice work from the mid-2000s, like by Michael Kearns and others, and then similar ideas were really the foundation that led to some of the advances in Monte Carlo tree search, which went into AlphaGo. This is what Monte Carlo tree search is doing. Notice it's not doing any form of bootstrapping; there's no dynamic programming going on here. It's just rolling out, and it is using this sample as an approximation of the expectation.
[00:29:30] Okay, so that's how Monte Carlo policy evaluation works. One natural question is: how good is that estimate? We're going to see lots of different ways and lots of algorithms for doing policy evaluation, so you might now ask, well, how do I pick among them? What are the properties I should think about?

[00:29:48] One pretty basic property you might want is consistency, which means that as you get more and more data, your estimate actually converges to the true value of the policy for all the states. This is something you probably want in many cases, because otherwise it means that even with infinite data your estimate is still going to be wrong. As we start to think about more complicated settings we might have to be satisfied with less than this, but here, right now, where we can just write down the value of every state as an entry in a table, we should be able to get consistency.
[00:30:28] A second thing we might want is computational efficiency: we'd like this not to be too expensive to compute, and we'd like it not to require too much memory. And we'd like statistical efficiency, which is essentially how the accuracy of the estimate changes with the amount of data; more formally, we'd like to know how quickly these things converge as you get more and more data. And then in reality we often care about empirical accuracy: just what is the mean squared error of our estimators?
[00:31:02] So how good is Monte Carlo? Let's first quickly remind ourselves of the definitions. If we have an estimator theta-hat, which we're going to think of as our value function approximation, its bias is the difference between its expectation and the true value: Bias(theta-hat) = E[theta-hat] - theta. The variance of an estimator is the expected squared difference between the estimator and its expectation: Var(theta-hat) = E[(theta-hat - E[theta-hat])^2]. And the mean squared error is the variance plus the bias squared: MSE = Var(theta-hat) + Bias(theta-hat)^2.
[00:31:37] Generally we would like an estimator that has low mean squared error, which means we want it to have low or zero bias and low variance. Something to think about, if you're less familiar with these: if an estimator is unbiased, is it consistent? It is not necessarily consistent, just so you know. What we would like for consistency is that, asymptotically, where n is the amount of data we're using to construct the estimator, the probability that our estimate differs from the true value by more than epsilon has to go to zero as n goes to infinity.
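As a tiny illustration of that point (my own toy example, not from the lecture): an estimator that just returns the first observation is unbiased for the mean, but it is not consistent, because its error does not shrink as you collect more data, unlike the sample mean.

```python
import random

def avg_squared_error(estimator, true_mean, sample_size, num_trials, rng):
    """Average squared error of an estimator over repeated experiments."""
    err = 0.0
    for _ in range(num_trials):
        data = [rng.gauss(true_mean, 1.0) for _ in range(sample_size)]
        err += (estimator(data) - true_mean) ** 2
    return err / num_trials

rng = random.Random(0)
# Unbiased but inconsistent: always use only the first observation.
first_only = lambda d: d[0]
# Unbiased and consistent: the sample mean.
sample_mean = lambda d: sum(d) / len(d)

err_first = avg_squared_error(first_only, 0.0, 400, 300, rng)
err_mean = avg_squared_error(sample_mean, 0.0, 400, 300, rng)
# err_first stays near the single-sample variance (about 1.0), while
# err_mean shrinks like 1/400, no matter how large the sample gets.
```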
[00:32:21] So we would like it to be consistent. So how does Monte Carlo fare on these properties? Well, first-visit is unbiased: it's an unbiased estimator of the value of the policy, and by the law of large numbers, as the amount of data goes to infinity per state (so if you have really rare states, you're still going to need a lot of samples to estimate them), you'll converge. So it's consistent and it's unbiased.

[00:32:50] Every-visit Monte Carlo is biased. One way to think about that: in the first-visit case, all your data is IID, independent and identically distributed. In the every-visit case, imagine you visit state s2 and then, say, four steps later you visit s2 again. Their returns are going to be correlated, because they're both in the same trajectory, so they're not IID anymore. That's just some intuition for why it might be biased. But it's also consistent, and it often has better mean squared error, because you get to use more of your data inside a single trajectory to do more updates. And then incremental Monte Carlo methods depend on the learning rate, as you might expect.
[00:33:36] So let's see that here. Imagine that we have our alpha parameter, our learning rate, which is trading off between our new estimate and our old estimate. It can actually change per time step, so just as you can generally decay your learning rate, you can change your learning rate here. And if your learning rate is such that, for a particular state, the sum of all its values goes to infinity but the sum of its squares is finite, then you will converge to the true value.
[00:34:07] And again, these are pretty common types of criteria. We'll see for some of the algorithms that, under these sorts of smoothness conditions on the learning rates, we'll have some decent properties.

[00:34:20] [Student] If those conditions aren't met, do you definitely not have a guarantee, or are there other conditions that can give you a guarantee?

[00:34:31] Great question. So the question is whether these conditions are required. They are sufficient; they aren't always necessary. A lot of that will depend on the particular problem domain, too, like what the dynamics and the reward are. To my knowledge I'm not sure if there are other really general conditions like that, but there might be for specific problem classes. It's a good question.
[00:34:59] Now, one of the problems with this is that in general it's a pretty high-variance estimator. Certainly for first-visit Monte Carlo, you're only updating a state at most once per episode, so it can take a long time. You can imagine that if you have very different outcomes from the same starting state (maybe most of the time you have pretty average outcomes, but one in 100 times you have a really bad outcome), it's going to take a long time for that estimator to converge. So in general this is a pretty high-variance estimator, even though it is often unbiased and it is consistent. The other big requirement is that it needs episodic settings: you have to wait until the end of the episode to update your estimate. For now that might not seem that bad, but when we start getting into control and decision-making, you might want to use the data you already have within that episode to change the behavior of the agent.
[00:35:59] You can imagine something like self-driving cars: you're already getting some evidence that the car is not working as expected within a single episode that might be really long, and you might want to use that information to change how you're steering, for example.
[00:36:15] change how your steering for example okay all right so just to
[00:36:18] example okay all right so just to summarize here what it does is it's um
[00:36:20] summarize here what it does is it's um it's not using the Markoff process it's
[00:36:22] it's not using the Markoff process it's updating your value function estimate
[00:36:24] updating your value function estimate using a sample of the return to
[00:36:26] using a sample of the return to approximate the expectation and under
[00:36:28] approximate the expectation and under some pretty mild conditions it converges
[00:36:30] some pretty mild conditions it converges to the true value of the
[00:36:33] to the true value of the state and in some cases it will turn out
[00:36:36] state and in some cases it will turn out that even if you actually know the true
[00:36:38] that even if you actually know the true Dynamics model and reward you might
[00:36:39] Dynamics model and reward you might still want to do
[00:36:42] this and I think one thing that's useful
[00:36:44] this and I think one thing that's useful to think about here is um yeah systems
[00:36:46] to think about here is um yeah systems which you think the markof property
[00:36:48] which you think the markof property might be violated at least with the
[00:36:50] might be violated at least with the features that you'd be using to
[00:36:52] features that you'd be using to represent the
[00:36:55] All right, now let's go on to temporal difference learning, and this is again related to Q-learning, which we'll get to in the next lecture. So Sutton and Barto, which is a textbook that is an optional one for... oh, yes, a question: if we don't know the reward model, how do we calculate the rewards for the trajectory?
[00:37:19] Ah, great question. So the assumption here is that you're in a real setting where you can sample these, as if from an oracle: something in the real world is giving them to you. You may not have an explicit representation of the reward model, but you can get samples of it: your customer buys something or they don't, or you observe a side effect. So you don't necessarily have a parametric model, but you are getting real rewards. It's a good question. Anybody else have any other questions about Monte Carlo before we go on to temporal difference learning?
[00:37:50] I'm going to call it just temporal difference learning for now, and then I'll specify that it's actually TD(0) for most of what I'm going to talk about; I'll mostly discuss TD(0), and I'll say what I mean by the zero shortly. So Sutton and Barto, which is one of the optional textbooks for the class, say that if one had to identify one idea as central and novel to reinforcement learning, it would undoubtedly be temporal difference learning.
[00:38:19] And their point is that it really is a way to construct estimators, both for control and for policy evaluation. The idea is, if we think back to that tree I showed you (and I'll show you some more), there's going to be a way to combine the idea of sampling to approximate expectations with bootstrapping to approximate future returns; we'll see that in a second. It is model-free, meaning you don't need a parametric representation of the reward function or the dynamics model, and the nice thing is that you can use it in episodic settings or in infinite-horizon discounted settings: you just set your robot off, and it's just going to have to learn to act forever.
[00:39:01] just going to have to learn to act forever okay and one of the key ideas is
[00:39:04] forever okay and one of the key ideas is that we're going to update our estimates
[00:39:05] that we're going to update our estimates of the value of a state immediately so
[00:39:07] of the value of a state immediately so I'll put Pi here because we're still
[00:39:09] I'll put Pi here because we're still talking about a policy after every
[00:39:12] talking about a policy after every single Tuple of State action reward next
[00:39:15] single Tuple of State action reward next state all right so let's see how that
[00:39:18] state all right so let's see how that works so again remember our goal is just
[00:39:21] works so again remember our goal is just to compute the expected discounted sum
[00:39:23] to compute the expected discounted sum of rewards for a particular
[00:39:25] Now let's think back to the Bellman operator. If we know the MDP models and we have a particular policy, we can write the Bellman backup operator like that. And what we were doing in incremental every-visit Monte Carlo was updating the estimate using one sample of the return. The idea now is to say: well, that was one sample, but we maintain a value function, so why couldn't we just look something up? Instead of needing to know the rewards from the state all the way to the end of the trajectory, we observed a particular reward and we got to a particular next state, so why don't we use the value function for that state? So what we're doing in this case is, instead of using G, we're plugging in the immediate reward plus gamma times our current estimate of the value of the next state we reached, which stands in for the discounted sum of future rewards.
[00:40:29] And one of the nice things here is that we don't have to wait; we can do this immediately, as soon as we reach s'. As soon as we see the next state, we can immediately update the value of our current state. So we don't have to wait until the end of the episode, and we can use this for infinite-horizon problems.
[00:40:56] Okay, so this is what that looks like, and we're going to call that the TD target. Again, this should look like machine learning; it should look like what we just did with Monte Carlo. What we're doing here is taking our old estimate and shifting it a little bit, by our learning rate, towards our target, which is the reward plus the discounted value of the next state under that plug-in estimate. And when we think about how much our estimate is changing, we often call that the TD(0) error, which measures how different my current estimate of the value of a state is from the target that I'm plugging in. Again, if you've seen Q-learning before, this is going to look really similar to what we had there, but without the max; you'll see that soon.
[00:41:58] Okay. So the TD(0) learning algorithm just looks like the following: you sample a tuple of state, action, reward, next state; you update the value for that starting state; and you repeat, so your t goes to t + 1 and you get the next tuple. You just do this over and over again. So in our Mars Rover example, you have state, action, reward, next state; you update; and then you just shift along.
[00:42:31] Let's see what that might look like here. In this case, let's imagine we have a policy where we always take action a1. We're going to make our discount factor one to make the math easy, and we're going to assume that any action from state s1 or s7 terminates the episode. Then what we see in this case is the following trajectory: we start in state s3, we take action a1, we get a reward of zero, we transition to state s2, and so forth until the end of the episode.
[00:43:04] so forth till the end of the episode so what we would have in this
[00:43:07] episode so what we would have in this case is that we would make it um so that
[00:43:11] case is that we would make it um so that the first up we update we would do be V
[00:43:13] the first up we update we would do be V of
[00:43:14] of S3 and what we would say is that's my
[00:43:16] S3 and what we would say is that's my old estimate of V of
[00:43:18] old estimate of V of S3 * 1 - Alpha I've just Rewritten the
[00:43:22] S3 * 1 - Alpha I've just Rewritten the above equation here because there's this
[00:43:24] above equation here because there's this was basically one and this is Alpha *
[00:43:27] was basically one and this is Alpha minus B plus Alpha the immediate re
[00:43:31] minus B plus Alpha the immediate re plus gamma V of
[00:43:36] S2 so that's what that would look like
[00:43:39] S2 so that's what that would look like and
[00:43:40] and here my imagine that I'm I I've
[00:43:43] here my imagine that I'm I I've initialized all of them to be zero to
[00:43:45] initialized all of them to be zero to start so this would still just look like
[00:43:48] start so this would still just look like zero okay and in fact the only what
[00:43:52] zero okay and in fact the only what would be the only state I would update
[00:43:53] would be the only state I would update to not be zero in this episode
[00:44:13] for it to be updated not to be zero
[00:44:15] for it to be updated not to be zero either its immediate reward has to be
[00:44:17] either its immediate reward has to be one or it has to be transitioning to a
[00:44:19] one or it has to be transitioning to a state whose value is not zero
[00:44:30] Yeah. So what we're seeing here is that we have state, action, reward, next state, and this is the TD update. What I was saying is that we've initialized all of the values to be zero, which means that in order for a state's value to change from zero, either its immediate reward has to be nonzero, or we have to transition to a state whose value is not zero, because every state's current value is zero.
[00:45:04] Yes, were you going to guess which state it updates? Which one was it? Well, when you're in state s1 you get a reward of one. That's right, yeah. So you don't see any reward until you get to state s1; I'll just highlight it here. That's the point at which you update your value function, the first time any reward becomes nonzero. So in that case what you get is V(s1) = V(s1)(1 - alpha) + alpha(1 + gamma V(s_terminal)). V(s_terminal) is always zero, so it just becomes alpha times 1.
[00:45:50] So why am I doing a lot of algebra here? Because if you work this out, and I won't go through it here but I think it's a useful exercise, the TD estimate you would get for your whole value function at the end of this episode is quite different from what you would get with Monte Carlo. TD updates after every single state, action, reward, next state tuple, and so when you reach the end of the episode, if you look at what your value function would be (I've written it here as a vector, from the value of s1 through the value of s7), it would say: my current estimate for s1 is one, and everything else is zero. But if you look at first-visit Monte Carlo, it's quite different. If we make gamma equal to one here, which I said it would be, it would be [1, 1, 1, 0, 0, 0, 0]. Why is this? Because Monte Carlo waits until the end of the episode and then uses the return to update every state that was visited in that episode.
[00:46:59] And the reason that's important is that we've now actually filled in a lot more entries, because we observed not just the reward here, but also that s2 got a return of one, and that s3 also got a return of one. The reason I bring this up is that there are going to be different choices about how these methods behave, particularly when you don't have a lot of data at the beginning, which may be more or less data-efficient or sample-efficient. Ideas of sample efficiency will come up a lot later on, and we'll see them on Thursday as well.
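The TD-versus-Monte-Carlo contrast from the example can be checked numerically. This is an illustrative sketch, not lecture code; the episode encoding and variable names are assumptions, and alpha is set to 1 for simplicity.

```python
# Compare TD(0) and first-visit Monte Carlo on the Mars Rover episode
# s3 -> s2 -> s1 -> terminal, with rewards 0, 0, 1 and gamma = 1.
gamma, alpha = 1.0, 1.0
episode = [("s3", 0, "s2"), ("s2", 0, "s1"), ("s1", 1, "T")]  # (s, r, s')
states = ["s1", "s2", "s3", "s4", "s5", "s6", "s7"]

# TD(0): update each state from the immediate reward and the *current*
# estimate of the next state, which is still zero when we reach it.
V_td = {s: 0.0 for s in states + ["T"]}
for s, r, s_next in episode:
    V_td[s] += alpha * (r + gamma * V_td[s_next] - V_td[s])

# First-visit MC: wait for the episode to end, then assign each visited
# state the full return from its first visit onward.
V_mc = {s: 0.0 for s in states}
G = 0.0
returns = {}
for s, r, _ in reversed(episode):
    G = r + gamma * G
    returns[s] = G  # earlier visits overwrite later ones -> first visit
V_mc.update(returns)

print([V_td[s] for s in states])  # [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print([V_mc[s] for s in states])  # [1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
```

After this single episode, TD has only propagated information one step back from the reward, while Monte Carlo has filled in every visited state.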
[00:47:32] All right, so what does this look like in terms of the tree? We go back to our tree, which is expanding out potential futures. What we can see here is that TD is updating the value estimate using a sample of s_{t+1} to approximate an expectation. In reality, if you were doing dynamic programming, you would want a weighted expectation over all the next states you could reach, weighted by the probability of getting there. What TD is doing is just sampling one of those, and that sample is an approximation of the expectation. So we're going from the full expectation to sampling the next state. But, similar to dynamic programming, it is then bootstrapping: unlike Monte Carlo, which goes all the way out to get a sample of the return, here we're just plugging in V, so this part looks the same.
[00:48:29] So TD does both: it samples to approximate expectations, and it bootstraps by using your existing estimate of the value function.
[00:48:44] All right, so let's just do a check-your-understanding; this is a poll. What I'd like you to think about is how this learning rate might affect things: whether different choices of it are going to weight the TD target more or less than the past V estimate; what might happen when your dynamics are stochastic, meaning that from one state you might end up in multiple next states, and what that means about convergence and the implications for learning rates; as well as thinking about deterministic Markov decision processes. What I mean by deterministic is that P(s' | s, a) = 1 for exactly one s', so there's no stochasticity: when you're in a state and take an action, you always go to one particular next state. That's a deterministic Markov decision process. So just take a few minutes now and look into this.
[00:50:04] You should be able to select all that are true, but if you can't, let me know. You cannot? Okay, right. Well, these are only for your own thoughts, so just try to write down for yourself which of these you think are true, and then we'll talk about them in a second; I'll check into the poll settings for next time. [00:51:12] Yes, that's what I just heard; sorry about that, I'll try to fix it for next time. Just try to have in your head which ones you think are correct, and I'll ask you to compare with someone in a second. Thanks for letting me know.
[00:51:58] All right, turn to your neighbor and check, and particularly focus on the last two; see if you agree on your answers for those.
[00:55:57] All right, great; you had some great discussions. Okay, so for the first one: this is false, because if we have alpha = 0 then we don't care about the TD target at all; it just drops out entirely, so we never update. The second one is true, because if alpha is equal to one, then the (1 - alpha) term cancels out and we're left with just the target; whenever we do an update, we potentially completely change our estimate.
[00:56:35] The third one is a little bit subtle; this is true. Does somebody want to give me an example where this might occur? Yeah? If you have two states that just keep pointing at each other, is that the case? Yes, and in particular if you go to either of those states with some probability. So I like to think of it as a coin flip: imagine you have one state where, afterwards, with 50% probability you get plus one and with 50% probability you get minus one, and then your problem just resets. Imagine it's a really short problem: you start off, you get zero reward, you transition to a state, and then your episode resets. So on any given round you're going to get either plus one or minus one, and you'll just flip back and forth: plus one, minus one, plus one, minus one. That's just to highlight that if you do have systems which are stochastic, the fact that your target uses a single sample of that stochasticity to approximate the expectation can be bad.
[00:57:52] But that does not mean it always happens, and I guess this gets to what was asked before: in many of these cases it might be possible that this would happen, but it won't always. There do exist deterministic systems where, even if alpha is equal to one, you can converge. Again, something I often like to do is think about really small MDPs to get some intuition. If you have a case where there's just a terminal state and no more transitions, like you get to some point where you always go to some terminal state and then it's plus 10, there are no more updates and no more expectation. In general, any case where there is no remaining stochasticity in that episode can be a case where it will still converge.
[00:58:43] Okay, great. I encourage you to go through some of the worked examples if you want to see some more comparisons of the difference between Monte Carlo and TD methods.
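The coin-flip failure mode above can be sketched in a few lines of Python. This is my own illustrative code, not from the course materials, with the two-step episode collapsed into one update: with alpha fixed at 1 the estimate is overwritten by each single +1 or -1 sample, while a properly decayed alpha recovers the expected value of 0.

```python
import random

# Illustrative sketch (not from the lecture slides): TD(0) on the coin-flip
# MDP described above. The start state gives reward 0, then the episode ends
# with reward +1 or -1 with equal probability, and resets.
random.seed(0)

def run_td0(alpha_fn, episodes=2000, gamma=1.0):
    v = 0.0  # estimate of the start state's value
    for k in range(1, episodes + 1):
        final_reward = random.choice([+1.0, -1.0])  # single stochastic sample
        target = 0.0 + gamma * final_reward         # immediate reward is 0
        alpha = alpha_fn(k)
        v = (1 - alpha) * v + alpha * target
    return v

v_overwrite = run_td0(lambda k: 1.0)     # alpha = 1: estimate is always the last sample, +1 or -1
v_averaged = run_td0(lambda k: 1.0 / k)  # decayed alpha: running mean, close to the true value 0
print(v_overwrite, v_averaged)
```

With `alpha = 1/k` the update is exactly an incremental sample average, which is why it settles near zero while the `alpha = 1` run never stops oscillating.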
[00:58:57] Just to summarize what we're doing in TD learning: we're bootstrapping and sampling. We're sampling to approximate our expectation over the stochasticity, and we're bootstrapping because we don't want to use a full return, so we plug in V to approximate it. It can be used in episodic or infinite-horizon settings, it is generally lower variance, and we're doing lots and lots of updates. It is a consistent estimator if your learning rate alpha satisfies the same conditions specified for incremental Monte Carlo policy evaluation.
[00:59:31] valuation I here today only introduce td0 What td0 refers to is you take the
[00:59:34] td0 What td0 refers to is you take the immediate reward then you immediately
[00:59:36] immediate reward then you immediately bootstrap and plug in the value of um
[00:59:40] bootstrap and plug in the value of um the next state so we did
[00:59:48] r+
[00:59:51] r+ okay we did r + gamma V of S Prime
[00:59:57] okay we did r + gamma V of S Prime versus summing up all your discounted
[00:59:59] versus summing up all your discounted rewards for the whole episode in general
[01:00:01] rewards for the whole episode in general you could have something kind of in
[01:00:02] you could have something kind of in between so you could
[01:00:04] between so you could have plus RT + 1+ gamma
[01:00:09] have plus RT + 1+ gamma squ so you could have something like
[01:00:11] squ so you could have something like this so in general you could say like do
[01:00:14] this so in general you could say like do some sort of combination of like using
[01:00:17] some sort of combination of like using partial returns and then bootstrapping
[01:00:19] partial returns and then bootstrapping um there's a lot of different TD methods
[01:00:22] um there's a lot of different TD methods that kind of interpolate between taking
[01:00:24] that kind of interpolate between taking one step and then plugging in the value
[01:00:26] one step and then plugging in the value versus only plugging not using any value
[01:00:29] versus only plugging not using any value function
[01:00:30] function approximation and if you want to think
[01:00:31] approximation and if you want to think about this graphically it's kind of like
[01:00:33] about this graphically it's kind of like thinking
[01:00:34] thinking about do you plug in V of S Prime here
[01:00:38] about do you plug in V of S Prime here or do you plug it in
[01:00:40] or do you plug it in no way
[01:00:42] no way lower yeah is there
[01:00:46] lower yeah is there like is there empal like um estimate of
[01:00:50] like is there empal like um estimate of what like a good trade of for for like
[01:00:53] what like a good trade of for for like computational complexity versus uh
[01:00:56] computational complexity versus uh performance like have they found like a
[01:00:58] performance like have they found like a good number for it good question
[01:01:01] good number for it good question unfortunately in many cases it will be
[01:01:03] unfortunately in many cases it will be depending on the domain um one thing I
[01:01:06] depending on the domain um one thing I think to think about here too is
[01:01:09] think to think about here too is that you can think of this part doesn't
[01:01:11] that you can think of this part doesn't require the Markoff assumption right so
[01:01:14] require the Markoff assumption right so if you have a system where you're not
[01:01:16] if you have a system where you're not confident but maybe you're like well
[01:01:18] confident but maybe you're like well maybe it's
[01:01:19] maybe it's like I'm willing to say that I'll plug
[01:01:22] like I'm willing to say that I'll plug in a markup assumption eventually
[01:01:23] in a markup assumption eventually because it's going to be lower variance
[01:01:25] because it's going to be lower variance um but want to preserve the fact that
[01:01:27] um but want to preserve the fact that maybe it's not markof and sort of have a
[01:01:28] maybe it's not markof and sort of have a short Horizon um often people do use
[01:01:31] short Horizon um often people do use something in between the two so they
[01:01:33] something in between the two so they often do consider this in between for
[01:01:35] often do consider this in between for for multiple reasons but it gives you
[01:01:36] for multiple reasons but it gives you some of this flexibility it often is um
[01:01:39] some of this flexibility it often is um a lower bias it's a great
[01:01:43] a lower bias it's a great question all right what we're going to
[01:01:45] question all right what we're going to do now is think about also how some of
[01:01:46] do now is think about also how some of these ideas relate to dynamic
[01:01:48] these ideas relate to dynamic programming which is what we saw in an
[01:01:50] programming which is what we saw in an earlier lecture because we could use
[01:01:51] earlier lecture because we could use this also for policy evaluation we know
[01:01:54] this also for policy evaluation we know how to use it for policy evaluation
[01:01:56] how to use it for policy evaluation if we are given the models but some of
[01:01:59] if we are given the models but some of you guys might have been thinking well
[01:02:00] you guys might have been thinking well we have data now if we have data because
[01:02:03] we have data now if we have data because we're taking the policy in the
[01:02:04] we're taking the policy in the environment couldn't we use that to
[01:02:06] environment couldn't we use that to estimate a reward model or couldn't we
[01:02:07] estimate a reward model or couldn't we use that to estimate the Dynamics model
[01:02:10] use that to estimate the Dynamics model and that's what's known as certainty
[01:02:12] and that's what's known as certainty equivalence um approaches so the idea
[01:02:15] equivalence um approaches so the idea here is that you're going to be getting
[01:02:16] here is that you're going to be getting data as you execute this policy and you
[01:02:19] data as you execute this policy and you can compute a Dynamics model from that
[01:02:21] can compute a Dynamics model from that data so you could use like a maximum
[01:02:23] data so you could use like a maximum likelihood mdp model remember right now
[01:02:26] likelihood mdp model remember right now we're in the tabular setting so we can
[01:02:28] we're in the tabular setting so we can have a parameter for every single state
[01:02:30] have a parameter for every single state in action so we can just count we can
[01:02:32] in action so we can just count we can just say how many times was I in this
[01:02:34] just say how many times was I in this state took this action and transitioned
[01:02:36] state took this action and transitioned to this next state divided by the number
[01:02:38] to this next state divided by the number of times I was in that state in action
[01:02:41] of times I was in that state in action so this just gives you um a maximum
[01:02:43] so this just gives you um a maximum likelihood estimate of the Dynamics
[01:02:44] likelihood estimate of the Dynamics model and you can do the same thing for
[01:02:46] model and you can do the same thing for the reward model and of course as you
[01:02:48] the reward model and of course as you might imagine you can do this with much
[01:02:50] might imagine you can do this with much more complicated function approximators
[01:02:52] more complicated function approximators like deep neural networks too but the
[01:02:54] like deep neural networks too but the idea is that once you have this model
[01:02:57] idea is that once you have this model and it's called a certainty equivalence
[01:02:59] and it's called a certainty equivalence model because we're now going to ignore
[01:03:02] model because we're now going to ignore any error in these models so we have
[01:03:05] any error in these models so we have finite data these models will definitely
[01:03:06] finite data these models will definitely be wrong but let's ignore that for now
[01:03:09] be wrong but let's ignore that for now so once you have this maximum likelihood
[01:03:11] so once you have this maximum likelihood mdp model you can just compute the value
[01:03:14] mdp model you can just compute the value of a policy using the same methods we
[01:03:16] of a policy using the same methods we saw last week because you have Dynamics
[01:03:19] saw last week because you have Dynamics model now and a reward model and you can
[01:03:22] model now and a reward model and you can see some examples about this at the end
[01:03:24] see some examples about this at the end of the lecture slides
[01:03:30] So one of the benefits of this, and this gets back to the earlier question, is that it is really data efficient. I showed you an example for the Mars rover before where with TD learning we only updated one of the states, and with Monte Carlo we updated three of the states. What this does instead is compute the dynamics model and reward model for all states and actions and then try to update all of them, so it's going to compute a value for every single state. The downside is that we're now doing policy evaluation with a full model, which is going to cost something like S^2 A for iterative methods, or maybe even worse. So it's computationally expensive, but it's really data efficient, because as soon as you reach any state where you get, say, a positive reward, you can propagate that to any other state that you know can reach it. It's still consistent, meaning it's going to converge to the right thing for Markov models, and it can generally be used for off-policy evaluation, which we're going to get into.
[01:04:37] Yeah? Sorry, what is N(s,a) in this equation? Great question: N(s,a) here is the number of times we've been in that state and taken that action. So these are counts.
[01:04:59] Okay, yeah? This seems pretty similar to Monte Carlo, I think. Is the difference just that you're estimating probabilities as opposed to calculating G?
[01:05:09] Great question; we're going to hold that thought. As for how similar this is to Monte Carlo: it's actually going to be pretty different, and we're going to see that in a second. We are using our data, similar to Monte Carlo, but here we're going to use the data to compute models and then propagate information, and the two approaches are going to end up making some interestingly different decisions. Let's see that now. It's a great point.
[01:05:33] see that now it's a great PR okay so now let's get into batch
[01:05:35] PR okay so now let's get into batch policy evaluation so I've said like
[01:05:37] policy evaluation so I've said like there are these different methods they
[01:05:38] there are these different methods they might have different computational
[01:05:40] might have different computational complexity they might be more or less
[01:05:41] complexity they might be more or less data
[01:05:42] data efficient so one thing that you might
[01:05:45] efficient so one thing that you might imagine doing and we'll see this a lot
[01:05:47] imagine doing and we'll see this a lot shortly a lot more next time is well if
[01:05:50] shortly a lot more next time is well if I have some data how could I best use
[01:05:52] I have some data how could I best use the data that I have and this comes a
[01:05:54] the data that I have and this comes a lot up a lot in the research that me and
[01:05:56] lot up a lot in the research that me and my lab do because we're often dealing
[01:05:57] my lab do because we're often dealing with patient data or student data or
[01:06:00] with patient data or student data or legal data or others where like it's
[01:06:01] legal data or others where like it's really expensive to get the data or it's
[01:06:03] really expensive to get the data or it's costly or could be harmful and we want
[01:06:05] costly or could be harmful and we want to get as much information as we can out
[01:06:07] to get as much information as we can out of the data we have so when I say batch
[01:06:10] of the data we have so when I say batch what I mean is imagine that you have a
[01:06:11] what I mean is imagine that you have a set of K episodes and now what you want
[01:06:13] set of K episodes and now what you want to do is you want to do value you want
[01:06:16] to do is you want to do value you want to do um policy evaluation just with
[01:06:18] to do um policy evaluation just with that data so what we're going to do is
[01:06:20] that data so what we're going to do is repeatedly sample one of the episodes
[01:06:22] repeatedly sample one of the episodes that we have of those K and we're going
[01:06:24] that we have of those K and we're going to apply Monte Carlo or TD Z to that
[01:06:26] to apply Monte Carlo or TD Z to that episode we're just going to do that over
[01:06:28] episode we're just going to do that over and over and over and over again and
[01:06:30] and over and over and over again and we'll see this a lot more next time as
[01:06:31] we'll see this a lot more next time as well and so the idea is to just
[01:06:34] well and so the idea is to just understand um if we do this given that
[01:06:37] understand um if we do this given that finite amount of data what will Monte
[01:06:39] finite amount of data what will Monte Carlo and td0 converge to in terms of
[01:06:41] Carlo and td0 converge to in terms of the um evaluation of the policy so let's
[01:06:44] the um evaluation of the policy so let's go through that this this really nice
[01:06:46] go through that this this really nice example from sardo to kind of illustrate
[01:06:48] example from sardo to kind of illustrate this we have a really small domain we
[01:06:51] this we have a really small domain we have two states we're going to say gamma
[01:06:53] have two states we're going to say gamma is equal to one there's no discounting
[01:06:55] is equal to one there's no discounting and we have eight episodes of experience
[01:06:58] and we have eight episodes of experience so in one episode we started in um State
[01:07:02] so in one episode we started in um State uh a we got a reward of zero we
[01:07:05] uh a we got a reward of zero we transitioned to B and we got another
[01:07:07] transitioned to B and we got another reward of zero okay so in this case you
[01:07:10] reward of zero okay so in this case you can think of it as like this had a
[01:07:12] can think of it as like this had a trajectory like
[01:07:14] trajectory like that in some episodes we started in
[01:07:17] that in some episodes we started in state B and we just got an immediate
[01:07:20] state B and we just got an immediate reward of one and we observe that six
[01:07:23] reward of one and we observe that six times so in six trajectories we just
[01:07:25] times so in six trajectories we just happened to to start in state B and we
[01:07:26] happened to to start in state B and we got a reward of one and then in one
[01:07:29] got a reward of one and then in one trajectory we started in state B and we
[01:07:31] trajectory we started in state B and we got a reward of
[01:07:33] got a reward of zero so first imagine if you ran TD
[01:07:36] zero so first imagine if you ran TD updates over this data in infinite
[01:07:38] updates over this data in infinite amount of time what do you think the
[01:07:40] amount of time what do you think the estimative VB would
[01:07:44] Remember what the update is: we have (1 - alpha) times our old estimate V, plus alpha times the immediate reward plus gamma times the value of the next state. But here the next state is terminal: we always terminate after B, so you never get any future discounted rewards. So the TD updates look like (1 - alpha) times your old estimate plus alpha times whatever reward you get in B, and imagine you just iterate over these episodes over and over again. Does somebody have a guess for what the value would be for V(B)?
[01:08:36] Would it be 0.75? Yeah. Does somebody else want to explain why it's 0.75? I see some nods. Yeah: in this case we had eight episodes, and in two of them, when we started in B, we got zero, and in six of them we got one. So we just average those rewards, and if we do this many, many times, eventually you would just converge to this estimate being 0.75.
[01:09:21] What about for Monte Carlo? Let's do Monte Carlo for V(B). Remember, for Monte Carlo it would be (1 - alpha) times V(B) plus alpha times G, the return from when you start in state B. Is it going to be the same thing, or is it going to be different?
[01:10:00] For Monte Carlo we're just averaging over all the returns we get starting in that state.
[01:10:18] So when we start at B, wouldn't it be 6 over 8? 6/8 again, yeah. But don't we start at A in the first one? In one episode we do start in A, but we're just trying to compute the value of B right now; we'll get to A in a second, so you're thinking ahead. Just to recap what we're trying to do here: we want to see whether these two algorithms converge to the same thing or not. For the value of state B, in TD we would converge to 0.75, because the immediate reward is either one or zero and the discounted sum of future rewards is zero since we terminate. And in Monte Carlo we just average over all the returns we get when we've started in B, and that is also 6/8.
[01:11:08] we've started in B and that is also 6id 8 all right so now this the hard
[01:11:11] 8 all right so now this the hard one okay what about V of
[01:11:14] one okay what about V of a what will we converge to in these two
[01:11:18] a what will we converge to in these two cases so let's do check your
[01:11:20] cases so let's do check your understanding and you can respond in the
[01:11:22] understanding and you can respond in the poll and feel free to talk to someone
[01:11:23] poll and feel free to talk to someone next to you
[01:11:27] and again the intent of this is to think
[01:11:28] and again the intent of this is to think about are these actually Computing the
[01:11:30] about are these actually Computing the same thing or not in this
[01:11:34] same thing or not in this case and remember this is a different
[01:11:36] case and remember this is a different setting than I told you before that both
[01:11:38] setting than I told you before that both of these things can be consistent but
[01:11:40] of these things can be consistent but that was if you get infinite data what
[01:11:43] that was if you get infinite data what this is looking at is if you only have a
[01:11:44] this is looking at is if you only have a finite amount of data you just go over
[01:11:46] finite amount of data you just go over it over and over again either with Monte
[01:11:49] it over and over again either with Monte Carlo updates or with TD will you
[01:11:51] Carlo updates or with TD will you converge to the same thing
[01:11:56] and if you're not sure or you're
[01:11:57] and if you're not sure or you're confused feel free to put that in the
[01:11:58] confused feel free to put that in the poll
[01:12:24] too for
[01:13:19] Okay, there are lots of different answers here, so this is a great one to talk about with somebody nearby: see if you're getting the same things, and use our collective intelligence.
[01:15:05] I'm hearing a lot of good discussion, so I'm sorry to interrupt, but this is kind of a fun one. Why don't we start with TD: does someone want to explain why it's 0.75 for TD? There were multiple people that got that.
[01:15:22] Yeah, would you want to explain what you and your partner found? Yeah, so if you just look at it: there's just one episode where we're at A, and in that episode the immediate reward is zero, but then we have to add gamma times the value of the next state, B. And V(B) is 0.75, which we got in the previous part, so it evaluates to 0.75. That's right.
[01:15:55] So TD gives you 0.75. Monte Carlo is not that estimate; what does Monte Carlo give you? It's not 0.75, and again, multiple of you got it correct.
[01:16:13] Oh, sorry, I have a question: is gamma the same as alpha here? Oh, good question. No, here I'm assuming that gamma is one. And someone else was asking this too, so: I'm also assuming that alpha is set correctly for these to converge. We're going over our data an infinite number of times, but we're decaying alpha correctly as we do that. Great question.
[01:16:41] So, somebody want to say what Monte Carlo gives, since it's not 0.75? Is it zero? Yes, it is. Great. Someone want to explain why it's zero?
[01:16:55] so B to Carlo is
[01:16:59] zero yeah we've only seen one trory a
[01:17:04] zero yeah we've only seen one trory a shows up
[01:17:05] shows up and that's right remind me your name I'm
[01:17:08] and that's right remind me your name I'm yeah what said is exactly right so we've
[01:17:10] yeah what said is exactly right so we've only seen one trajectory and I know some
[01:17:12] only seen one trajectory and I know some other people made the same observation
[01:17:14] other people made the same observation so we've only seen one trajectory where
[01:17:16] so we've only seen one trajectory where there was a at
[01:17:17] there was a at all we for mon Carlo we just average
[01:17:20] all we for mon Carlo we just average over all the returns we've seen we that
[01:17:22] over all the returns we've seen we that so that's only zero so I bring that
[01:17:25] so that's only zero so I bring that because even though um ASM totically all
[01:17:28] because even though um ASM totically all of these things converge to the right
[01:17:29] of these things converge to the right thing under some mild assumptions um
[01:17:32] thing under some mild assumptions um with finite data which is what we're
[01:17:33] with finite data which is what we're almost always going to have in reality
[01:17:35] almost always going to have in reality even if you go over it multiple times
[01:17:36] even if you go over it multiple times they are converging to sometimes totally
[01:17:38] they are converging to sometimes totally different things and here is what they
[01:17:40] different things and here is what they are converging to in general Monte Carlo
[01:17:43] are converging to in general Monte Carlo is converging to the minimum mean
[01:17:45] is converging to the minimum mean squared error with respect to the
[01:17:47] squared error with respect to the observed returns so it's just going to
[01:17:49] observed returns so it's just going to set it so it minimizes the error between
[01:17:51] set it so it minimizes the error between the observe turns it's seen and um H its
[01:17:55] the observe turns it's seen and um H its value so in this case that would be B of
[01:17:58] value so in this case that would be B of a equals
[01:17:59] a equals z so that is the minimum means squ error
[01:18:02] z so that is the minimum means squ error td0 converges to the dynamic programming
[01:18:05] td0 converges to the dynamic programming policy for the mdp with a maximum
[01:18:08] policy for the mdp with a maximum likelihood model
[01:18:09] likelihood model estimates. so do you guys remember how we just talked about
[01:18:13] certainty equivalence? what we were doing there is taking all our data: the
[01:18:16] answer you get from TD(0), if you do this batch process, is the same as if
[01:18:23] you had computed your maximum likelihood Markov decision process from the
[01:18:26] data you have and then done dynamic programming with it. so this will be
[01:18:31] exactly the same as that. and so in particular it is leveraging and using
[01:18:37] the Markov assumption, and that's why it can actually chain these things
[01:18:44] together. so you can see here that Monte Carlo doesn't know that the value
[01:18:51] of A has to be related to the value of B in terms of this bootstrapping
[01:18:56] relationship. but TD is making that explicit: it's using the Markov decision
[01:19:02] process to say the value of A has to exactly equal the immediate reward you
[01:19:06] get in A plus gamma times the value of the next state you could get into,
[01:19:08] which is always B, so the value of
[01:19:11] so the value of B so TD learning is explicitly baking
[01:19:15] B so TD learning is explicitly baking that into the the solution you get
[01:19:16] that into the the solution you get whereas Monte Carlo is not Monte Carlo
[01:19:18] whereas Monte Carlo is not Monte Carlo is just trying to minimize the mean
[01:19:19] is just trying to minimize the mean squared error for the returns you see so
[01:19:22] squared error for the returns you see so they can end up giving you very
[01:19:24] they can end up giving you very different solutions and depending on
[01:19:26] different solutions and depending on whether your prop markup property is
[01:19:27] whether your prop markup property is really satisfied or not you might want
[01:19:29] really satisfied or not you might want one or the
[01:19:32] other. awesome. so this just summarizes quickly some of the different
[01:19:36] properties and approaches, and highlights that temporal difference really
[01:19:39] does exploit this Markov structure, and that can be really helpful if you
[01:19:44] want to leverage it to get better estimates of earlier states, like in the
[01:19:46] case we just saw.
[01:19:47] so just to summarize: we finished going through policy evaluation in the
[01:19:52] tabular setting, and then on Wednesday what we're going to do is talk about
[01:19:58] control, and we'll start to talk about function approximation as well. all
[01:19:59] right, thanks, see you then
Lecture 004
Stanford CS234 Reinforcement Learning I Q learning and Function Approximation I 2024 I Lecture 4
Source: https://www.youtube.com/watch?v=b_wvosA70f8
---
Transcript
[00:00:06] all right welcome back um we're going to
[00:00:07] all right welcome back um we're going to start lecture four in reinforcement
[00:00:09] start lecture four in reinforcement learning so we're going to be covering
[00:00:11] learning so we're going to be covering today Q learning we're going to cover
[00:00:13] today Q learning we're going to cover deep Q learning this result came out in
[00:00:16] deep Q learning this result came out in roughly 2014 um and I remember it being
[00:00:18] roughly 2014 um and I remember it being a really big deal because one of the big
[00:00:20] a really big deal because, at one of the big conferences, Neural Information
[00:00:21] Processing Systems, DeepMind came and
[00:00:24] had this amazing demonstration that
[00:00:26] had this like amazing demonstration that they were able to now have an agent that
[00:00:29] they were able to now have an agent that could learn to play video video games
[00:00:30] could learn to play video video games really well and an important thing to
[00:00:32] really well and an important thing to note here is like they're doing video
[00:00:34] note here is like they're doing video games from Pixel input so like they're
[00:00:36] games from Pixel input so like they're just getting the same input is what we
[00:00:38] just getting the same input is what we do and what the agent was learning to do
[00:00:40] do and what the agent was learning to do is to control the game um through this
[00:00:43] is to control the game um through this and through reinforcement learning and
[00:00:45] and through reinforcement learning and so we'll talk today about the algorithm
[00:00:47] so we'll talk today about the algorithm that they did to do that um and we'll
[00:00:49] that they did to do that um and we'll build up to that point and this is a
[00:00:51] build up to that point and this is a short video they show to just illustrate
[00:00:54] short video they show to just illustrate how the agent is learning through direct
[00:00:57] how the agent is learning through direct experience to try to optimize the score
[00:01:00] experience to try to optimize the score and so what it learns in this case is it
[00:01:01] and so what it learns in this case is it starts to learn particular strategies
[00:01:03] starts to learn particular strategies that allow it to do really
[00:01:05] that allow it to do really well which may or may not be the same
[00:01:07] well, which may or may not be the same ones as what humans would
[00:01:09] use. and so it was pretty incredible; this
[00:01:12] use and so it was pretty incredible this is one of the sort of most impressive
[00:01:14] is one of the sort of most impressive successes of reinforcement learning at
[00:01:16] successes of reinforcement learning at this point um particularly at trying to
[00:01:18] this point um particularly at trying to do tasks that humans can do as well um
[00:01:21] do tasks that humans can do as well um and from Pixel inputs and so we're going
[00:01:23] and from Pixel inputs and so we're going to see today sort of how that algorithm
[00:01:27] works all right so but before we do that
[00:01:30] works all right so but before we do that let's start with a quick check your
[00:01:31] let's start with a quick check your understanding um these are posted inside
[00:01:33] understanding um these are posted inside of Ed and this asks you to think about
[00:01:36] of Ed and this asks you to think about the policy Improvement stage so we're
[00:01:38] the policy Improvement stage so we're going to be talking today a lot about
[00:01:40] going to be talking today a lot about learning through direct experience um
[00:01:43] learning through direct experience um and scaling up towards function
[00:01:44] and scaling up towards function approximation with doing that but first
[00:01:46] approximation with doing that but first let's think about uh when we're doing
[00:01:49] let's think about uh when we're doing this what sort of form the policy has um
[00:01:53] this what sort of form the policy has um and then as we do this evaluation we do
[00:01:55] and then as we do this evaluation we do this repeated evaluation and policy
[00:01:57] this repeated evaluation and policy Improvement um what happens in this case
[00:02:07] these are the first two questions on the
[00:02:08] these are the first two questions on the polls
[00:02:35] sorry, I just joined the class. it's on Ed. yeah, what's your name? thanks.
[00:02:44] yeah, so if anybody's new to the class, you can go to Ed; you should be able
[00:02:46] to get to that through
[00:02:58] canvas
[00:03:28] all right, we have good agreement on the first one, whether this policy is
[00:03:33] stochastic: under the assumption that for each state there's a unique max,
[00:03:37] the new policy will be deterministic. I think almost everybody got that
[00:03:40] correct, which is great.
[00:03:42] so now, so this is
[00:03:46] great um so now so this is the the answer for this but there's some
[00:03:49] the the answer for this but there's some disagreement about the second one so why
[00:03:51] disagreement about the second one so why don't you turn to a neighbor and compare
[00:03:53] don't you turn to a neighbor and compare what you got for whether you can compute
[00:03:55] what you got for whether you can compute Q Pi I +1 um by using this to generate
[00:03:59] Q Pi I +1 um by using this to generate new
[00:04:24] trajectories and remember what I mean by
[00:04:26] trajectories and remember what I mean by this is I want to know whether or not
[00:04:27] this is I want to know whether or not you can get the state action value for
[00:04:30] you can get the state action value for every state and action pair under this
[00:04:32] every state and action pair under this new policy so I want to know if you can
[00:04:35] new policy so I want to know if you can compute Q of sa under this um new policy
[00:05:09] so I'll give you a hint if a policy is
[00:05:14] so I'll give you a hint if a policy is deterministic how many actions does it
[00:05:16] deterministic how many actions does it take in the same
[00:05:21] state one right so are you going to get
[00:05:24] state one right so are you going to get any data about any other actions in that
[00:05:27] any data about any other actions in that state so can we compute the Q value of
[00:05:31] state so can we compute the Q value of all actions in that state
[00:05:34] all actions in that state no that's right yeah so this is
[00:05:39] no that's right yeah so this is false we can't compute it because if we
[00:05:42] false we can't compute it because if we have a deterministic policy then we only
[00:05:45] have a deterministic policy then we only ever take Pi of s so we would only take
[00:05:49] ever take pi of s, so we would only take pi i+1 of s; that would be the only
[00:05:53] action we'd ever take in that state
[00:05:55] action we' ever take in that state because the policy is deterministic it
[00:05:57] because the policy is deterministic it only takes that one that one action and
[00:05:59] only takes that one that one action and so that means you're just not going to
[00:06:00] so that means you're just not going to get any data about what it would be like
[00:06:02] get any data about what it would be like to take other actions in that
[00:06:05] to take other actions in that state and so that's useful to know
[00:06:07] state and so that's useful to know because it means that if we had models
[00:06:08] because it means that if we had models of the dynamics, or if
[00:06:11] we had models of the reward, and we could
[00:06:12] do some other things, then we might be
[00:06:14] do some other things then we might be able to compute these Q values but here
[00:06:16] able to compute these Q values but here if we're going to start thinking about
[00:06:17] if we're going to start thinking about just learning this from data and from
[00:06:19] just learning this from data and from direct experience that if we have a
[00:06:21] direct experience that if we have a deterministic policy it's not going to
[00:06:23] deterministic policy it's not going to give us any data about trying different
[00:06:25] give us any data about trying different actions in the same state and so that's
[00:06:28] actions in the same state and so that's going to introduce some challenges that
[00:06:30] going to introduce some challenges that we have to tackle when we're trying to
[00:06:32] we have to tackle when we're trying to get data about the world um in order to
[00:06:34] get data about the world um in order to learn an optimal qy
[00:06:41] function great so what we're going to be
[00:06:43] function great so what we're going to be doing today then is try to think about
[00:06:45] doing today then is try to think about building on what we learned last time
[00:06:47] building on what we learned last time about policy evaluation um where we're
[00:06:49] about policy evaluation um where we're trying to learn directly from experience
[00:06:51] trying to learn directly from experience to be able to evaluate how good a
[00:06:52] to be able to evaluate how good a particular decision policy is how do we
[00:06:54] particular decision policy is how do we leverage that information to then
[00:06:56] leverage that information to then actually learn an optimal policy to
[00:06:57] actually learn an optimal policy to actually learn a good decision um you
[00:07:00] actually learn a good decision um you know a good policy uh without having to
[00:07:02] know a good policy, without having a model of how the world works. so we don't
[00:07:03] model of how the world works so we don't have access to an explicit parametric
[00:07:05] have access to an explicit parametric representation of the Dynamics model or
[00:07:07] representation of the Dynamics model or the reward model and then we're also
[00:07:09] the reward model and then we're also going to talk about uh value function
[00:07:12] going to talk about uh value function approximation and in particular we're
[00:07:13] approximation and in particular we're going to talk about Q
[00:07:16] going to talk about Q learning with deep neural
[00:07:19] learning with deep neural networks AK DQ which led to this really
[00:07:23] networks AK DQ which led to this really seminal result in like having machines
[00:07:26] seminal result in like having machines that can just play directly from Vision
[00:07:28] that can just play directly from Vision to learn how to play games like like
[00:07:29] to learn how to play games like like Atari um but I'll just pause here in
[00:07:31] Atari um but I'll just pause here in case anybody had any questions or
[00:07:33] case anybody had any questions or logistic questions before we dive into
[00:07:38] this all right and we're going to cover
[00:07:40] this all right and we're going to cover a lot today um because next week we're
[00:07:42] a lot today um because next week we're going to start policy gradient methods
[00:07:44] going to start policy gradient methods um and we're doing that because we think
[00:07:46] um and we're doing that because we think that that's a really important thing to
[00:07:48] that that's a really important thing to focus on um so but there will be quite a
[00:07:51] focus on um so but there will be quite a lot today uh and you're welcome to reach
[00:07:53] lot today uh and you're welcome to reach out I've put a bunch of worked examples
[00:07:54] out I've put a bunch of worked examples at the end in case people want to step
[00:07:56] at the end in case people want to step through some of those with Mars Rober
[00:07:57] through some of those with Mars Rober and others
[00:08:00] and others. all right, so we're going to discuss a bunch of things,
[00:08:02] and we're going to start by thinking
[00:08:04] and we're going to start by thinking about staying in the tabular land so
[00:08:06] about staying in the tabular land so staying where we can write down the
[00:08:07] staying where we can write down the value function as a vector and then
[00:08:09] value function as a vector and then trying to learn how to make optimal
[00:08:11] trying to learn how to make optimal decisions in that
[00:08:13] decisions in that case so let's first just talk about the
[00:08:15] case so let's first just talk about the idea of generalized policy Improvement
[00:08:18] idea of generalized policy Improvement so we've s seen before this idea of
[00:08:20] so we've s seen before this idea of alternating between policy valuation and
[00:08:23] alternating between policy valuation and policy Improvement and now we're going
[00:08:25] policy Improvement and now we're going to think about that for slightly more
[00:08:26] to think about that for slightly more General cases of policies
[00:08:30] so what we just said here is that if the
[00:08:33] so what we just said here is that if the policy is deterministic we can't compute
[00:08:35] policy is deterministic we can't compute the state action value for any action
[00:08:37] the state action value for any action that's not the policy and so um what
[00:08:40] that's not the policy and so um what we'd like to be able to do now is to
[00:08:42] we'd like to be able to do now is to have kind of more coverage and to do
[00:08:44] have kind of more coverage and to do that we're going to have stochastic
[00:08:45] that we're going to have stochastic policies because if the policy is
[00:08:47] policies because if the policy is stochastic then we'll try multiple
[00:08:49] stochastic then we'll try multiple actions in the same state and we can use
[00:08:51] actions in the same state and we can use that data to estimate the Q
[00:08:53] that data to estimate the Q function so we're staying in what we're
[00:08:55] function so we're staying in what we're calling model-free policy iteration
[00:08:57] calling model-free policy iteration meaning we're not trying to explicitly
[00:08:59] meaning we're not trying to explicitly build a dynamics or reward model;
[00:09:01] we're just trying to directly estimate a Q function, and once we have a Q
[00:09:05] function, then we can extract from it an
[00:09:08] argmax policy or something else. okay, and we're
[00:09:11] policy or something else okay and we're now going to be using an estimated Q
[00:09:16] because we will be
[00:09:22] estimating Q from data directly from
[00:09:26] estimating Q from data directly from experience
[00:09:30] all right so this is going to introduce
[00:09:31] all right so this is going to introduce this General challenge of exploration
[00:09:33] this General challenge of exploration which is we can only learn about the
[00:09:35] which is we can only learn about the things we try in the world this is just
[00:09:37] things we try in the world this is just like the you can't know how much better
[00:09:39] like the you can't know how much better or worse your life would be right now if
[00:09:40] or worse your life would be right now if you were drinking coffee at Coupa.
[00:09:43] same thing: we can only learn about the actions that we take, and
[00:09:47] the actions that we take um and the and so we need to learn about actions by
[00:09:49] so we need to learn about actions by trying them so we need to
[00:09:51] trying them so we need to explore but the downside in general is
[00:09:53] explore but the downside in general is if we try new actions we are spending
[00:09:55] if we try new actions we are spending less time using our knowledge to make
[00:09:58] less time using our knowledge to make good decisions
[00:10:00] good decisions so you might imagine that you can act
[00:10:02] so you might imagine that you can act random randomly always and that would
[00:10:04] random randomly always and that would work for like learning a lot about the
[00:10:06] work for like learning a lot about the world and learning a lot about Q
[00:10:08] world and learning a lot about Q functions but you wouldn't be finding
[00:10:10] functions but you wouldn't be finding you wouldn't be acting using that
[00:10:12] you wouldn't be acting using that knowledge to try to gain High reward so
[00:10:14] knowledge to try to gain High reward so this is known as the general challenge
[00:10:16] this is known as the general challenge between like exploration and
[00:10:18] between like exploration and exploitation um how much time do we
[00:10:21] exploitation um how much time do we spend exploring and getting new data
[00:10:23] spend exploring and getting new data about things that might be good versus
[00:10:25] about things that might be good versus how many time how much of the time do we
[00:10:27] how many time how much of the time do we exploit our knowledge of how the world
[00:10:28] exploit our knowledge of how the world works according to the data we have so
[00:10:30] works according to the data we have so far to try to make good
[00:10:32] far to try to make good decisions and this will come up a lot
[00:10:34] decisions and this will come up a lot this is there's a really deep questions
[00:10:36] this is there's a really deep questions around here about thinking of um how do
[00:10:39] around here about thinking of um how do we quantify our uncertainty in our
[00:10:40] we quantify our uncertainty in our knowledge and then how do we propagate
[00:10:42] knowledge and then how do we propagate that uncertainty into the value of that
[00:10:44] that uncertainty into the value of that uncertainty for Downstream decision-
[00:10:46] uncertainty for Downstream decision- making so we'll see a lot more about
[00:10:47] making so we'll see a lot more about that later in the course and this is
[00:10:50] that later in the course and this is continues to be a really active area of
[00:10:51] continues to be a really active area of research this is not at all solved um
[00:10:54] research this is not at all solved um but here we're just going to start to
[00:10:55] but here we're just going to start to see some simple methods to try to tackle
[00:10:57] see some simple methods to try to tackle this um challenges sort of balancing
[00:11:00] this um challenges sort of balancing between these two
[00:11:02] between these two things so one of the simplest things you
[00:11:04] things so one of the simplest things you could imagine doing is what's called
[00:11:06] could imagine doing is what's called Epsilon greedy and the idea with Epsilon
[00:11:09] Epsilon greedy and the idea with Epsilon greedy is you're going to just spend
[00:11:10] greedy is you're going to just spend some of the time doing things randomly
[00:11:12] some of the time doing things randomly and some of the times doing things the
[00:11:14] and some of the times doing things the best way you know how because um you're
[00:11:16] best way you know how because um you're kind of exploiting that knowledge so if
[00:11:18] kind of exploiting that knowledge so if we just have a finite number of actions
[00:11:20] we just have a finite number of actions because right now we're still in the
[00:11:21] because right now we're still in the tabular case so we just have a finite
[00:11:24] tabular case so we just have a finite number of states and a finite number of
[00:11:25] number of states and a finite number of actions then Epsilon greedy policy says
[00:11:29] actions then Epsilon greedy policy says um
[00:11:30] um with high probability so we have some
[00:11:32] with high probability so we have some Epsilon here epsilon's going to be less
[00:11:34] Epsilon here epsilon's going to be less than one could be like probability could
[00:11:36] than one could be like probability could be 0.1 for example so with high
[00:11:38] be 0.1 for example so with high probability you're going to take
[00:11:39] probability you're going to take whatever action maximizes your Q value
[00:11:42] whatever action maximizes your Q value in your current
[00:11:44] in your current state so you're going to kind of exploit
[00:11:46] state so you're going to kind of exploit your knowledge for whatever your state
[00:11:47] your knowledge for whatever your state action value says and you're going to do
[00:11:49] action value says and you're going to do that with probability 1 minus Epsilon
[00:11:52] that with probability 1 minus Epsilon and then otherwise you're going to take
[00:11:53] and then otherwise you're going to take an action at
[00:11:55] an action at random and so when you pick an action
[00:11:57] random. and so when you pick an action uniformly at random, it might be one
[00:11:59] Rand uniformly at random it might be one of the same one as the argmax or it
[00:12:01] of the same one as the argmax or it might be a different one but either way
[00:12:03] might be a different one but either way the main idea is that essentially you
[00:12:04] the main idea is that essentially you spend 1 minus Epsilon percentage of the
[00:12:07] spend 1 minus Epsilon percentage of the time being greedy with respect to your
[00:12:10] time being greedy with respect to your knowledge, and epsilon percent of the time acting
[00:12:15] randomly so it's like you know maybe you
[00:12:18] randomly so it's like you know maybe you say like okay I'm committed to trying
[00:12:19] say like okay I'm committed to trying out new things at my restaurant so once
[00:12:21] out new things at my restaurant so once a week I will try a random dish and the
[00:12:22] a week I will try a random dish and the other six days I'll pick whatever I like
[00:12:25] other six days I'll pick whatever I like like whatever I've liked in the past and
[00:12:26] like whatever I've liked in the past and has always been good
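The epsilon-greedy rule described above can be sketched as follows. This is a generic illustration with invented Q values, not the course's starter code:

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))   # uniform random action
    return max(q_values, key=q_values.get)  # greedy (argmax) action

random.seed(0)
q_s = {"up": 0.3, "down": 0.9}  # made-up Q estimates for one state
picks = [epsilon_greedy(q_s, epsilon=0.1) for _ in range(1000)]
# "down" is chosen roughly 1 - epsilon + epsilon/2 = 95% of the time,
# since the random branch can also land on the greedy action.
print(picks.count("down") / 1000)
```

Note that the exploring branch draws uniformly over all actions, including the greedy one, which matches the point above that the random action might be the same as the argmax.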
[00:12:29] has always been good so this is a pretty simple strategy this
[00:12:31] so this is a pretty simple strategy this is not trying to have a deep notion of
[00:12:33] is not trying to have a deep notion of uncertainty or trying to quantify that
[00:12:35] uncertainty or trying to quantify it, but nevertheless this can be pretty
[00:12:39] effective. so in particular, we can prove
[00:12:42] effective so in particular we can prove things about policy improvement with
[00:12:44] things about policy improvement with Epsilon greedy policies so what we
[00:12:46] Epsilon greedy policies so what we proved in the past is that if you do
[00:12:48] proved in the past is that if you do policy iteration when you know the
[00:12:50] policy iteration when you know the Dynamics and reward models you are
[00:12:52] Dynamics and reward models you are guaranteed to monotonically improve so
[00:12:55] guaranteed to monotonically improve so each round of policy iteration either
[00:12:57] each round of policy iteration either you would stay the same in which case
[00:12:58] you would stay the same in which case you found the optimal policy or you
[00:13:01] you found the optimal policy or you wouldn't change it and in that case um
[00:13:04] wouldn't change it and in that case um or or you would
[00:13:06] or or you would improve but when we did that proof we
[00:13:08] improve but when we did that proof we assumed policy Improvement using um a
[00:13:10] assumed policy Improvement using um a deterministic policy and it turns out
[00:13:12] deterministic policy and it turns out the same property holds with Epsilon
[00:13:14] the same property holds with Epsilon greedy
[00:13:16] greedy policies so if your policy is always
[00:13:18] policies so if your policy is always like an Epsilon greedy policy you can
[00:13:20] like an Epsilon greedy policy you can also get this kind of monotonic
[00:13:25] In particular — and I'm not going to do the full proof today, I'll leave it in, just for time — what this shows is: imagine you have some policy pi_i and a Q function which tells you the state-action value for that policy pi_i, and pi_i is epsilon-greedy, which means some of the time it acts greedily with respect to that Q function and some of the time it selects an action at random. That's what it means to be an epsilon-greedy policy with respect to that Q: when it's being greedy, it's greedy with respect to that Q function. What this says is that pi_{i+1} is a monotonic improvement, so that V^{pi_{i+1}} is at least as large as V^{pi_i}, and we can prove this. Essentially we're trying to prove that the new policy you extract through doing policy improvement, which is still an epsilon-greedy policy, is going to be better than your old epsilon-greedy policy.
[00:14:24] The main idea is just that you can also do policy improvement when you don't have deterministic policies but instead have these epsilon-greedy policies, and you can still get monotonic improvement. I'll leave that — I'll put it at the end for later and post the proof.
[00:14:42] later post proof okay so this is just to highlight
[00:14:45] proof okay so this is just to highlight like here's one thing we could do and
[00:14:46] like here's one thing we could do and we're going to see that this is actually
[00:14:47] we're going to see that this is actually going to be a pretty helpful thing to do
[00:14:49] going to be a pretty helpful thing to do this is one thing we could do um to try
[00:14:51] this is one thing we could do um to try to get data about other actions so so
[00:14:54] to get data about other actions so so we're not just taking a single action in
[00:14:55] we're not just taking a single action in a single state but we actually have some
[00:14:58] a single state but we actually have some probability of trying out multiple
[00:15:00] probability of trying out multiple actions and just to make that concrete
[00:15:02] actions and just to make that concrete if you think back to our Mars Rover
[00:15:03] if you think back to our Mars Rover example there are only seven states so
[00:15:05] example there are only seven states so if you act in it for a long time you'd
[00:15:07] if you act in it for a long time you'd repeatedly reach the same States what
[00:15:09] repeatedly reach the same States what this EG greedy policy is doing is saying
[00:15:11] this EG greedy policy is doing is saying like even when you get to the same state
[00:15:13] like even when you get to the same state you might take different actions and so
[00:15:14] you might take different actions and so over time you're going to get data that
[00:15:16] over time you're going to get data that allows you to estimate the Q value of
[00:15:18] allows you to estimate the Q value of that whole
[00:15:20] that whole policy so now we're going to see is how
[00:15:22] policy so now we're going to see is how we can use these ideas of kind of EDD
[00:15:24] we can use these ideas of kind of EDD policies to actually do control so what
[00:15:27] policies to actually do control so what I mean by that is that we're going to
[00:15:28] I mean by that is that we're going to try to learn optimal ways of acting in
[00:15:30] try to learn optimal ways of acting in the environment and we're going to start
[00:15:32] the environment and we're going to start we're going to have the same scenario as
[00:15:34] we're going to have the same scenario as last time so we're going to either have
[00:15:35] last time so we're going to either have Monte Carlo approaches where we simulate
[00:15:38] Monte Carlo approaches where we simulate in the world and then we use that to try
[00:15:40] in the world and then we use that to try to improve or temporal difference
[00:15:42] to improve or temporal difference approaches which more directly try to
[00:15:44] approaches which more directly try to use the Bellman and Markoff
[00:15:47] use the Bellman and Markoff structure okay so let's start with Monte
[00:15:47] Okay, so let's start with Monte Carlo. Remember what we had before: this Monte Carlo policy evaluation algorithm, where we would repeatedly loop, sampling the k-th episode — just a series of states and actions under a particular policy — and then compute the return from each step until the end of the episode. Then, for the first time you visited a particular state-action tuple, you would update the Q value by a weighted average between your old estimate and your new target, which was just the sum of rewards you got starting in that state and action until the end of the episode.
[00:16:34] This is what we often call our target, and we were using it because we knew from Monte Carlo that what we want to do is really estimate the value of starting in this state, taking this action, and following this policy to the end of the episode — and we can get a sample of that by doing this. That sample is an unbiased approximation to the true expected sum of rewards you would get starting in this state and action and going until the end of the episode.
[00:17:12] [Student question, partially inaudible] — We're going to see that. Yes, exactly. So when we thought about this before, we thought of the policy as a deterministic policy, or that was the easiest way to think about it, but now the policy could be stochastic — so it could be epsilon-greedy. Yeah, great question. Okay, we'll go on to the next one.
[00:17:32] next one okay so this was my Monte Carlo
[00:17:37] one okay so this was my Monte Carlo policy
[00:17:38] policy evaluation now what we could try to do
[00:17:40] evaluation now what we could try to do is Monte Carlo online control so what
[00:17:44] is Monte Carlo online control so what I'm going to do here is I'm going to
[00:17:45] I'm going to do here is I'm going to introduce a different an additional line
[00:17:47] introduce a different an additional line here at the bottom which says after I do
[00:17:50] here at the bottom which says after I do an episode I'm going to potentially
[00:17:52] an episode I'm going to potentially change my policy so you can think of
[00:17:54] change my policy so you can think of this is like my policy evaluation part
[00:17:57] this is like my policy evaluation part and this is my policy Improvement
[00:17:59] and this is my policy Improvement and again I'll just write out what this
[00:18:01] and again I'll just write out what this means so what this means is that for for
[00:18:05] means so what this means is that for for each state each
[00:18:08] each state each s um the policy for S is going to be
[00:18:12] s um the policy for S is going to be equal to ARG
[00:18:15] equal to ARG Max Q of
[00:18:20] sa with probability 1us Epsilon else
[00:18:26] sa with probability 1us Epsilon else random
[00:18:29] random so that's what I mean by I say we're
[00:18:30] so that's what I mean by I say we're doing the policy Improvement step is we
[00:18:32] doing the policy Improvement step is we take our Q function we say either you
[00:18:35] take our Q function we say either you would take the argmax action or you
[00:18:38] would take the argmax action or you would act
[00:18:39] would act randomly sorry what are we looping over
[00:18:41] randomly sorry what are we looping over in the outermost
[00:18:43] in the outermost Loop is
[00:18:44] Loop is it yeah this would be yes this would be
[00:18:47] it yeah this would be yes this would be K yeah yeah so this is just you can
[00:18:49] K yeah yeah so this is just you can think of the loop here and I'll write
[00:18:50] think of the loop here and I'll write that down
[00:18:52] that down um Loop over the
[00:18:54] um Loop over the episodes so it's like I play one game of
[00:18:57] episodes so it's like I play one game of Atari and then I update my policy
[00:18:59] Atari and then I update my policy evaluation and maybe I change my policy
[00:19:01] evaluation and maybe I change my policy then I do another round of Atari so I
[00:19:03] then I do another round of Atari so I like Play Break Out you know a million
[00:19:06] like Play Break Out you know a million times sometimes more than that in some
[00:19:08] times sometimes more than that in some of these
[00:19:09] of these cases yeah and right is it yeah it's
[00:19:12] cases yeah and right is it yeah it's still yeah I'm going to confused by this
[00:19:15] still yeah I'm going to confused by this last line that you have out there so I
[00:19:18] last line that you have out there so I mean isn't it implicit that you are
[00:19:21] mean isn't it implicit that you are using the a new you're using a new Que
[00:19:24] using the a new you're using a new Que in the on let's say you're done with
[00:19:27] in the on let's say you're done with iteration number k
[00:19:31] when you're sampling the next episode
[00:19:33] when you're sampling the next episode you're using the
[00:19:35] you're using the updated ah great question okay so maybe
[00:19:38] updated ah great question okay so maybe I should so what this says here is that
[00:19:41] I should so what this says here is that um initially you construct so your Q
[00:19:43] um initially you construct so your Q initially is zero everywhere you could
[00:19:45] initially is zero everywhere you could initialize in some ways but your Q is
[00:19:46] initialize in some ways but your Q is zero everywhere and you're going to slap
[00:19:48] zero everywhere and you're going to slap something that's egy with respect to
[00:19:49] something that's egy with respect to that now if your Q value is zero
[00:19:52] that now if your Q value is zero everywhere it means that all of your
[00:19:53] everywhere it means that all of your actions are tied you have no information
[00:19:55] actions are tied you have no information you basically are just acting randomly
[00:19:57] you basically are just acting randomly what this says is that the way we Act is
[00:19:59] what this says is that the way we Act is always with respect to our current
[00:20:01] always with respect to our current policy so the first time are you can
[00:20:04] policy so the first time are you can think of as like motor babbling right
[00:20:05] think of as like motor babbling right like your agent will just like randomly
[00:20:06] like your agent will just like randomly Press buttons it'll move over the screen
[00:20:09] Press buttons it'll move over the screen it'll do that till it wins or loses the
[00:20:10] it'll do that till it wins or loses the game and then it will update its Q value
[00:20:15] game and then it will update its Q value and what this is saying is that the next
[00:20:17] and what this is saying is that the next time you're going to change what that
[00:20:19] time you're going to change what that policy is that you're using to act so
[00:20:22] policy is that you're using to act so hopefully it won't Babble quite as much
[00:20:24] hopefully it won't Babble quite as much it's like oh well sometimes I hit
[00:20:25] it's like oh well sometimes I hit something and then I got an increase in
[00:20:26] something and then I got an increase in the points so maybe I'll try to do that
[00:20:28] the points so maybe I'll try to do that action
[00:20:36] Great question — I have not said anything about that yet; I haven't said anything about what the properties of this are. [Student, in the back] Is it required to do this on-policy, or could you do it off-policy — collect a number of demonstrations and then update later? — Great question. Yes, you can definitely do off-policy, and we'll see that in a couple of slides. Okay, any more questions?
[00:20:55] couple slides yeah okay any questions these are great okay so you should be be
[00:20:58] these are great okay so you should be be skeptical that this is naturally going
[00:21:00] skeptical that this is naturally going to do anything reasonable but it's
[00:21:01] to do anything reasonable but it's certainly something you could run right
[00:21:03] certainly something you could run right like something that you could write down
[00:21:04] like something that you could write down in a computer so this is a process so
[00:21:07] in a computer so this is a process so then a question would be and I I put
[00:21:11] then a question would be and I I put some this is an optional worked example
[00:21:13] some this is an optional worked example you can go through it just to think
[00:21:14] you can go through it just to think about like how it would actually update
[00:21:16] about like how it would actually update these so some important properties are
[00:21:19] these so some important properties are how expensive is this does it converge
[00:21:21] how expensive is this does it converge to the optimal qar as well as what is is
[00:21:24] to the optimal qar as well as what is is it is its empirical performance
[00:21:29] it is its empirical performance let's think first whether or not we
[00:21:30] let's think first whether or not we think this is a good idea um and whether
[00:21:32] think this is a good idea um and whether or not we think that this procedure here
[00:21:35] or not we think that this procedure here is guaranteed to become a good estimate
[00:21:39] is guaranteed to become a good estimate of um the optimal qar so this is another
[00:21:43] of um the optimal qar so this is another check your understanding it's on Ed but
[00:21:46] check your understanding it's on Ed but what I would like you to think about
[00:21:49] what I would like you to think about here is that given the process I've just
[00:21:51] here is that given the process I've just shown you here do you think that the Q
[00:21:54] shown you here do you think that the Q value we're Computing is an estimate of
[00:21:57] value we're Computing is an estimate of the current policy and do you think it
[00:21:59] the current policy and do you think it will ultimately become
[00:22:02] will ultimately become qar and if you think it might or might
[00:22:05] qar and if you think it might or might not under some conditions that's fine
[00:22:06] not under some conditions that's fine too you can put that in
[00:22:08] too you can put that in there
[00:22:11] [Student] ...k changes with each episode? — That's right. [Student] What is Q^{pi_k}? — Q^{pi_k} is the true state-action value function for the pi_k policy: the expected discounted sum of rewards if you start in state s, take action a, and then follow pi_k. Yeah, thanks for the clarification.
[00:23:10] [Largely inaudible student exchange] — I have not said anything yet about whether... oh, yes — why are we doing that? We'll talk about that. I forgot that I had already put that in there; yeah, we can talk about it.
[00:23:41] Okay, and one thing just to note here — and I think this relates to question two — just to be clear as you're thinking about this: this is like an approximation to policy iteration. We're kind of doing policy evaluation and then policy improvement, but it's helpful to think about how much time we're spending doing policy improvement versus policy evaluation. What this is saying is that you're going to sample one episode and then do policy evaluation — this is all just one episode. It's like I'm going to play until I win or lose at Breakout once, under a particular policy, and then I'm going to change my policy; then I play with my new policy once, and I change my policy again. And some of those games might be really long, and some of them might be really short. So this is just representing that.
[00:24:45] this is like representing after so then I think I might just be
[00:24:51] confused when they're just like if I'm
[00:24:53] confused when they're just like if I'm playing a
[00:24:54] playing a game the epis just follow one after the
[00:24:57] game the epis just follow one after the other there just there's just one K
[00:25:00] other there just there's just one K episode there's one kith episode yeah so
[00:25:03] episode there's one kith episode yeah so like K is like like I play so if K is
[00:25:06] like K is like like I play so if K is one I'm going to play my first game and
[00:25:08] one I'm going to play my first game and I'm going to play it until I win or lose
[00:25:09] I'm going to play it until I win or lose until the you know the game so maybe
[00:25:11] until the you know the game so maybe breakout finishes or maybe I'm playing
[00:25:13] breakout finishes or maybe I'm playing Tetris and like I fail and I died and
[00:25:15] Tetris and like I fail and I died and that and that is one episode and I'm
[00:25:17] that and that is one episode and I'm going to use that to then update my Q
[00:25:19] going to use that to then update my Q function then I'm going to change it and
[00:25:20] function then I'm going to change it and say okay my next round I'm going to play
[00:25:22] say okay my next round I'm going to play differently then I play Tetris again
[00:25:24] differently then I play Tetris again until I fail and then um I see what the
[00:25:27] until I fail and then um I see what the total point points are I update my Q
[00:25:29] total point points are I update my Q function and I repeat and some of those
[00:25:31] function and I repeat and some of those episodes might be really short so maybe
[00:25:33] episodes might be really short so maybe the first time particularly for these
[00:25:34] the first time particularly for these agents the first time they play Tetris
[00:25:36] agents the first time they play Tetris maybe they lose in like 10 steps might
[00:25:39] maybe they lose in like 10 steps might be a really short step um later maybe
[00:25:41] be a really short step um later maybe they play for a long time but in general
[00:25:43] they play for a long time but in general I've not told you anything about how
[00:25:45] I've not told you anything about how long these episodes are they might be
[00:25:47] long these episodes are they might be really short but they might be really
[00:25:51] long
[00:25:53] Okay, and one useful way I find to think about this: suppose they're really short — really, really short, like I take two steps and just fail, I did something really dumb. In that case, think about whether Q would be a good estimate of Q^{pi_k}: would it be good if you've only seen two states, or would it be pretty bad, and why? So turn to someone near you — I think most people have voted or have written something — and check and see what you think.
[00:30:10] Okay, awesome. I'm hearing a lot of really good discussion, but I'm going to interrupt you because I want to make sure we get to DQN. So one of the reasons I bring up this particular example is that here it's tabular — things are a little bit smaller, so it's a bit easier to see — but essentially what I want you to get out of today is that it should be sort of shocking that reinforcement learning works. We're not going to have time to go through all the deep mathematical reasons for why it does work sometimes in this class, but I'm happy to give people pointers. So there are several things that are really kind of odd if you start to think this through.
[00:30:43] First of all, Q is not an estimate of Q^{pi_k}. It is not, because it is averaging over policies that are changing every episode — or potentially changing every episode — and in fact in general it will be changing, because we're decaying epsilon each round, which means we're making things more and more deterministic; but in addition to that, our Q itself might be changing. Essentially, I'm trying a policy for one round, then I update my Q, and then I try something again. An extreme example of this would be flipping a coin once and deciding what its bias is — that's just not very much data to do this evaluation — and you're also averaging this over many, many different policies. So Q is not an estimate of Q^{pi_k}; it's this weird weighted average of all the previous data and all the policies you've run before.
[00:31:40] before like is the last latest k k minus
[00:31:45] before like is the last latest k k minus one well but not really right because
[00:31:47] you've averaged in that part, and that part is from π_{k+1}, but the old estimate was this weird weighted average over all of the other policies you've tried. So yes, it is that, but it's that plus all the other policies, so it's this weird thing, right?
[00:32:03] The second thing, and I was talking to some people about this, is that we're only doing one rollout to try to evaluate a policy, and you might imagine there's a lot of stochasticity. Even in some games there are random rolls of the dice and things like that, which means that even with the same strategy you might get different outcomes each time. It would be like if you drove to SF once and there was no traffic, so you decide you can always get to SF in, I don't know, 20 minutes on the highway. Those of you who have driven to SF know that often there's lots of traffic, so you would need to average over many rounds of doing this to see how good a particular route is.
[00:32:41] So the weird thing here is that we're doing just one rollout, we're averaging it into this weird Q estimate, which is now a weighted average over all the policies we've run, and we have this weird epsilon thing.
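The one-rollout problem can be seen in a tiny simulation (the commute numbers below are made up for illustration): a single sample of a noisy return is a poor estimate, while an average over many rollouts concentrates near the true mean.

```python
import random

def commute_minutes(rng):
    # Hypothetical "drive to SF": a 20-minute baseline, plus a
    # 40-minute traffic delay that hits half the time (made-up numbers).
    return 20.0 + (40.0 if rng.random() < 0.5 else 0.0)

rng = random.Random(0)
one_rollout = commute_minutes(rng)                  # a single noisy sample
samples = [commute_minutes(rng) for _ in range(10_000)]
estimate = sum(samples) / len(samples)              # near the true mean of 40
```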
[00:32:58] thing and it is should not be clear yet that we will necessarily converge to qar
[00:33:01] that we will necessarily converge to qar like we are getting more and more
[00:33:02] like we are getting more and more deterministic over time because we're
[00:33:04] deterministic over time because we're reducing Epsilon so reducing Epsilon
[00:33:07] reducing Epsilon so reducing Epsilon here towards zero eventually we're going
[00:33:08] here towards zero eventually we're going to converge towards something
[00:33:11] to converge towards something deterministic but you may or may not be
[00:33:13] deterministic but you may or may not be convinced yet that the thing we're going
[00:33:14] convinced yet that the thing we're going to converge to is actually
[00:33:18] qar so
[00:33:20] qar so fortunately there are some sufficient
[00:33:23] fortunately there are some sufficient conditions under which we can guarantee
[00:33:25] conditions under which we can guarantee that this sort of thing will converge to
[00:33:26] that this sort of thing will converge to qar and it's really it's quite beautiful
[00:33:30] qar and it's really it's quite beautiful that this works okay so one is what's
[00:33:32] that this works okay so one is what's called greedy in the limit of infinite
[00:33:34] called greedy in the limit of infinite exploration or
[00:33:36] exploration or Glee so the idea in this case is that if
[00:33:39] Glee so the idea in this case is that if you can ensure that all state action
[00:33:41] you can ensure that all state action pairs are visited an infinite number of
[00:33:43] pairs are visited an infinite number of times meaning the number of counts that
[00:33:46] times meaning the number of counts that you have for a particular State and
[00:33:48] you have for a particular State and action pair goes to Infinity for all
[00:33:50] action pair goes to Infinity for all states and
[00:33:52] states and actions this is for all
[00:33:59] and the behavior policy converges to the greedy policy. What I mean by the behavior policy is the policy you're actually using to make decisions in the world; it will be important soon, because there will be distinctions between this and other policies, which is why we call it the behavior policy. So: if you sample all state-action pairs an infinite number of times, and your behavior policy converges to the greedy policy, meaning that asymptotically the action you select in a state is exactly equal to the argmax of your Q function with probability one (you're just getting more and more deterministic), then you are being greedy in the limit of infinite exploration. That says you're exploring everything an infinite number of times, always continuing to try all actions in all states, while getting more and more deterministic. That is what it means to be GLIE.
[00:34:53] If you have a GLIE algorithm... and I'll just note here that a simple way to get one is ε-greedy where epsilon is reduced to zero at the rate ε_i = 1/i on the i-th episode. That's a simple choice, and it should hold that, as long as you have such an ε-greedy strategy, you will be able to visit all states and actions: you're going to be visiting all states and actions under this GLIE strategy. Then, under that condition, the Monte Carlo algorithm I just showed you for tabular representations will converge to Q*. Which means that as long as you decay epsilon at this rate, you are actually converging to Q*: you're getting more and more deterministic, you're still visiting all states and actions an infinite number of times, and this procedure is guaranteed to asymptotically get you to the optimal Q function, which is pretty cool, and it should be somewhat surprising.
[00:36:00] All right, so that is GLIE, and that is one of the reasons we like to think about ε-greedy algorithms: they have this nice property that we can prove we will get to an optimal policy, even though all we're doing is acting in the world and collecting this data.
[00:36:14] Now, what you should be thinking at this point is: all right, here's the Monte Carlo approach to doing this; there's probably going to be a temporal difference approach to doing this too. That's what we're going to see now: temporal difference methods for control.
[00:36:29] Okay, so one of the interesting things is that there are going to be two different types of algorithms we focus on for temporal difference control. The idea in these settings is that we alternate between two steps, again this kind of policy evaluation versus policy improvement, and one of the key things to think about in this case is how much time we are spending doing evaluation versus improvement, what we are trying to evaluate, and what we are improving with respect to.
[00:36:58] So the idea now is that we're going to compute Q^π using temporal difference updating with an ε-greedy policy, and then we're going to do policy improvement in the same way we saw before for Monte Carlo methods: we do this ε-greedy thing where we are greedy with respect to our current Q value. The first algorithm we're going to see is called SARSA, and the reason it is called SARSA is that it is short for state, action, reward, next state, next action: s, a, r, s', a'. That's an easy way to remember why this method is called SARSA, because those are the tuples we need in order to do updates; we need (s, a, r, s', a') to do an update.
[00:37:49] This is going to be an on-policy algorithm, which is related to what was suggested in the back (remind me your name?). Yeah, exactly what was said: can we also use off-policy data? We'll see that really shortly, but SARSA is going to be on-policy, and what we mean by that is that it computes an estimate of the Q value of the policy we're using to act, the policy we're using to make decisions in the world. So let's see how it works.
[00:38:18] world so let's see how it works so in general the the form of Sara is the
[00:38:21] general the the form of Sara is the following um we are going to iterate our
[00:38:25] following um we are going to iterate our Loop is going to be such that
[00:38:28] Loop is going to be such that we start off so this is the a the we
[00:38:30] we start off so this is the a the we start in some State this is the S we
[00:38:32] start in some State this is the S we take an action a um we observe reward in
[00:38:35] take an action a um we observe reward in the next state and then we Loop and we
[00:38:39] the next state and then we Loop and we take the next action still according to
[00:38:41] take the next action still according to the same
[00:38:43] the same policy and then what we're going to do
[00:38:45] policy and then what we're going to do is we're going to update our Q function
[00:38:48] is we're going to update our Q function given this topple of sorsa essentially
[00:38:51] given this topple of sorsa essentially and what we're going to do in this case
[00:38:53] and what we're going to do in this case is it's going to look similar to what we
[00:38:55] is it's going to look similar to what we saw before so we're going to have
[00:38:58] saw before so we're going to have our updated one is our old
[00:39:04] our updated one is our old Value Plus Alpha so this is like our
[00:39:07] Value Plus Alpha so this is like our learning rate our
[00:39:10] learning rate our Target St + 1 A+
[00:39:14] Target St + 1 A+ one minus Q of
[00:39:20] st8 so this is the
[00:39:26] Target and it's it's going to look
[00:39:27] It's going to look similar to what we saw for TD(0), where we plug in our immediate reward plus our estimate of the expected discounted sum of rewards starting in that next state. One of the important things to notice in this case is that we are plugging in the actual action we took in the next state. We're saying: what is the expected discounted sum of rewards starting in this state and taking this action? Well, one estimate of it is the immediate reward I got, plus γ times the Q value for the state I reached and the action I would take under this policy next. That's one of the reasons it's called on-policy: it's specific to the action you would actually take under this policy.
[00:40:10] All right, then the next thing we do is policy improvement, and what we do in this case is again similar to what we saw in the other one. For all s (that notation just means for every state, for anybody who hasn't seen it):

π(s) = argmax_a Q(s, a) with probability 1 − ε, else a random action

Then we update our time step, we update our epsilon, and we repeat: we go to the next state, take an action, and repeat this updating. Yes, go ahead.
[00:41:06] updating so this is called Gap um quick question uh like do we I I have bit
[00:41:09] question uh like do we I I have bit confused about setting Pi of s do we say
[00:41:13] confused about setting Pi of s do we say Pi is a deterministic policy that is one
[00:41:16] Pi is a deterministic policy that is one of this with this probability and the
[00:41:18] of this with this probability and the other one with the other probability or
[00:41:19] other one with the other probability or are we saying it's a stochastic policy
[00:41:22] are we saying it's a stochastic policy that can it's a stochastic policy yeah
[00:41:25] that can it's a stochastic policy yeah so it's a um it's a stochastic policy at
[00:41:27] so it's a um it's a stochastic policy at the very beginning it's totally random
[00:41:29] the very beginning it's totally random you just take any action in any state
[00:41:31] you just take any action in any state later you're defining it with respect to
[00:41:32] later you're defining it with respect to your current Q value and you're either
[00:41:34] your current Q value and you're either Being Greedy with respect to that Q
[00:41:36] Being Greedy with respect to that Q value selecting action at random yeah so
[00:41:39] value selecting action at random yeah so um one concern that I had was like What
[00:41:41] um one concern that I had was like What if we reach a Dom State
[00:41:44] if we reach a Dom State and ah good question okay and um this
[00:41:46] and ah good question okay and um this actually came up in another conversation
[00:41:48] actually came up in another conversation earlier this morning yes so if it is if
[00:41:51] earlier this morning yes so if it is if you reach a terminal State then you just
[00:41:52] you reach a terminal State then you just reset so if t t + 2 is
[00:42:01] terminal
[00:42:03] terminal reset
[00:42:05] reset episode and
[00:42:08] episode and S so if you ever reach a state where
[00:42:10] S so if you ever reach a state where it's Turnal what would happen next is
[00:42:12] it's Turnal what would happen next is then your whole whole episode just
[00:42:14] then your whole whole episode just resets you sample as initial state from
[00:42:16] resets you sample as initial state from the world and then you repeat so just
[00:42:17] the world and then you repeat so just like if I like finish my game I failed
[00:42:19] like if I like finish my game I failed at Tesla it reinitializes the world so
[00:42:22] at Tesla it reinitializes the world so these are still sort of assumed to be
[00:42:23] these are still sort of assumed to be continuing processes
[00:42:25] Yeah? Um, I'm wondering... Great question. What we're going to see in just a slide or two, and probably at least half of you have seen this before, is Q-learning, and that's where it's going to be off-policy.
[00:42:46] Okay, very quick question: when it's s_{t+2}, do we do step seven, or do we skip it for one step and do it in the next one, if it's terminal? If it's terminal, you would halt here, and then you would reset the whole thing: you would take an action, observe the reward and next state, and then jump into step five. So you'd sort of have to reset to step two.
[00:43:09] two great question all right so let's see um well
[00:43:12] question all right so let's see um well first let's talk about whether this is
[00:43:13] first let's talk about whether this is guaranteed to do anything reasonable um
[00:43:15] guaranteed to do anything reasonable um and then we'll going so I've written
[00:43:18] and then we'll going so I've written this out neatly here and then there's a
[00:43:20] this out neatly here and then there's a worked example for the Mars Rob at the
[00:43:22] worked example for the Mars Rob at the end of the
[00:43:23] end of the slides okay so one thing to note here to
[00:43:27] slides okay so one thing to note here to is that now we've defined a general
[00:43:28] is that now we've defined a general learning rate so that we have a general
[00:43:30] learning rate so that we have a general learning rate here okay and we also have
[00:43:34] learning rate here okay and we also have make sure I keep this in
[00:43:37] here we're going to keep updating our
[00:43:40] here we're going to keep updating our Epsilon okay so is this a good
[00:43:45] Epsilon okay so is this a good approach so we can think of a couple
[00:43:47] approach so we can think of a couple different things here we can think of
[00:43:48] different things here we can think of the computational complexity so here
[00:43:51] the computational complexity so here after each tle we're doing an
[00:43:53] after each tle we're doing an update and in fact we know that that's
[00:43:56] update and in fact we know that that's in general only going to um change the Q
[00:43:58] in general only going to um change the Q value for the states and the actions
[00:44:00] value for the states and the actions that we're updating so we just deserve
[00:44:01] that we're updating so we just deserve doing that that small update each time
[00:44:04] doing that that small update each time we don't have to sum over all the states
[00:44:06] we don't have to sum over all the states so there's nothing that depends on the
[00:44:08] so there's nothing that depends on the state space siid per update but of
[00:44:10] state space siid per update but of course we're doing this many many many
[00:44:12] course we're doing this many many many times does this converge to the optimal
[00:44:14] times does this converge to the optimal Q function so what we have here in this
[00:44:17] Q function so what we have here in this case is we have this weighted um
[00:44:19] case is we have this weighted um combination between our last Q function
[00:44:21] combination between our last Q function and like this new
[00:44:23] and like this new Target and again Q is an estimate of the
[00:44:27] Target and again Q is an estimate of the performance of a policy that might be
[00:44:28] performance of a policy that might be changing each time
[00:44:30] changing each time point so it's similar to Monte Carlo
[00:44:32] point so it's similar to Monte Carlo like we're just like we're constantly
[00:44:33] like we're just like we're constantly changing the policy in this case and so
[00:44:36] changing the policy in this case and so that should feel a little bit
[00:44:37] that should feel a little bit concerning and empirically it often does
[00:44:39] concerning and empirically it often does quite well but Q learning is more
[00:44:42] quite well but Q learning is more popular so what are the convergence
[00:44:44] popular so what are the convergence properties so it turns out that in terms
[00:44:46] properties so it turns out that in terms of some of the mathematical formulations
[00:44:48] of some of the mathematical formulations this relates really strongly to
[00:44:49] this relates really strongly to stochastic
[00:44:51] stochastic approximation and this is a deep
[00:44:53] approximation and this is a deep literature with lots of really amazing
[00:44:54] literature with lots of really amazing results kind of the the in the finite
[00:44:57] results kind of the the in the finite State and finite action case it's going
[00:44:59] State and finite action case it's going to converge to the optimal qar for sarsa
[00:45:03] to converge to the optimal qar for sarsa if your policy sequence satisfies the
[00:45:06] if your policy sequence satisfies the condition of Glee so we're going to
[00:45:07] condition of Glee so we're going to visit all states and actions an infinite
[00:45:08] visit all states and actions an infinite number of times and we're getting
[00:45:10] number of times and we're getting greedier and greedier over time and we
[00:45:13] greedier and greedier over time and we have to put in a condition about the
[00:45:14] have to put in a condition about the learning rates the step sizes so in
[00:45:18] learning rates the step sizes so in particular they have to satisfy the
[00:45:19] particular they have to satisfy the Ruben's Monroe
[00:45:21] Ruben's Monroe sequence so they have to satisfy these
[00:45:23] sequence so they have to satisfy these two things which is their sum goes to
[00:45:25] two things which is their sum goes to infinity and their squared this an
[00:45:27] infinity and their squared this an infinity and we've seen this
[00:45:30] infinity and we've seen this before and an example of this would be a
[00:45:32] before and an example of this would be a = 1/t satisfies these
[00:45:37] conditions
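We can sanity-check numerically that α_t = 1/t behaves the way the Robbins-Monro conditions require: the partial sums keep growing without bound (like log T), while the partial sums of squares stay bounded, approaching π²/6 ≈ 1.645.

```python
import math

T = 1_000_000
s = sq = 0.0
for t in range(1, T + 1):
    a = 1.0 / t
    s += a        # partial sums grow like log(T): unbounded as T increases
    sq += a * a   # partial sums of squares approach pi^2 / 6 from below

grows_unbounded = s > 0.9 * math.log(T)   # ~13.8 at T = 10^6, so s exceeds it
squares_bounded = sq < math.pi ** 2 / 6   # stays below the finite limit
```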
[00:45:40] So these results really rely on these nice results from stochastic approximation, because it should be a little bit surprising: you can think of it as these different mixing processes going on. Our policy is changing, our estimates are changing; how can we be sure it's going to be stable enough that over time we actually converge to something that's both fixed, so we're not just going to oscillate forever, and optimal? It should not be at all clear why this would necessarily work, and this is where we rely on those results from stochastic approximation, which also had to be extended to cover these particular cases in a number of really beautiful papers from the 1990s; there are 1992 and 1994 papers that show this. Some really cool results that illustrate why this is possible.
[00:46:40] Okay, so SARSA for tabular settings, under some mild conditions, is guaranteed to converge to Q*. Now let's see if we can do off-policy learning. Off-policy learning is the idea that we're going to try to estimate and evaluate a policy using experience gathered from following a different policy.
[00:46:58] policy so so far we've been thinking about like
[00:46:59] so so far we've been thinking about like Monte Carlo methods and starsa where
[00:47:01] Monte Carlo methods and starsa where we're at least sort of kind of trying to
[00:47:03] we're at least sort of kind of trying to always approximate the value of the most
[00:47:05] always approximate the value of the most recent policy or averaged over all those
[00:47:07] recent policy or averaged over all those policies but now we're going to
[00:47:09] policies but now we're going to explicitly be trying to estimate qar at
[00:47:12] explicitly be trying to estimate qar at all time
[00:47:13] all time points okay so in Q
[00:47:17] points okay so in Q learning we are going to try to directly
[00:47:20] learning we are going to try to directly estimate the value of pi star which
[00:47:22] estimate the value of pi star which remember we don't know because if we
[00:47:23] remember we don't know because if we knew what pi star was then we wouldn't
[00:47:25] knew what pi star was then we wouldn't have to do any of this learning with
[00:47:27] have to do any of this learning with another Behavior policy pi b so we're
[00:47:29] another Behavior policy pi b so we're going to be acting in one way and we're
[00:47:31] going to be acting in one way and we're going to be trying to use that data to
[00:47:33] going to be trying to use that data to estimate the value of an alternative
[00:47:35] estimate the value of an alternative policy and that's what how what Q
[00:47:37] policy and that's what how what Q learning
[00:47:38] learning does so in Q learning the key difference
[00:47:42] does so in Q learning the key difference is that instead of trying to think about
[00:47:44] is that instead of trying to think about what is the action we actually took on
[00:47:46] what is the action we actually took on the next time step we're just going to
[00:47:48] the next time step we're just going to figure out what is the best action I
[00:47:50] figure out what is the best action I could have taken because we know for the
[00:47:53] could have taken because we know for the qstar value it is the estimate of the
[00:47:56] qstar value it is the estimate of the optimal
[00:47:57] optimal expected reward you could get if you
[00:47:59] expected reward you could get if you take the the current action and then act
[00:48:01] take the the current action and then act optimally from now
[00:48:02] optimally from now on so really you would normally like to
[00:48:05] on so really you would normally like to have something like this so sum over S
[00:48:07] have something like this so sum over S Prime probability of S Prime given s a
[00:48:10] Prime probability of S Prime given s a times V star of S Prime that's what you'd
[00:48:13] times V star of S Prime that's what you'd have in like the Bellman equation and
[00:48:16] have in like the Bellman equation and what we're going to do here what Q
[00:48:17] what we're going to do here what Q learning does is it
[00:48:20] learning does is it approximates that by this
[00:48:23] approximates that by this Max and that is different than what
[00:48:25] Max and that is different than what SARSA does because SARSA used the actual
[00:48:30] SARSA does because SARSA used the actual action and Q learning says I don't really care
[00:48:32] action and Q learning says I don't really care what actual action you took I care about
[00:48:34] what actual action you took I care about what is the best thing you could have
[00:48:35] what is the best thing you could have done there because that's giving me a
[00:48:36] done there because that's giving me a better estimate of the maximum expected
[00:48:38] better estimate of the maximum expected discounted sum of rewards I'd get from
[00:48:41] discounted sum of rewards I'd get from that state till the end of
[00:48:43] that state till the end of time so that is what Q learning is doing
[00:48:46] time so that is what Q learning is doing so it looks really similar to the SARSA
[00:48:48] so it looks really similar to the SARSA update but our target is going to be the
[00:48:50] update but our Target is going to be the reward I got plus the best reward that I
[00:48:53] reward I got plus the best reward that I think I could have achieved from that
[00:48:54] think I could have achieved from that next state
[00:48:58] okay all right so then we get an
[00:49:00] okay all right so then we get an algorithm that looks extremely similar
[00:49:02] algorithm that looks extremely similar to what we saw before but we have this
[00:49:05] to what we saw before but we have this Max over the next action and then I'll
[00:49:08] Max over the next action and then I'll just make sure I think I forgot to write
[00:49:13] that so whether we're doing Monte Carlo
[00:49:16] that so whether we're doing Monte Carlo or SARSA or Q learning in all of these
[00:49:18] or SARSA or Q learning in all of these cases we're interleaving gathering some
[00:49:20] cases we're interleaving Gathering some data under our current Epsilon greedy
[00:49:22] data under our current Epsilon greedy policy and then using it to update a q
[00:49:24] policy and then using it to update a q value and because we don't know what the
[00:49:27] value and because we don't know what the actual Q function is we're sort of doing
[00:49:29] actual Q function is we're sort of doing this weighted approximation between um
[00:49:33] this weighted approximation between um our current estimate of the Q function and
[00:49:35] our current estimate of the Q function and the target that we just put in and we do
[00:49:37] the Target that we just put in and we do this over and over and over
[00:49:42] again so similar to SARSA the conditions
[00:49:45] again so similar to SARSA the conditions to make sure that Q learning converges in the
[00:49:47] to make sure that Q learning converges in the tabular case so things get a lot more
[00:49:49] tabular case so things get a lot more complicated once we go
[00:49:52] complicated once we go into the function approximation
[00:49:55] into the function approximation case but in order for Q learning um with
[00:49:58] case but in order for Q learning um with Epsilon greedy exploration to converge to the optimal
[00:50:00] Epsilon greedy exploration to converge to the optimal Q star you again need to visit everything
[00:50:02] Q star you again need to visit everything infinitely often your step sizes have to
[00:50:04] infinitely often your step sizes have to satisfy the Robbins Monro conditions um
[00:50:08] satisfy the Robbins Monro conditions um and one important thing to notice here
[00:50:12] and one important thing to notice here is that you can estimate qar without
[00:50:16] is that you can estimate qar without being
[00:50:19] being Glee which is different than sarsa
[00:50:22] Glee which is different than sarsa because you're always doing this
[00:50:25] because you're always doing this Max so so even if you act completely
[00:50:28] Max so so even if you act completely randomly so just like infinite
[00:50:30] randomly so just like infinite exploration not being greedy you can
[00:50:33] exploration not being greedy you can learn qar because in your qar estimate
[00:50:38] learn qar because in your qar estimate here you're always doing this Max
[00:50:41] here you're always doing this Max a that's an important difference
[00:50:43] a that's an important difference compared to
[00:50:45] sersa but if you actually want to use
[00:50:47] sersa but if you actually want to use that information to make good decisions
[00:50:49] that information to make good decisions in the world you need to become greedy
[00:50:50] in the world you need to become greedy over time and be using that information
[00:50:53] over time and be using that information to actually select the best action
[00:50:54] to actually select the best action according to your Q function
[00:50:57] according to your Q function so in for EG algorithms with Q learning
[00:50:59] so in for EG algorithms with Q learning you normally Decay your Epsilon over
[00:51:01] you normally Decay your Epsilon over time so you're getting more and more
[00:51:02] time so you're getting more and more deterministic and you're taking your
[00:51:04] deterministic and you're taking your estimate of what qar is and using it to
[00:51:06] estimate of what qar is and using it to make
[00:51:10] decisions okay we're now going to go
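A minimal sketch of the Epsilon greedy action selection with a decaying Epsilon that she just described (my own code; the particular decay schedule and constants are assumptions, not from the lecture):

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def decayed_epsilon(t, eps0=1.0, decay=0.995, eps_min=0.01):
    """A common (hypothetical) schedule: epsilon shrinks toward a floor,
    so the policy becomes more and more deterministic over time."""
    return max(eps_min, eps0 * decay ** t)
```

With `epsilon = 0` this is exactly the greedy policy with respect to the current Q estimates.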
[00:51:12] decisions okay we're now going to go into function approximation I'm just
[00:51:14] into function approximation I'm just going to pause there in case people had
[00:51:15] going to pause there in case people had any
[00:51:22] questions um so [inaudible student question]
[00:51:32] great great question so um if there are
[00:51:35] great great question so um if there are no ties in your Q function like as in
[00:51:38] no ties in your Q function like as in like for any action there is or any
[00:51:39] like for any action there is or any state there is a uniquely best action
[00:51:41] state there is a uniquely best action it'll converge to a deterministic
[00:51:44] it'll converge to a deterministic policy um if there are ties it'll
[00:51:46] policy um if there are ties it'll generally you know pick between those
[00:51:47] generally you know pick between those arbitrarily there'll be like an infinite
[00:51:49] arbitrarily there'll be like an infinite number of optimal policies if there are
[00:51:51] number of optimal policies if there are ties in your Q function great question
[00:51:58] all right so now what we're going to do
[00:51:59] all right so now what we're going to do is we're going to layer on function
[00:52:00] is we're going to layer on function approximation on top so this was all
[00:52:02] approximation on top so this was all assuming that we just had this table
[00:52:04] assuming that we just had this table where we could write down the value for
[00:52:06] where we could write down the value for every state and action separately and
[00:52:08] every state and action separately and now we want to use function
[00:52:09] now we want to use function approximation so we can start to do
[00:52:11] approximation so we can start to do problems like
[00:52:13] problems like Atari so the motivation for doing this
[00:52:16] Atari so the motivation for doing this and I know for those of you who've taken
[00:52:17] and I know for those of you who've taken machine learning this is probably clear
[00:52:20] machine learning this is probably clear but it's nice to think about what this
[00:52:21] but it's nice to think about what this means in the context of reinforcement
[00:52:23] means in the context of reinforcement learning so what are the things that we
[00:52:24] learning so what are the things that we might be storing or trying to to
[00:52:26] might be storing or trying to to manipulate that might be the Dynamics or
[00:52:28] manipulate that might be the Dynamics or reward model the value function the
[00:52:30] reward model the value function the state action value function or the
[00:52:32] state action value function or the policy and if you were thinking about
[00:52:34] policy and if you were thinking about pixel space you do not want to write
[00:52:36] pixel space you do not want to write that down as like one different value
[00:52:39] that down as like one different value for every bless you for every different
[00:52:41] for every bless you for every different um possible image in the world so we're
[00:52:44] um possible image in the world so we're going to want compact representations
[00:52:46] going to want compact representations like what we can do with neural networks
[00:52:48] like what we can do with neural networks so that we reduce the memory we need to
[00:52:50] so that we reduce the memory we need to like write down those Dynamics models
[00:52:51] like write down those Dynamics models the value function or Q or the policy we
[00:52:54] the value function or Q or the policy we reduce the computation and ideally we
[00:52:56] reduce the computation and ideally we might even be able to reduce the
[00:52:59] might even be able to reduce the experience and I think this last point
[00:53:01] experience and I think this last point maybe is a particularly interesting one
[00:53:02] maybe is a particularly interesting one to think about so you can imagine like
[00:53:04] to think about so you can imagine like if you're agent is learning to play an
[00:53:06] if you're agent is learning to play an Atari game or play Breakout it um it
[00:53:09] Atari game or play Breakout it um it might want to know that like oh well if
[00:53:11] might want to know that like oh well if these pixels are slightly different here
[00:53:13] these pixels are slightly different here most of the time you might still take
[00:53:14] most of the time you might still take the same decision and so then instead of
[00:53:17] the same decision and so then instead of having to learn from scratch what to do
[00:53:18] having to learn from scratch what to do in each state you can get this sort of
[00:53:20] in each state you can get this sort of generalization and that could be really
[00:53:22] generalization and that could be really important in terms of reducing the
[00:53:24] important in terms of reducing the amount of data we need to learn to make
[00:53:25] amount of data we need to learn to make good decisions
[00:53:29] all right so how do we do this what
[00:53:32] all right so how do we do this what we're going to try to do is we're going
[00:53:33] we're going to try to do is we're going to essentially do the same thing as what
[00:53:34] to essentially do the same thing as what we did before but we're also going to
[00:53:36] we did before but we're also going to have to incorporate a function
[00:53:37] have to incorporate a function approximation step so let's just think
[00:53:40] approximation step so let's just think about how we would do this if we had an
[00:53:41] about how we would do this if we had an oracle so what I mean by this is we're
[00:53:44] oracle so what I mean by this is we're not thinking yet right now about all the
[00:53:46] not thinking yet right now about all the learning and like Gathering data we're
[00:53:48] learning and like Gathering data we're just assuming how do we fit a function
[00:53:49] just assuming how do we fit a function to represent our Q
[00:53:51] to represent our Q function so let's imagine that you had
[00:53:53] function so let's imagine that you had an oracle that for any state and action
[00:53:55] an oracle that for any state and action it would give you the true value for a
[00:53:57] it would give you the true value for a particular policy and that state and
[00:53:59] particular policy and that state and action so it would tell you like that's
[00:54:01] action so it would tell you like that's three or that's
[00:54:03] three or that's seven so then you could say okay now
[00:54:05] seven so then you could say okay now I've just got a supervised learning
[00:54:06] I've just got a supervised learning problem I've got input tles of states
[00:54:09] problem I've got input tles of states and actions and I have output values of
[00:54:11] and actions and I have output values of my Q
[00:54:12] my Q function and what I want to do now is
[00:54:14] function and what I want to do now is just learn a function to like a
[00:54:16] just learn a function to like a regression function to say given the
[00:54:17] regression function to say given the state in action what is the
[00:54:19] state in action what is the output and so imagine that you're in a
[00:54:22] output and so imagine that you're in a case where like you know we have a
[00:54:24] case where like you know we have a continuous set of states and we only
[00:54:25] continuous set of states and we only have one
[00:54:27] have one action then you might just have all
[00:54:29] action then you might just have all these different
[00:54:31] these different points and maybe you just want to learn
[00:54:33] points and maybe you just want to learn a function that predicts the
[00:54:37] a function that predicts the Q for every single state you just learn
[00:54:40] Q for every single state you just learn like a parametric function or it could
[00:54:41] like a parametric function or it could be a deep neural
[00:54:43] be a deep neural network and in general just like in
[00:54:45] network and in general just like in supervised learning the objective is
[00:54:46] supervised learning the objective is going to be to find the best approximate
[00:54:48] going to be to find the best approximate representation of Q given some weights
[00:54:50] representation of Q given some weights or given some neural network
[00:54:53] or given some neural network architecture so we've got like you know
[00:54:55] architecture so we've got like you know some neural net
[00:54:58] and we're just going to fit this to try
[00:55:01] and we're just going to fit this to try to if we had these
[00:55:03] to if we had these points but of course we don't have these
[00:55:05] points but of course we don't have these points and we're going to see how we're
[00:55:07] points and we're going to see how we're going to handle it we don't have these
[00:55:08] going to handle it we don't have these points but this is kind of the intuition
[00:55:09] points but this is kind of the intuition is that if you had these then you could
[00:55:11] is that if you had these then you could do the function approximation step by
[00:55:13] do the function approximation step by saying okay well how do I I am going to
[00:55:15] saying okay well how do I I am going to handle generalization by using like a
[00:55:17] handle generalization by using like a linear function or a deep neural network
[00:55:19] linear function or a deep neural network to say for each of these states and
[00:55:20] to say for each of these states and actions what is the
[00:55:24] output so just to highlight here in this
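A toy version of that oracle-supervised fit (my own illustration, not the lecture's; a one-parameter linear model keeps the least-squares solution in closed form):

```python
def fit_q_oracle(xs, qs):
    """Fit a one-parameter linear model q_hat(x) = w * x to oracle values.

    xs: scalar features of (state, action) pairs; qs: the oracle's true
    Q^pi values. The least-squares solution is w = sum(x*q) / sum(x*x)."""
    return sum(x * q for x, q in zip(xs, qs)) / sum(x * x for x in xs)
```

With a deep network instead of one weight, the same idea becomes ordinary supervised regression on (state, action) → Q pairs.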
[00:55:27] output so just to highlight here in this class generally we will be focusing on
[00:55:29] class generally we will be focusing on methods that use stochastic gradient
[00:55:30] methods that use stochastic gradient descent to try to fit these functions
[00:55:32] descent to try to fit these functions and again I expect most of this is
[00:55:34] and again I expect most of this is familiar for you guys if you've done
[00:55:35] familiar for you guys if you've done machine learning if you haven't you can
[00:55:37] machine learning if you haven't you can come talk to me or any of the
[00:55:39] come talk to me or any of the Tas um generally we're going to just use
[00:55:42] Tas um generally we're going to just use mean squared error and we're going to
[00:55:44] mean squared error and we're going to try to fit a function that minimizes the
[00:55:47] try to fit a function that minimizes the mean squared error we're going to do
[00:55:48] mean squared error we're going to do gradient descent to find a local minimum
[00:55:50] gradient descent to find a local minimum and we're going to do stochastic
[00:55:52] and we're going to do stochastic gradient descent um just to you to
[00:55:54] gradient descent um just to you to compute an approximate gradient
[00:56:00] so in this
[00:56:03] so in this case I have here is just to write that
[00:56:06] case I have here is just to write that out really quickly you would have
[00:56:07] out really quickly you would have something like this you'd have
[00:56:11] the derivative with respect to each weight w J is just the derivative of
[00:56:16] this like
[00:56:20] that so I'm going to take this equation
[00:56:22] star and I'm just going to take the
[00:56:24] derivative of it which is going to bring down a factor of
[00:56:47] two so we're just going to take the
[00:56:49] two so we're just going to take the derivative of this and essentially that
[00:56:50] derivative of this and essentially that just means we're going to have to take
[00:56:51] just means we're going to have to take the derivative through our Q function
[00:56:54] the derivative through our Q function representation like using autodiff for deep
[00:56:57] representation like using autodiff for deep networks and then we can use this to
[00:56:59] networks and then we can use this to update our our
[00:57:01] update our our weights all right so we'll do stochastic
[00:57:03] weights all right so we'll do stochastic gradient descent to do
[00:57:05] gradient descent to do this and the main thing is that that's
[00:57:09] this and the main thing is that that's what we're going to be doing to plug in
[00:57:11] what we're going to be doing to plug in um in order to do policy evaluation or
[00:57:13] um in order to do policy evaluation or to do
[00:57:15] to do control so of course in general we don't
[00:57:17] control so of course in general we don't have those we don't have for each state
[00:57:19] have those we don't have for each state in action what the Q value was if it was
[00:57:21] in action what the Q value was if it was we wouldn't need to do any learning we
[00:57:23] we wouldn't need to do any learning we need to learn that from data and so the
[00:57:26] need to learn that from data and so the idea is that we're going to do model
[00:57:28] idea is that we're going to do model free state action value function
[00:57:30] free state action value function approximation so just like what we've
[00:57:32] approximation so just like what we've been seeing before we we're doing model
[00:57:34] been seeing before we we're doing model free state action value function um now
[00:57:38] free state action value function um now we're going to actually do that but just
[00:57:39] we're going to actually do that but just do an approximation we're instead of
[00:57:40] do an approximation we're instead of writing it down as a table we're going
[00:57:42] writing it down as a table we're going to write it down with these parameters
[00:57:43] to write it down with these parameters function parameterized
[00:57:45] function parameterized functions okay
[00:57:49] functions okay so the idea now is like similarly we
[00:57:52] so the idea now is like similarly we just saw before all these methods either
[00:57:54] just saw before all these methods either where we use Monte Carlo methods or
[00:57:56] where we use Monte Carlo methods or difference methods to try to do these
[00:57:58] difference methods to try to do these approximations um now what we're going
[00:58:00] approximations um now what we're going to do is that when we do the estimate
[00:58:02] to do is that when we do the estimate update step we're also going to fit the
[00:58:04] update step we're also going to fit the function
[00:58:05] function approximator so just like in the
[00:58:07] approximator so just like in the algorithms we saw before where we do
[00:58:09] algorithms we saw before where we do like policy evaluation and policy
[00:58:10] like policy evaluation and policy Improvement now when we do the policy
[00:58:12] Improvement now when we do the policy evaluation we're also going to just
[00:58:13] evaluation we're also going to just refit like our whole Q function for
[00:58:18] refit like our whole Q function for example okay so let's see how that could
[00:58:20] example okay so let's see how that could work so for Monte Carlo value function
[00:58:23] work so for Monte Carlo value function approximation we're going to remember
[00:58:25] approximation we're going to remember that our return G is an unbiased but
[00:58:27] that our return G is an unbiased but noisy sample of the expected return so
[00:58:30] noisy sample of the expected return so we can think of us having like this
[00:58:32] we can think of us having like this state action return State action return
[00:58:35] state action return State action return Etc and so you can substitute in those
[00:58:39] Etc and so you can substitute in those G's for the True Q Pi when you're doing
[00:58:42] G's for the True Q Pi when you're doing your
[00:58:44] fitting so let's see what that would
[00:58:46] fitting so let's see what that would look like so in this
[00:58:49] look like so in this case remember what we would like here
[00:58:51] case remember what we would like here when we're doing our function
[00:58:52] when we're doing our function approximation is that this is the real Q
[00:58:56] approximation is that this is the real Q of the policy but we don't know what the
[00:58:58] of the policy but we don't know what the real Q value is so we're
[00:59:00] real Q value is so we're just going to plug in our observed
[00:59:03] just going to plug in our observed return so you know we
[00:59:06] return so you know we want
[00:59:07] want would like
[00:59:10] would like Q of
[00:59:13] Q of sa but we don't have that so we're going
[00:59:15] sa but we don't have that so we're going to plug in the return that we just
[00:59:18] to plug in the return that we just observed and then we'll just do the
[00:59:20] observed and then we'll just do the derivative we'll be plugging that in for
[00:59:21] derivative we'll be plugging that in for our derivative and then we update our
[00:59:22] our derivative and then we update our weights using that derivative with
[00:59:24] weights using that derivative with respect to minimizing the mean squared error
[00:59:28] mean squared error so this would just be for
[00:59:31] mean squared error so this would just be for policy evaluation if you have a fixed
[00:59:32] policy evaluation if you have a fixed policy you would just do this at each
[00:59:35] policy you would just do this at each time point so after you see um after you
[00:59:37] time point so after you see um after you get a return then you would update your
[00:59:40] get a return then you would update your Q function you do this many many
[00:59:42] Q function you do this many many times and for some of you this might
[00:59:44] times and for some of you this might start to look redundant but I think it's
[00:59:45] start to look redundant but I think it's just useful to see that essentially the
[00:59:47] just useful to see that essentially the structure of all of these algorithms
[00:59:49] structure of all of these algorithms whether it is policy evaluation or
[00:59:51] whether it is policy evaluation or tabular or function approximation is
[00:59:53] tabular or function approximation is extremely similar we are just either
[00:59:56] extremely similar we are just either sampling an episode or sampling a tle we
[00:59:59] sampling an episode or sampling a tle we are going to do one step which is like
[01:00:00] are going to do one step which is like policy evaluation where we update our
[01:00:02] policy evaluation where we update our estimate of the Q function maybe
[01:00:04] estimate of the Q function maybe optionally do function approximation
[01:00:06] optionally do function approximation fitting and then we're going to use that
[01:00:08] fitting and then we're going to use that to figure out how to act next if we
[01:00:11] to figure out how to act next if we doing control we'll see an example of
[01:00:13] doing control we'll see an example of that
[01:00:14] that shortly okay so that is Monte
[01:00:17] shortly okay so that is Monte Carlo okay
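A minimal sketch of that Monte Carlo fitting step (my notation, not the lecture's; a linear `q_hat = w . phi` stands in for any function approximator):

```python
def discounted_return(rewards, gamma=0.99):
    """G_t = r_t + gamma*r_{t+1} + ... computed backwards over an episode."""
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

def mc_update(w, phi, rewards, gamma=0.99, lr=0.05):
    """Monte Carlo policy evaluation with a linear q_hat = w . phi:
    the observed return G is an unbiased but noisy stand-in for the
    unknown true Q^pi(s, a) in the mean-squared-error gradient step."""
    G = discounted_return(rewards, gamma)
    q_hat = sum(wi * xi for wi, xi in zip(w, phi))
    return [wi + lr * (G - q_hat) * xi for wi, xi in zip(w, phi)]
```

Repeating this over many (state, action, return) tuples is the "do this many many times" loop she describes.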
[01:00:23] oops okay all right for temporal difference
[01:00:26] oops okay all right for temporal difference learning it's very similar but now we
[01:00:28] learning it's very similar but now we are going to um have this weighted sum
[01:00:31] are going to um have this weighted sum where we plug
[01:00:32] where we plug in we bootstrap so we plug in our
[01:00:34] in we bootstrap so we plug in our current estimate of the value of S Prime
[01:00:37] current estimate of the value of S Prime so this is the same update we saw before
[01:00:39] so this is the same update we saw before this was for tabular cases and now we're
[01:00:42] this was for tabular cases and now we're going to do it for function
[01:00:45] going to do it for function approximation okay so let's first just
[01:00:48] approximation okay so let's first just see how we do it for function approximation
[01:00:50] see how we do it for function approximation it's just useful I think when we look at
[01:00:51] it's just useful I think when we look at this to think about all the different
[01:00:52] this to think about all the different ways we're doing
[01:00:54] ways we're doing approximations we are sampling to
[01:00:56] approximations we are sampling to approximate the expected value over the
[01:00:58] approximate the expected value over the next state we are bootstrapping to plug
[01:01:00] next state we are bootstrapping to plug in what the value of those states are
[01:01:02] in what the value of those states are and now we're also going to do function
[01:01:03] and now we're also going to do function approximation because we're going to
[01:01:05] approximation because we're going to represent the value of a function with
[01:01:07] represent the value of a function with some weights okay so we're going to have
[01:01:09] some weights okay so we're going to have these
[01:01:11] these weights all
[01:01:13] weights all right and again we can just do
[01:01:15] right and again we can just do stochastic gradient descent to fit our
[01:01:17] stochastic gradient descent to fit our weight function to represent that value
[01:01:22] function okay
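The three approximations she lists (sampling the next state, bootstrapping its value, and representing the value with weights) all show up in one line of a semi-gradient TD(0) step; a hedged sketch with a linear value function (names are mine, not from the lecture):

```python
def td0_step(w, phi_s, r, phi_next, done, gamma=0.99, lr=0.05):
    """Semi-gradient TD(0) with a linear v_hat(s) = w . phi(s):
    the target bootstraps with the current estimate of the next state
    (zero at terminal states) and is treated as a constant, so the
    gradient only flows through v_hat of the current state."""
    v = lambda phi: sum(wi * xi for wi, xi in zip(w, phi))
    target = r + (0.0 if done else gamma * v(phi_next))
    return [wi + lr * (target - v(phi_s)) * xi for wi, xi in zip(w, phi_s)]
```

At a terminal state the bootstrap term drops out, which is the restart case in the algorithm she shows next.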
[01:01:27] so you'll get something like this where
[01:01:29] so you'll get something like this where as long as you're if you're in a
[01:01:30] as long as you're if you're in a terminal State you'll restart the
[01:01:32] terminal State you'll restart the episode otherwise you'll just be doing
[01:01:34] episode otherwise you'll just be doing these Computing the gradient with
[01:01:36] these Computing the gradient with respect to your minimizing your mean
[01:01:38] respect to your minimizing your mean sored eror and updating your
[01:01:43] weights and then the then we're we'll
[01:01:47] weights and then the then we're we'll see how we do this for
[01:01:49] see how we do this for control so for control it's going to be
[01:01:51] control so for control it's going to be very
[01:01:53] very similar now what we'll make sure to do
[01:01:55] similar now what we'll make sure to do is we're always going to be using the Q
[01:01:56] is we're always going to be using the Q function instead of the value
[01:01:58] function instead of the value function and we're now often going to be
[01:02:01] function and we're now often going to be doing off policy learning again like Q
[01:02:04] doing off policy learning again like Q learning so we'll again do stochastic
[01:02:07] learning so we'll again do stochastic gradient descent with respect to our Q
[01:02:09] gradient descent with respect to our Q function we're going to sample the
[01:02:10] function we're going to sample the gradient and we'll have very you know an
[01:02:12] gradient and we'll have very you know an algorithm is very similar to the one
[01:02:14] algorithm is very similar to the one we've seen
[01:02:15] we've seen before so we can either use sarsa where
[01:02:20] before so we can either use sarsa where we have our Q function where we always
[01:02:22] we have our Q function where we always plug in what is the actual action we
[01:02:24] plug in what is the actual action we took next or we can have q learning
[01:02:27] took next or we can have q learning where we plug in a Max over the next q
[01:02:32] function and raise your hand if you've
[01:02:34] function and raise your hand if you've implemented deep Q learning
[01:02:37] implemented deep Q learning before okay so 1% but most people not
[01:02:40] before okay so 1% but most people not yeah okay so you can imagine in
[01:02:41] yeah okay so you can imagine in general this is any form of function
[01:02:43] general this is any form of function approximator but often this is going to
[01:02:45] approximator but often this is going to be like a deep neural
[01:02:47] be like a deep neural network
[01:02:49] network okay now one thing I just want to
[01:02:51] okay now one thing I just want to highlight here is that again just sort
[01:02:53] highlight here is that again just sort of being in terms of being concerned
[01:02:55] of being in terms of being concerned whether all this is going to work
[01:02:57] whether all this is going to work there's a lot of approximations that are
[01:02:59] there's a lot of approximations that are happening here so particularly for Q
[01:03:01] happening here so particularly for Q learning and it's led to what Sarto um
[01:03:04] learning and it's led to what Sarto um the authors of the book that is optional
[01:03:06] the authors of the book that is optional textbook for the class called The Deadly
[01:03:09] textbook for the class called The Deadly Triad and what they say is that if you
[01:03:11] Triad and what they say is that if you are doing bootstrapping meaning that
[01:03:13] are doing bootstrapping meaning that you're plugging in an estimate of what
[01:03:15] you're plugging in an estimate of what is the value of the next state and
[01:03:17] is the value of the next state and you're doing function approximation like
[01:03:19] you're doing function approximation like you're using like a deep neural network
[01:03:20] you're using like a deep neural network or linear function and you're doing off
[01:03:23] or linear function and you're doing off policy learning where you acting in a
[01:03:25] policy learning where you acting in a different way than the the data you're
[01:03:26] different way than the the data you're getting under those cases you may not
[01:03:28] getting under those cases you may not converge at
[01:03:30] converge at all like you just your Q function May
[01:03:33] all like you just your Q function May oscillate you may not converge to um
[01:03:35] oscillate you may not converge to um anything and um you are certainly not
[01:03:38] anything and um you are certainly not guaranteed to converge to
[01:03:40] guaranteed to converge to qar so it's just good to keep in mind
[01:03:44] qar so it's just good to keep in mind that that could occur um I think for
[01:03:47] that that could occur um I think for some intuition for why this can occur
[01:03:49] some intuition for why this can occur the Bellman operator if you think back a
[01:03:51] the Bellman operator if you think back a couple lectures ago we proved as a
[01:03:52] couple lectures ago we proved as a contraction meaning that as we apply it
[01:03:54] contraction meaning that as we apply it repeatedly we went to this fix point in
[01:03:57] repeatedly we went to this fix point in the tabular setting but the problem is
[01:04:00] the tabular setting but the problem is is that like when you do a bman backup
[01:04:01] is that like when you do a bman backup that operator is a contraction meaning
[01:04:03] that operator is a contraction meaning that if you apply the bment operator two
[01:04:05] that if you apply the bment operator two different things their distance gets
[01:04:06] different things their distance gets smaller
[01:04:07] smaller afterwards value function approximation
[01:04:10] afterwards value function approximation fitting can be an expansion which means
[01:04:12] fitting can be an expansion which means if you take two things and then you try
[01:04:14] if you take two things and then you try to do value function approximation like
[01:04:16] to do value function approximation like you fit a line to this one and a line to
[01:04:17] you fit a line to this one and a line to this one the distance between two points
[01:04:20] this one the distance between two points afterwards can actually get bigger than
[01:04:21] afterwards can actually get bigger than before you did the value function
[01:04:23] before you did the value function approximation so there's a really
[01:04:25] approximation so there's a really beautiful example of this in a paper by
[01:04:26] beautiful example of this in a paper by Geoff Gordon um from 1995 I will just
[01:04:30] Geoff Gordon um from 1995 I will just Geoff
[01:04:32] Geoff Gordon 1995 has a really nice example of
[01:04:36] Gordon 1995 has a really nice example of this where you just can kind of visually
[01:04:37] this where you just can kind of visually see when you have these two functions
[01:04:39] see when you have these two functions and these like points that after you do
[01:04:41] and these like points that after you do this value function approximation you've
[01:04:43] this value function approximation you've actually made the distance between them
[01:04:45] actually made the distance between them bigger and so that means that you have
[01:04:47] bigger and so that means that you have this thing where you're kind of
[01:04:48] this thing where you're kind of alternating between something which you
[01:04:50] alternating between something which you know is a contraction and driving you um
[01:04:52] know is a contraction and driving you um driving you towards a fixed point and
[01:04:54] driving you towards a fixed point and something which might actually amplify
[01:04:56] something which might actually amplify distance differences and so because of
[01:04:58] distance differences and so because of that it's not always the case that
[01:04:59] that it's not always the case that you're guaranteed to converge to a fixed
[01:05:01] you're guaranteed to converge to a fixed point so this is something important to
[01:05:04] point so this is something important to know however I think it's also kind of a
[01:05:06] know however I think it's also kind of a um it's an important part of the history
[01:05:09] um it's an important part of the history of the field in that in the 1990s there
[01:05:11] of the field in that in the 1990s there was a bunch of work showing that this
[01:05:13] was a bunch of work showing that this could occur that like even with some
[01:05:15] could occur that like even with some really simple settings like linear value
[01:05:17] really simple settings like linear value function approximators you just
[01:05:19] function approximators you just approximate things with a line um that
[01:05:21] approximate things with a line um that sometimes you could get these kind of
[01:05:22] sometimes you could get these kind of oscillations or lack of convergence and
[01:05:24] oscillations or lack of convergence and so people were really concerned about
[01:05:26] so people were really concerned about using function approximators with
[01:05:27] using function approximators with reinforcement learning but then what
[01:05:30] reinforcement learning but then what happened is that Deep Mind showed well
[01:05:32] happened is that Deep Mind showed well actually there are some ways to tackle
[01:05:34] actually there are some ways to tackle this and we can do really amazing things
[01:05:36] this and we can do really amazing things with it and so I think it's a useful
[01:05:38] with it and so I think it's a useful like a useful lesson from history over
[01:05:40] like a useful lesson from history over the difference between like what can
[01:05:42] the difference between like what can occur in maybe some sort of hard you
[01:05:44] occur in maybe some sort of hard you know not ideal cases versus what
[01:05:47] know not ideal cases versus what actually occurs in practice and so we
[01:05:48] actually occurs in practice and so we shouldn't let like sort of some of the
[01:05:51] shouldn't let like sort of some of the negative examples limit us from
[01:05:53] negative examples limit us from considering what might work in some
[01:05:54] considering what might work in some other scenarios
[01:05:56] other scenarios so let's see that now let's see
[01:05:59] so let's see that now let's see dqn okay
[01:06:01] dqn okay so the idea with dqn is we're going to
[01:06:04] so the idea with dqn is we're going to use these ideas to actually play Atari
[01:06:06] use these ideas to actually play Atari so we're going to take in images of the
[01:06:07] so we're going to take in images of the game we're going to use convolutional
[01:06:09] game we're going to use convolutional neural networks and we're going to um
[01:06:11] neural networks and we're going to um have a really big deep neural network to
[01:06:13] have a really big deep neural network to represent the Q function and do Q
[01:06:15] represent the Q function and do Q learning with
[01:06:16] learning with it okay so the idea was well we knew
[01:06:21] it okay so the idea was well we knew that sometimes like Q learning with
[01:06:22] that sometimes like Q learning with value function approximation can diverge
[01:06:25] and there's a number of different issues but
[01:06:27] and there's a number of different issues but one of them is kind of this stability
[01:06:29] one of them is kind of this stability thing so we know that there's
[01:06:30] thing so we know that there's correlations between samples your data
[01:06:32] correlations between samples your data is not IID which is what you would
[01:06:34] is not IID which is what you would normally want for when you're doing
[01:06:35] normally want for when you're doing function approximation and the other is
[01:06:38] function approximation and the other is that you have this kind of
[01:06:38] that you have this kind of non-stationary Target thing which is
[01:06:41] non-stationary Target thing which is like when you plug in say with TD
[01:06:43] like when you plug in say with TD learning you're plugging in gamma plus
[01:06:45] learning you're plugging in gamma plus sorry R plus gamma times the value of
[01:06:48] sorry R plus gamma times the value of your next state and that value of the
[01:06:50] your next state and that value of the next state is constantly changing as you
[01:06:51] next state is constantly changing as you get more
[01:06:52] get more data so what dqn did is they said well
[01:06:56] data so what dqn did is they said well what we're going to do is we're going to
[01:06:57] what we're going to do is we're going to use experience Replay in particular
[01:06:59] use experience Replay in particular we're going to reuse tuples over time
[01:07:02] we're going to reuse tuples over time and we're also going to get fixed Q
[01:07:03] and we're also going to get fixed Q targets and both of those things ended
[01:07:05] targets and both of those things ended up making a really big
[01:07:07] up making a really big difference particularly one of them
[01:07:09] difference particularly one of them we'll see in a second so the idea of
[01:07:11] we'll see in a second so the idea of experience replay is to say in general
[01:07:14] experience replay is to say in general if I think about um states that are
[01:07:16] if I think about um states that are nearby their Q function might be pretty
[01:07:19] nearby their Q function might be pretty similar and if I'm doing lots of updates
[01:07:21] similar and if I'm doing lots of updates that's breaking my IID stuff that I want
[01:07:23] that's breaking my IID stuff that I want for my function approximation
[01:07:26] for my function approximation so another thing you could do is just
[01:07:27] so another thing you could do is just have a replay buffer of lots of all the
[01:07:29] have a replay buffer of lots of all the different tuples you've seen in the past
[01:07:31] different tuples you've seen in the past and you just sample from one of those
[01:07:34] and you just sample from one of those and then compute a Target value and then
[01:07:36] and then compute a Target value and then do stochastic gradient
[01:07:38] do stochastic gradient descent and this might be really helpful
[01:07:41] descent and this might be really helpful anyway just in terms of data efficiency
[01:07:43] anyway just in terms of data efficiency because it means that instead of like
[01:07:44] because it means that instead of like taking your data and using it once and
[01:07:46] taking your data and using it once and then throwing it away you keep it and
[01:07:48] then throwing it away you keep it and then you can replay it just like how we
[01:07:50] then you can replay it just like how we talked about sort of batch learning last
[01:07:52] talked about sort of batch learning last time so an experience replay can be
[01:07:55] time so an experience replay can be useful because we're both replaying our
[01:07:56] useful because we're both replaying our data and so we can sort of squeeze more
[01:07:58] data and so we can sort of squeeze more information out of it and also we can
[01:08:00] information out of it and also we can select from very different parts of the
[01:08:02] select from very different parts of the past history which makes those updates
[01:08:04] past history which makes those updates more
[01:08:05] more independent okay so this is um and in
[01:08:08] independent okay so this is um and in general we're not going to keep the
[01:08:09] general we're not going to keep the buffer for all time we might keep like
[01:08:11] buffer for all time we might keep like the last million episodes or things like
[01:08:13] the last million episodes or things like that okay so that's one thing we could
[01:08:16] that okay so that's one thing we could do um now the other thing is that if we
[01:08:21] do um now the other thing is that if we think about what's happening in this
[01:08:22] think about what's happening in this case the way we change the weights is
[01:08:25] case the way we change the weights is going to be um in general the weights
[01:08:30] going to be um in general the weights appear here and here and here so this
[01:08:34] appear here and here and here so this target value is a function of the
[01:08:36] target value is a function of the weights itself because you're using a
[01:08:37] weights itself because you're using a value function approximation to
[01:08:39] value function approximation to represent the value of your next state
[01:08:41] represent the value of your next state and so the problem is that in general
[01:08:43] and so the problem is that in general this is going to change on your next
[01:08:45] this is going to change on your next update because you've just changed your
[01:08:47] update because you've just changed your your
[01:08:47] your weights um and this can also lead to
[01:08:50] weights um and this can also lead to instabilities because if you think of
[01:08:52] instabilities because if you think of sort of supervised learning you know
[01:08:54] sort of supervised learning you know your XY pairs your y is changing for
[01:08:56] your XY pairs your y is changing for even for the same X over time because
[01:08:58] even for the same X over time because you're changing your Q
[01:09:00] you're changing your Q function okay because this is a function
[01:09:02] function okay because this is a function of the weight and so as the weights
[01:09:04] of the weight and so as the weights change this sort of Target value is
[01:09:06] change this sort of Target value is going to change even for the same input
[01:09:09] going to change even for the same input so the second idea is to have fixed Q
[01:09:12] so the second idea is to have fixed Q updates and what the idea here is and so
[01:09:14] updates and what the idea here is and so remember this is like when we say the
[01:09:16] remember this is like when we say the target weight this is it's going to be
[01:09:18] target weight this is it's going to be what we're using for Target weights is
[01:09:21] what we're using for Target weights is that the weights or the
[01:09:23] that the weights or the parameters we're using to estimate
[01:09:26] parameters we're using to estimate the value of the next state we reach we
[01:09:28] the value of the next state we reach we are going to not update those as
[01:09:30] are going to not update those as much so we're going to have our Target
[01:09:33] much so we're going to have our Target network using a different set of Weights
[01:09:34] network using a different set of Weights than the weights that are being updated
[01:09:36] than the weights that are being updated so you can see here that we have a w
[01:09:38] so you can see here that we have a w minus meaning that like we're trying to
[01:09:40] minus meaning that like we're trying to make this more like supervised learning
[01:09:42] make this more like supervised learning where we have a fixed output y that is
[01:09:44] where we have a fixed output y that is not changing um while we're trying to
[01:09:46] not changing um while we're trying to update our
[01:09:48] update our W and so if you think about sort of the
[01:09:50] W and so if you think about sort of the example we want to just sort of draw it
[01:09:52] example we want to just sort of draw it like this here's our states here's our Q
[01:09:55] like this here's our states here's our Q function
[01:09:56] function um right now we'd like to sort of make
[01:09:58] um right now we'd like to sort of make sure that these points when we're like
[01:10:00] sure that these points when we're like trying to fit a line that those y's are
[01:10:02] trying to fit a line that those y's are not changing a lot while we're trying to
[01:10:03] not changing a lot while we're trying to fit the line and in general because
[01:10:06] fit the line and in general because they're a function of the weights
[01:10:07] they're a function of the weights themselves they might be moving and
[01:10:08] themselves they might be moving and perturbing and so what we're saying is
[01:10:10] perturbing and so what we're saying is no we're going to fix these so you can
[01:10:12] no we're going to fix these so you can think of this as just being you know a
[01:10:14] think of this as just being you know a fixed number for a while and then do
[01:10:16] fixed number for a while and then do multiple updates on this W to try to fit
[01:10:18] multiple updates on this W to try to fit that
[01:10:21] function and so what that means is we
[01:10:24] function and so what that means is we just have to uh we we keep around these
[01:10:26] just have to uh we we keep around these Target weights and we keep around um the
[01:10:28] Target weights and we keep around um the other
[01:10:30] other weights and this allows us to do this is
[01:10:32] weights and this allows us to do this is what is called the fixed Q
[01:10:34] what is called the fixed Q updating so if you think about what the
[01:10:37] updating so if you think about what the pseudo code would look like in this
[01:10:47] case is it's going to look pretty
[01:10:49] case is it's going to look pretty similar to the things we've seen you're
[01:10:50] similar to the things we've seen you're going to sample an action you're going
[01:10:51] going to sample an action you're going to observe r and the next state
[01:10:52] to observe r and the next state you're going to store the transition in
[01:10:54] you're going to store the transition in a replay buffer so you're going to keep
[01:10:56] a replay buffer so you're going to keep track of it then you're going to sample
[01:10:57] track of it then you're going to sample a random mini batch of tuples from the
[01:10:59] a random mini batch of tuples from the past you're going to do something you
[01:11:01] past you're going to do something you know keep track of if episodes
[01:11:03] know keep track of if episodes terminated otherwise you're going to say
[01:11:05] terminated otherwise you're going to say my target y that I'm going to try to fit
[01:11:07] my target y that I'm going to try to fit in my function approximator is my
[01:11:09] in my function approximator is my immediate reward plus the max over
[01:11:11] immediate reward plus the max over actions of my Q function with my
[01:11:13] actions of my Q function with my target
[01:11:14] target weights so I use my deep neural network
[01:11:16] weights so I use my deep neural network to predict the value of that state
[01:11:18] to predict the value of that state action pair and then I'm going to do
[01:11:20] action pair and then I'm going to do gradient descent on the difference
[01:11:21] gradient descent on the difference between these predicted y's and my
[01:11:24] between these predicted y's and my current estimate my current weights so
[01:11:26] current estimate my current weights so this is just the function fitting
[01:11:28] this is just the function fitting part and then you repeat this and then
[01:11:31] part and then you repeat this and then periodically you update your target
[01:11:33] periodically you update your target weights okay and I just want to
[01:11:36] weights okay and I just want to highlight here there's a bunch of
[01:11:38] highlight here there's a bunch of different choices to be made you have to
[01:11:39] different choices to be made you have to decide like what function approximator
[01:11:41] decide like what function approximator you're using you're using a deep neural
[01:11:42] you're using you're using a deep neural network um what's your learning rate how
[01:11:45] network um what's your learning rate how often to update the target weight uh how
[01:11:48] often to update the target weight uh how big should your replay buffer be there's
[01:11:49] big should your replay buffer be there's a lot of different choices that you have
[01:11:51] a lot of different choices that you have to
[01:11:54] make all right let's just take a quick
[01:11:57] make all right let's just take a quick second here for this part to give a
[01:11:59] second here for this part to give a chance to think about this
[01:12:01] chance to think about this part you may have just seen the answer but
[01:12:03] part you may have just seen the answer but that's okay um which is okay in dqn
[01:12:07] that's okay um which is okay in dqn we're going to compute the target value
[01:12:08] we're going to compute the target value for the sampled State action reward next
[01:12:09] for the sampled State action reward next States using a separate set of Target
[01:12:12] States using a separate set of Target weights so does that change the
[01:12:14] weights so does that change the computation time does it change the
[01:12:16] computation time does it change the memory requirements um are you not
[01:12:20] memory requirements um are you not sure put that in here
[01:12:26] we're now going to maintain two
[01:12:28] we're now going to maintain two different sets of weights to do our
[01:12:29] different sets of weights to do our function
[01:12:54] approximation
[01:13:30] right um yep I see almost everyone converged to
[01:13:33] right um yep I see almost everyone converged to the right answer very quickly it is
[01:13:36] the right answer very quickly it is doubling the memory requirements so you
[01:13:37] doubling the memory requirements so you have to keep track of a second set of
[01:13:39] have to keep track of a second set of parameters um it does not change the
[01:13:42] parameters um it does not change the computation time just changes the
[01:13:45] computation time just changes the memory requirements so we just keep
[01:13:47] memory requirements so we just keep around two copies of your deep
[01:13:48] around two copies of your deep neural network one with the old weights
[01:13:49] neural network one with the old weights one with the new ones um and then the
[01:13:51] one with the new ones um and then the Q updating with respect to that is
[01:13:53] Q updating with respect to that is the same
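Putting the pieces together, the loop just described (replay buffer, separate target weights, periodic sync) can be sketched as below. The tiny linear Q-function and the toy one-dimensional transitions are illustrative assumptions standing in for the convolutional network and Atari frames, but the structure follows the pseudo code above:

```python
import random

# Minimal DQN-style loop: replay buffer + fixed Q-targets. The toy
# environment (2 actions, random 1-D states, constant reward) is an
# illustrative assumption so the sketch runs without a deep learning
# library; a real agent would use a neural network and real transitions.

GAMMA, ALPHA, N_ACTIONS = 0.9, 0.01, 2

def q(w, s, a):
    # linear Q-function: one slope and one bias per action
    return w[a][0] * s + w[a][1]

def dqn_step(w, w_target, buffer, batch_size=4):
    """One update pass over a mini-batch sampled from the replay buffer."""
    batch = random.sample(buffer, min(batch_size, len(buffer)))
    for (s, a, r, s_next, done) in batch:
        # target uses the *frozen* weights w_target (fixed Q-target)
        y = r if done else r + GAMMA * max(
            q(w_target, s_next, b) for b in range(N_ACTIONS))
        td_err = y - q(w, s, a)
        # SGD on the squared error; only the online weights w move
        w[a][0] += ALPHA * td_err * s
        w[a][1] += ALPHA * td_err
    return w

random.seed(0)
w = [[0.0, 0.0] for _ in range(N_ACTIONS)]   # online weights
w_target = [row[:] for row in w]             # target weights (a copy)
buffer = []

for t in range(500):
    s = random.uniform(-1, 1)                # toy transition
    a = random.randrange(N_ACTIONS)
    buffer.append((s, a, 1.0, s + 0.1, False))
    buffer = buffer[-200:]                   # bounded replay buffer
    w = dqn_step(w, w_target, buffer)
    if t % 50 == 0:                          # periodically sync the targets
        w_target = [row[:] for row in w]

print(w)  # online weights after training
```

Swapping the linear q for a deep network and adding an epsilon-greedy behavior policy gives exactly the knobs mentioned above: learning rate, target-update frequency, and replay buffer size.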
[01:13:55] the same all right let's see what that actually
[01:13:56] all right let's see what that actually does so the kind of the key Innovations
[01:13:58] does so the kind of the key Innovations for dqn where we are going to use deep
[01:14:01] for dqn where we are going to use deep neural networks that had been done
[01:14:02] neural networks that had been done before but um not with I think this is
[01:14:04] before but um not with I think this is the first really big example with
[01:14:05] the first really big example with convolutional neural
[01:14:07] convolutional neural networks um it's going to maintain these
[01:14:10] networks um it's going to maintain these really large episodic replays and then
[01:14:13] really large episodic replays and then it is also going to have these fixed
[01:14:15] it is also going to have these fixed targets all right so what they have here
[01:14:18] targets all right so what they have here is they're going to do these series of
[01:14:20] is they're going to do these series of convolutions output and they're going to
[01:14:22] convolutions output and they're going to output a q value for each action and
[01:14:25] output a q value for each action and they're going to use that to make
[01:14:26] they're going to use that to make decisions and I think one of the things
[01:14:29] decisions and I think one of the things well there's multiple really remarkable
[01:14:30] well there's multiple really remarkable things about this paper one is that they
[01:14:33] things about this paper one is that they got extremely good performance across a
[01:14:34] got extremely good performance across a really wide set of games so instead of
[01:14:36] really wide set of games so instead of only having a few Benchmark tasks they
[01:14:38] only having a few Benchmark tasks they looked at the whole Suite of performance
[01:14:40] looked at the whole Suite of performance they are learning a different policy per
[01:14:42] they are learning a different policy per video game but it is the same neural
[01:14:44] video game but it is the same neural network architecture and I believe all
[01:14:46] network architecture and I believe all the same hyperparameters too so the idea
[01:14:48] the same hyperparameters too so the idea with that is to say like could we
[01:14:50] with that is to say like could we actually have sort of the same type of
[01:14:51] actually have sort of the same type of architecture in the same way that we
[01:14:53] architecture in the same way that we don't swap brains when we do different
[01:14:54] don't swap brains when we do different tasks
[01:14:55] tasks um and but have the same learning
[01:14:57] um and but have the same learning algorithm learn to be able to do many
[01:14:59] algorithm learn to be able to do many different types of tasks and so I think
[01:15:01] different types of tasks and so I think that was pretty impressive that they
[01:15:03] that was pretty impressive that they showed that that was possible so you
[01:15:04] showed that that was possible so you have the same algorithm same hyper
[01:15:06] have the same algorithm same hyper parameters but it could you know learn
[01:15:07] parameters but it could you know learn to do well in many different
[01:15:10] to do well in many different tasks I think one of the interesting
[01:15:12] tasks I think one of the interesting things about the paper is to consider
[01:15:14] things about the paper is to consider what were the aspects that were
[01:15:15] what were the aspects that were important for Success so here's just um
[01:15:18] important for Success so here's just um a subset of algorithms uh sorry subset
[01:15:20] a subset of algorithms uh sorry subset of the domains this is six this is a few
[01:15:22] of the domains this is six this is a few of the games and they also compare to
[01:15:25] of the games and they also compare to using a much more simple function
[01:15:27] using a much more simple function approximator and what you can see here
[01:15:29] approximator and what you can see here is that the Deep neural network is not
[01:15:32] is that the Deep neural network is not actually better right like the Deep
[01:15:33] actually better right like the Deep neural network does not look better than
[01:15:36] neural network does not look better than um uh than the linear case so it's not
[01:15:39] um uh than the linear case so it's not clear that just using a more powerful function
[01:15:41] clear that just using a more powerful function approximator um it wasn't just that they
[01:15:43] approximator um it wasn't just that they used a much more careful function
[01:15:47] approximator and the second thing was
[01:15:49] approximator and the second thing was whether they used this fixed q and that
[01:15:52] whether they used this fixed q and that helped um so you can see now that they
[01:15:55] helped um so you can see now that they are exceeding the performance of using a
[01:15:57] are exceeding the performance of using a more simple function approximator so
[01:15:59] more simple function approximator so this idea of kind of keeping things
[01:16:01] this idea of kind of keeping things stable is helpful um in terms of
[01:16:04] stable is helpful um in terms of oscillations but using the
[01:16:07] oscillations but using the replay was incredibly helpful so they
[01:16:10] replay was incredibly helpful so they went from like you know three or 10 up
[01:16:12] went from like you know three or 10 up to
[01:16:13] to 241 um or in some you know something
[01:16:16] 241 um or in some you know something from either roughly three times as good
[01:16:18] from either roughly three times as good and sometimes even more like a couple
[01:16:19] and sometimes even more like a couple orders of
[01:16:20] orders of magnitude so it was incredibly helpful
[01:16:23] magnitude so it was incredibly helpful to use an experience replay buffer and
[01:16:26] to use an experience replay buffer and maybe this isn't so surprising because
[01:16:28] maybe this isn't so surprising because it means that they are just reusing
[01:16:29] it means that they are just reusing their data a lot um but it was really
[01:16:33] their data a lot um but it was really you know incredibly important I think
[01:16:35] you know incredibly important I think that's really helpful to motivate why
[01:16:36] that's really helpful to motivate why thinking about sample efficiency and
[01:16:38] thinking about sample efficiency and reusing your data is helpful um and then
[01:16:41] reusing your data is helpful um and then combining these ideas led to even bigger
[01:16:43] combining these ideas led to even bigger benefits so it was helpful to have both
[01:16:45] benefits so it was helpful to have both the fixed targets and the replay buffer
[01:16:47] the fixed targets and the replay buffer but if you could only pick one the
[01:16:48] but if you could only pick one the replay buffer was just enormously
[01:16:52] replay buffer was just enormously helpful all right so as you guys know um
[01:16:56] helpful all right so as you guys know um uh there's been an enormous amount of
[01:16:57] uh there's been an enormous amount of interest in reinforcement learning and
[01:16:58] interest in reinforcement learning and deep reinforcement learning since there
[01:17:00] deep reinforcement learning since there was some immediate improvements kind of
[01:17:02] was some immediate improvements kind of within the next year or two um one is
[01:17:05] within the next year or two um one is called double dqn um uh and that also
[01:17:09] called double dqn um uh and that also it's a very simple change just maybe one
[01:17:11] it's a very simple change just maybe one or two lines and it does increase some
[01:17:14] or two lines and it does increase some of the requirements for uh memory
[01:17:17] of the requirements for uh memory but it is a really helpful approach so
[01:17:21] but it is a really helpful approach so um it tries to deal with the fact that
[01:17:22] um it tries to deal with the fact that you can get some interesting
[01:17:23] you can get some interesting maximization bias issues and happy to
[01:17:26] maximization bias issues and happy to talk about that offline um but so there
[01:17:28] talk about that offline um but so there was a few different immediate next
[01:17:29] was a few different immediate next algorithms uh but then there's been
[01:17:31] algorithms uh but then there's been enormous amount of work since and I
[01:17:32] enormous amount of work since and I think it really led to huge excitement
[01:17:34] think it really led to huge excitement in how we could couple these with really
[01:17:36] in how we could couple these with really um impressive function
[01:17:38] um impressive function approximators so just to summarize um
[01:17:41] approximators so just to summarize um the things that you should understand is
[01:17:42] the things that you should understand is to be able to implement TD(0) and Monte
[01:17:44] to be able to implement TD(0) and Monte Carlo policy evaluation so things
[01:17:46] Carlo policy evaluation so things like we talked about last time um you
[01:17:48] like we talked about last time um you should be able to implement Q learning
[01:17:49] should be able to implement Q learning SARSA MC control algorithms again in
[01:17:52] SARSA MC control algorithms again in tabular
[01:17:53] tabular settings you should understand and what
[01:17:55] settings you should understand and what are the issues that can cause
[01:17:56] are the issues that can cause instability um so things like function
[01:17:59] instability um so things like function approximation bootstrapping and off
[01:18:01] approximation bootstrapping and off policy learning and have an intuitive
[01:18:03] policy learning and have an intuitive sense for why that might be concerning
[01:18:05] sense for why that might be concerning and then also you should know some of
[01:18:06] and then also you should know some of the key features in dqn that were
[01:18:09] the key features in dqn that were critical and then next week we're going
[01:18:11] critical and then next week we're going to start to talk about a very different
[01:18:12] to start to talk about a very different way to do things which is just policy
[01:18:14] way to do things which is just policy gradient methods it is similar again to
[01:18:17] gradient methods it is similar again to this you can see how important policy
[01:18:18] this you can see how important policy iteration is it's going to be similar to
[01:18:20] iteration is it's going to be similar to policy iteration um and it's kind of
[01:18:23] policy iteration um and it's kind of similar to policy iteration and Monte
[01:18:24] similar to policy iteration and Monte Carlo in certain ways and directly
[01:18:25] Carlo in certain ways and directly trying to work with the policy I'll see
[01:18:28] trying to work with the policy I'll see you then
Lecture 005
Stanford CS234 Reinforcement Learning I Policy Search 1 I 2024 I Lecture 5
Source: https://www.youtube.com/watch?v=L6OVEmV3NcE
---
Transcript
[00:00:05] hi everybody welcome back we're going to
[00:00:06] hi everybody welcome back we're going to go ahead and get started with a refresh
[00:00:08] go ahead and get started with a refresh your
[00:00:28] understanding okay hopefully everyone
[00:00:30] understanding okay hopefully everyone had a chance to think about this a
[00:00:31] had a chance to think about this a little bit more um so let's go through
[00:00:33] little bit more um so let's go through the answers the first one is
[00:00:37] true so if you are trying to evaluate
[00:00:41] true so if you are trying to evaluate the value of this is in the tabular case
[00:00:43] the value of this is in the tabular case so this is where we're assuming we're
[00:00:45] so this is where we're assuming we're going to sample each tuple at random and
[00:00:48] going to sample each tuple at random and we do a q learning update we do this an
[00:00:49] we do a q learning update we do this an infinite amount of times um we know for
[00:00:53] infinite amount of times um we know for standard tabular learning we can
[00:00:55] standard tabular learning we can converge to the true value of a policy
[00:00:58] converge to the true value of a policy um uh under as long as our learning rate
[00:01:01] um uh under as long as our learning rate schedule as such so if there's an exists
[00:01:03] schedule as such so if there's an exists a learning rate schedule under if you're
[00:01:06] a learning rate schedule under if you're decaying your learning rate um at the
[00:01:07] decaying your learning rate um at the right level then you will converge to
[00:01:10] right level then you will converge to the True Q value in the tabular case
[00:01:12] the True Q value in the tabular case because there's no function
[00:01:13] because there's no function approximation that's happening there in
[00:01:16] approximation that's happening there in the second case this is also true so we
[00:01:19] the second case this is also true so we talked a bit um about how we could think
[00:01:22] talked a bit um about how we could think about doing these things in a batch way
[00:01:24] about doing these things in a batch way where we do it over and over and over
[00:01:25] where we do it over and over and over again we take our existing data and we
[00:01:27] again we take our existing data and we run it through our either TD learning
[00:01:29] run it through our either TD learning update or our SARSA update or other
[00:01:32] update or our SARSA update or other things um and we said that the TD
[00:01:35] things um and we said that the TD learning updates if you do it in a batch
[00:01:37] learning updates if you do it in a batch way are equivalent to just taking um a
[00:01:40] way are equivalent to just taking um a certainty equivalent model which means
[00:01:43] certainty equivalent model which means you estimate the Dynamics model and you
[00:01:44] you estimate the Dynamics model and you estimate the reward model excuse me um
[00:01:47] estimate the reward model excuse me um from your existing data and then you do
[00:01:49] from your existing data and then you do dynamic
[00:01:50] dynamic programming so that's what we saw I
[00:01:52] programming so that's what we saw I think we saw that in lecture
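The certainty-equivalence idea described here can be sketched directly: estimate the dynamics and reward models by counting from the batch, then run dynamic programming on the estimates as if they were the truth (the tiny two-state dataset below is invented for illustration):

```python
import numpy as np
from collections import defaultdict

# Invented batch of (s, a, r, s') transitions gathered under one policy.
data = [(0, 0, 0.0, 1), (1, 0, 1.0, 1), (0, 0, 0.0, 1), (1, 0, 1.0, 0)]
n_states, gamma = 2, 0.9

# Certainty equivalence: estimate dynamics and rewards by counting.
counts = defaultdict(lambda: defaultdict(int))
rewards = defaultdict(list)
for s, a, r, s2 in data:
    counts[s][s2] += 1
    rewards[s].append(r)

P = np.zeros((n_states, n_states))   # estimated dynamics under the policy
R = np.zeros(n_states)               # estimated expected reward per state
for s in counts:
    total = sum(counts[s].values())
    for s2, c in counts[s].items():
        P[s, s2] = c / total
    R[s] = np.mean(rewards[s])

# Dynamic programming (policy evaluation) on the estimated model:
# solve V = R + gamma * P V exactly.
V = np.linalg.solve(np.eye(n_states) - gamma * P, R)
```

Batch TD, run to convergence over the same data, lands on this same fixed point.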
[00:01:53] think we saw that in lecture three this one is false does somebody
[00:01:57] three this one is false does somebody want to say why it's false this one is
[00:01:59] want to say why it's false this one is not true
[00:02:03] there's a number of reasons why it could
[00:02:04] there's a number of reasons why it could be false I want someone to share why why is
[00:02:07] be false I want someone to share why why is DQN not guaranteed to necessarily
[00:02:09] DQN not guaranteed to necessarily converge to the optimal Q
[00:02:16] function
[00:02:19] function yeah I mean would you need to enforce a
[00:02:23] yeah I mean would you need to enforce a certain number of iterations for it
[00:02:26] certain number of iterations for it to have any chance of converging at all
[00:02:28] to have any chance of converging at all so good point it's related to that so
[00:02:29] so good point it's related to that so certainly if you don't do enough
[00:02:30] certainly if you don't do enough iterations but even if you do an
[00:02:32] iterations but even if you do an infinite number of iterations it also
[00:02:34] infinite number of iterations it also might not be guaranteed to converge me
[00:02:36] might not be guaranteed to converge me tell me why you went infinite so right
[00:02:39] tell me why you went infinite so right you certainly need a lot of iterations
[00:02:40] you certainly need a lot of iterations but even if you had a lot of iterations
[00:02:42] but even if you had a lot of iterations you still might not be guaranteed
[00:02:44] you still still might not be guaranteed to
[00:02:44] to converge I think here it helps to think
[00:02:47] converge I think here it helps to think about what we often call realizability
[00:02:48] about what we often call realizability which is we don't know what the
[00:02:50] which is we don't know what the functional form is of Q and so you could
[00:02:52] functional form is of Q and so you could think of the fact that I'm GNA draw it
[00:02:55] think of the fact that I'm GNA draw it in as if the state space was
[00:02:57] in as if the state space was onedimensional but in general of course
[00:02:59] onedimensional but in general of course like the state space is like this Vector
[00:03:02] like the state space is like this Vector um or it's like you know images and so
[00:03:03] um or it's like you know images and so it's really high dimensional but imagine
[00:03:06] it's really high dimensional but imagine that it was onedimensional even here you
[00:03:08] that it was onedimensional even here you don't know what what your V function or
[00:03:10] don't know what what your V function or your Q function might look like and so
[00:03:13] your Q function might look like and so if you are using the wrong approximator
[00:03:15] if you are using the wrong approximator like if you are using say a
[00:03:17] like if you are using say a line instead of a multi degree
[00:03:21] line instead of a multi degree polinomial then no matter how much data
[00:03:23] polinomial then no matter how much data you have you're not going to converge to
[00:03:24] you have you're not going to converge to the optimal key function or you what be
[00:03:26] the optimal key function or you what be guaranteed that
[00:03:28] guaranteed that you because just can't even realize
[00:03:31] you because just can't even realize it so in general um and there's all
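The realizability point can be made concrete: fit a line to values generated by a quadratic, and no amount of noise-free data removes the approximation error (the target function and grid below are invented for illustration):

```python
import numpy as np

# Realizability: suppose the true Q over a 1-D state is a quadratic,
# but the approximator is a line.
s = np.linspace(-1.0, 1.0, 1001)
q_true = s ** 2

# Best possible linear fit, with dense noise-free data (least squares).
A = np.stack([s, np.ones_like(s)], axis=1)   # features: [s, 1]
w, *_ = np.linalg.lstsq(A, q_true, rcond=None)
residual = np.max(np.abs(A @ w - q_true))
# The error stays bounded away from zero no matter how much data we
# add: a line simply cannot realize s**2.
```

More data shrinks estimation error but leaves this approximation error untouched; only a richer function class removes it.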
[00:03:35] it so in general um and there's all sorts of additional instability things
[00:03:36] sorts of additional instability things that mean we can't be guaranteed it's
[00:03:38] that mean we can't be guaranteed it's going to converge so we're not
[00:03:40] going to converge so we're not guaranteed it'll converge but
[00:03:41] guaranteed it'll converge but empirically it often does pretty well so
[00:03:44] empirically it often does pretty well so we'll see and if you look at the
[00:03:46] we'll see and if you look at the empirical results it often does really
[00:03:48] empirical results it often does really quite
[00:03:49] quite well great so so far in the class we've
[00:03:53] well great so so far in the class we've talked a lot about these value function
[00:03:55] talked a lot about these value function based methods where we thought about
[00:03:57] based methods where we thought about having an explicit representation of the
[00:03:59] having an explicit representation of the expected sum of discounted rewards
[00:04:02] expected sum of discounted rewards starting in a state or starting in a
[00:04:03] starting in a state or starting in a particular state in action and so we
[00:04:06] particular state in action and so we talked a lot about value functions and Q
[00:04:08] talked a lot about value functions and Q functions um and now we're going to talk
[00:04:10] functions um and now we're going to talk a lot about policy
[00:04:12] a lot about policy search and so we're going to still think
[00:04:15] search and so we're going to still think about there being this policy which is a
[00:04:17] about there being this policy which is a mapping of states to actions or a
[00:04:20] mapping of states to actions or a mapping from States and actions to a
[00:04:24] mapping from States and actions to a number between zero and one such that it
[00:04:26] number between zero and one such that it sums to one we always have to do at
[00:04:28] sums to one we always have to do at least one action in every state but we
[00:04:31] least one action in every state but we don't necessarily have to have an
[00:04:32] don't necessarily have to have an explicit representation of the value
[00:04:34] explicit representation of the value function
[00:04:35] function anymore so these have been very popular
[00:04:40] anymore so these have been very popular and important and if we think back to
[00:04:43] and important and if we think back to what our RL algorithms involve they
[00:04:45] what our RL algorithms involve they involve optimization delayed
[00:04:47] involve optimization delayed consequences exploration and
[00:04:49] consequences exploration and generalization and we've seen examples
[00:04:51] generalization and we've seen examples of the all of these ideas so far and
[00:04:53] of the all of these ideas so far and we'll go a lot more into some of them as
[00:04:55] we'll go a lot more into some of them as we go through the course but one thing
[00:04:57] we go through the course but one thing you might be wondering about is you know
[00:05:00] you might be wondering about is you know could we play the trick that's often
[00:05:01] could we play the trick that's often done in uh computer science and try to
[00:05:03] done in uh computer science and try to reduce reinforcement learning to another
[00:05:05] reduce reinforcement learning to another problem so could we do something like
[00:05:07] problem so could we do something like just like online optimization so we know
[00:05:10] just like online optimization so we know that we uh don't know how the world
[00:05:12] that we uh don't know how the world works and we're trying to find a good
[00:05:14] works and we're trying to find a good control policy but could we do something
[00:05:16] control policy but could we do something like sort of um online optimization
[00:05:19] like sort of um online optimization where we're trying to search for a good
[00:05:21] where we're trying to search for a good policy and in this way you can view
[00:05:24] policy and in this way you can view policy gradient as being related at a high
[00:05:25] policy gradient as being related at a high level to this type of idea it's not a
[00:05:27] level to this type of idea it's not a reduction based approach but it's sort
[00:05:29] reduction based approach but it's sort of thinking about sort of well can we
[00:05:31] of thinking about sort of well can we just directly search to find a good
[00:05:33] just directly search to find a good policy and policy gradient methods have been
[00:05:35] policy and policy gradient methods have been extremely influential particularly over
[00:05:37] extremely influential particularly over the last 5 to 10 years so they're used
[00:05:39] the last 5 to 10 years so they're used for lots of areas they're used for
[00:05:41] for lots of areas they're used for things like you know sequence level
[00:05:42] things like you know sequence level training with recurrent neural networks
[00:05:44] training with recurrent neural networks that was based on REINFORCE which is an
[00:05:46] that was based on REINFORCE which is an algorithm we're going to go through
[00:05:47] algorithm we're going to go through today um it has been used for things like
[00:05:50] today um it has been used for things like end-to-end training of deep visuomotor
[00:05:53] end-to-end training of deep visuomotor policies so this was really influential
[00:05:55] policies so this was really influential work in the robotics Community about a
[00:05:57] work in the robotics Community about a decade ago I'm just going to show you a
[00:05:59] decade ago I'm just going to show you a quick quick video of
[00:06:00] quick quick video of it so this is work that was done by
[00:06:03] it so this is work that was done by Professor Chelsea Finn as part of her
[00:06:05] Professor Chelsea Finn as part of her PhD thesis along with Sergey LaVine and
[00:06:07] PhD thesis along with Sergey LaVine and others at
[00:06:11] Berkeley see if this will work with
[00:06:28] audio so what you can see there that
[00:06:30] audio so what you can see there that what they're
[00:06:32] what they're doing is what they're going to be trying
[00:06:34] doing is what they're going to be trying to do is learn from like so they showed
[00:06:36] to do is learn from like so they showed you a really they showed a big Network
[00:06:38] you a really they showed a big network and what they're trying to do is go
[00:06:40] and what they're trying to do is go directly from pixels to learn what the
[00:06:43] directly from pixels to learn what the robot should do and this is one of the
[00:06:45] robot should do and this is one of the first examples of people trying to do
[00:06:47] first examples of people trying to do this directly from images let's just go
[00:06:49] this directly from images let's just go back to some of the tasks that they're
[00:06:50] back to some of the tasks that they're using
[00:07:18] and so that was part of the motivation
[00:07:20] and so that was part of the motivation too is that they want to be able to
[00:07:22] too is that they want to be able to learn these tasks in a way that will
[00:07:23] learn these tasks in a way that will generalize
[00:07:39] so this is another example of sort of
[00:07:41] so this is another example of sort of trying to do direct policy gradient
[00:07:43] trying to do direct policy gradient methods um in order to go from like
[00:07:47] methods um in order to go from like really um large complex State spaces
[00:07:50] really um large complex State spaces into direct decisions uh now in homework
[00:07:54] into direct decisions uh now in homework two and we haven't covered po yet but
[00:07:56] two and we haven't covered po yet but you're going to be implementing proximal
[00:07:58] you're going to be implementing proximal policy optimization which is one of the
[00:08:00] policy optimization which is one of the methods that build on the sort of
[00:08:01] methods that build on the sort of methods that we're going to talk about
[00:08:02] methods that we're going to talk about today um and PPO was used
[00:08:05] today um and PPO was used as part of training ChatGPT so as you
[00:08:07] as part of training ChatGPT so as you can see all of these algorithms have
[00:08:09] can see all of these algorithms have become incredibly influential in part
[00:08:11] become incredibly influential in part because they can scale really well to
[00:08:13] because they can scale really well to extremely complex inputs whether it be
[00:08:14] extremely complex inputs whether it be images or high dimensional robotic tasks
[00:08:17] images or high dimensional robotic tasks or even things like natural language and
[00:08:19] or even things like natural language and so they're very powerful they're often
[00:08:22] so they're very powerful they're often used in conjunction with things like
[00:08:24] used in conjunction with things like State action values as we'll talk about
[00:08:26] State action values as we'll talk about later but you don't have to use them
[00:08:28] later but you don't have to use them with them so they're really useful sort
[00:08:30] with them so they're really useful sort of class of things to know
[00:08:32] of class of things to know about so in particular just like how
[00:08:35] about so in particular just like how last time we saw that you could
[00:08:37] last time we saw that you could approximate a state action value or a
[00:08:40] approximate a state action value or a value function with a set of parameters
[00:08:42] value function with a set of parameters so we can do function
[00:08:43] so we can do function approximation um in those cases we
[00:08:47] approximation um in those cases we thought of directly learning a value
[00:08:49] thought of directly learning a value function or a state action value
[00:08:50] function or a state action value function and then generate a policy
[00:08:53] function and then generate a policy from the state action value so something
[00:08:55] from the state action value so something like EG greedy where we either take what
[00:08:57] like EG greedy where we either take what the Q value suggests as the best action
[00:08:59] the Q value suggests as the best action or weact randomly and what we're going
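The epsilon-greedy rule mentioned here is only a couple of lines; a minimal sketch (the toy Q table and sizes are invented):

```python
import random
import numpy as np

def epsilon_greedy(Q, s, epsilon=0.1):
    """With probability epsilon act uniformly at random, else greedily."""
    if random.random() < epsilon:
        return random.randrange(Q.shape[1])
    return int(np.argmax(Q[s]))

# Toy Q table (values invented): in state 0 action 1 looks best.
Q = np.array([[0.0, 1.0],
              [0.5, 0.2]])
a = epsilon_greedy(Q, s=0, epsilon=0.0)   # pure greedy: action 1
```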
[00:09:01] or weact randomly and what we're going to do today instead and I'll try to be
[00:09:04] to do today instead and I'll try to be careful about not using the same um we
[00:09:06] careful about not using the same um we used W before to parameterize our state
[00:09:08] used W before to parameterize our state action values and I'm going to try to be
[00:09:10] action values and I'm going to try to be careful about using Theta just to make
[00:09:11] careful about using Theta just to make it clear um we're going to directly
[00:09:13] it clear um we're going to directly parameterize the policy and we're going
[00:09:16] parameterize the policy and we're going to try to learn parameterized policies
[00:09:18] to try to learn parameterized policies so we can think of these as like you
[00:09:19] so we can think of these as like you know deep convolutional neural networks
[00:09:21] know deep convolutional neural networks which at the end will output um either
[00:09:24] which at the end will output um either an action or if we have an action as
[00:09:26] an action or if we have an action as input we'll output a
[00:09:27] input we'll output a probability and the goal in this case as
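One common way to directly parameterize the policy with some theta is a softmax over per-action scores; a minimal sketch (the linear scoring here is my simplification, not the deep convolutional network she describes):

```python
import numpy as np

def softmax_policy(theta, s_features):
    """pi_theta(a|s): a proper distribution over actions (sums to 1)."""
    logits = theta @ s_features       # one score per action
    logits = logits - np.max(logits)  # shift for numerical stability
    p = np.exp(logits)
    return p / p.sum()

theta = np.zeros((3, 4))              # 3 actions, 4 state features (invented sizes)
probs = softmax_policy(theta, np.ones(4))
# with all-zero parameters the policy is uniform over the 3 actions
```

A deep network version just replaces the linear scores with the network's output logits; the softmax at the end is what makes it a valid stochastic policy.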
[00:09:30] probability and the goal in this case as is normal is that we want to find a way
[00:09:32] is normal is that we want to find a way to act in the world that will give us
[00:09:33] to act in the world that will give us High reward so we want to find a policy
[00:09:35] High reward so we want to find a policy with the highest value function V pi and
[00:09:38] with the highest value function V pi and we're again not going to be focusing on
[00:09:40] we're again not going to be focusing on model based learning so we're going
[00:09:43] model based learning so we're going to try to directly learn from experience
[00:09:44] to try to directly learn from experience we're not going to assume we have access
[00:09:46] we're not going to assume we have access to a model or that we're explicitly building a
[00:09:50] model and I think one of the things
[00:09:52] model and I think one of the things that's helpful to think about is there's
[00:09:54] that's helpful to think about is there's these sort of different views or lenses
[00:09:55] these sort of different views or lenses into reinforcement learning so this is a
[00:09:57] into reinforcement learning so this is a nice picture from David silver who's an
[00:09:59] nice picture from David silver who's an amazing person in reinforcement learning
[00:10:01] amazing person in reinforcement learning he was you know one of the main leads on
[00:10:03] he was you know one of the main leads on AlphaGo and a number of other incredible
[00:10:05] AlphaGo and a number of other incredible papers so we can think of it as like you
[00:10:08] papers so we can think of it as like you have some methods which are value based
[00:10:09] have some methods which are value based we're explicitly building a value
[00:10:11] we're explicitly building a value function we have other ones that are
[00:10:12] function we have other ones that are policy based and the ones that are in
[00:10:14] policy based and the ones that are in the intersection are often known as
[00:10:16] the intersection are often known as actor critic methods who here has either
[00:10:19] actor critic methods who here has either implemented or heard of actor critic
[00:10:20] implemented or heard of actor critic methods before okay so some people and
[00:10:23] methods before okay so some people and underground they're extremely popular
[00:10:25] underground they're extremely popular and in actor critic methods um you will
[00:10:27] and in actor critic methods um you will often combine between the benefit it's a
[00:10:29] often combine between the benefit it's a value based and policy and so for
[00:10:32] value based and policy and so for example alphago is a actor critic method
[00:10:35] example alphago is a actor critic method in the sense that it is often having
[00:10:37] in the sense that it is often having explicit representation of the policy
[00:10:38] explicit representation of the policy and of a value
[00:10:41] and of a value function okay so we'll get to actor
[00:10:43] function okay so we'll get to actor critic methods later today we're going
[00:10:45] critic methods later today we're going to focus on policy
[00:10:48] based so now most of
[00:10:51] based so now most of the time we've thought about policies so
[00:10:53] the time we've thought about policies so far we thought about deterministic
[00:10:54] far we thought about deterministic policies or epsilon-greedy policies and now we're
[00:10:57] policies or epsilon-greedy policies and now we're going to think much more generally about
[00:10:58] going to think much more generally about stochastic policies um and that's going
[00:11:02] stochastic policies um and that's going to be important because as we saw last
[00:11:04] to be important because as we saw last time if you only have a deterministic
[00:11:06] time if you only have a deterministic policy it's much harder to learn about
[00:11:07] policy it's much harder to learn about actions you don't try whereas now we're
[00:11:10] actions you don't try whereas now we're going to think about having stochastic
[00:11:11] going to think about having stochastic policies where you're going to be
[00:11:12] policies where you're going to be getting information about lots of
[00:11:14] getting information about lots of different
[00:11:15] different actions so let's think about a
[00:11:17] actions so let's think about a particular example also to kind of
[00:11:19] particular example also to kind of illustrate some of the things that
[00:11:21] illustrate some of the things that policy gradient methods are going to
[00:11:22] policy gradient methods are going to help us handle so who here has played
[00:11:26] help us handle so who here has played rock paper
[00:11:27] rock paper scissors okay most people I think it's
[00:11:29] scissors okay most people I think it's called Rambo in um Chinese it's a very
[00:11:32] called Rambo in um Chinese it's a very popular game throughout the world it's a
[00:11:34] popular game throughout the world it's a stochastic game where you know
[00:11:38] stochastic game where you know each side can pick um a particular
[00:11:40] each side can pick um a particular strategy and the state you could think
[00:11:43] strategy and the state you could think of there being a state you could keep
[00:11:45] of there being a state you could keep track of what your opponent has done
[00:11:46] track of what your opponent has done over time so think for a second about
[00:11:49] over time so think for a second about whether the um a deterministic policy
[00:11:51] whether the um a deterministic policy can be optimal if you're playing this
[00:11:53] can be optimal if you're playing this game
[00:11:56] repeatedly so raise your hand if a
[00:11:58] repeatedly so raise your hand if a deterministic policy can be
[00:12:01] deterministic policy can be optimal raise your hand if you think a
[00:12:03] optimal raise your hand if you think a stochastic policy is
[00:12:05] stochastic policy is optimal okay someone who said stochastic
[00:12:08] optimal okay someone who said stochastic explain
[00:12:11] why
[00:12:14] why yes it's like circular like there's no
[00:12:17] yes it's like circular like there's no one best one that's right yeah
[00:12:21] one best one that's right yeah so there's no best like there's nothing
[00:12:22] so there's no best like there's nothing that strictly dominates all the other
[00:12:24] that strictly dominates all the other strategies and also if you're
[00:12:25] strategies and also if you're deterministic what can your opponent do
[00:12:29] deterministic what can your opponent do like if I say I'm always going to pick
[00:12:31] like if I say I'm always going to pick um paper what does my opponent
[00:12:34] um paper what does my opponent do they're always gonna pick like the
[00:12:38] do they're always gonna pick like the other one like scissors to beat me so
[00:12:40] other one like scissors to beat me so anything you do that's deterministic can
[00:12:42] anything you do that's deterministic can be exploited by your opponent if you are
[00:12:44] be exploited by your opponent if you are playing repeatedly and so the optimal
[00:12:47] playing repeatedly and so the optimal thing to do here is to be stochastic the
[00:12:49] thing to do here is to be stochastic the optimal policy has to be stochastic here
[00:12:51] optimal policy has to be stochastic here otherwise all deterministic policies are
[00:12:54] otherwise all deterministic policies are strictly dominated by good stochastic
[00:12:57] strictly dominated by good stochastic policy okay and now you might think well
[00:13:00] policy okay and now you might think well all right well that sounds different
[00:13:01] all right well that sounds different than what we've seen so far but one of
[00:13:03] than what we've seen so far but one of the challenges here is the system is not
[00:13:05] the challenges here is the system is not Markov um so it's not stochastic what
[00:13:09] Markov um so it's not stochastic what your um adversary will play next they're
[00:13:11] your um adversary will play next they're not random or it might be if they're
[00:13:12] not random or it might be if they're playing a stochastic policy excuse me but
[00:13:14] playing a stochastic policy excuse me but in general it's not in that they can react
[00:13:16] in general it's not in that they can react to what you've seen so far um and it's
[00:13:18] to What You've seen so far um and it's not just like you know a random
[00:13:20] not just like you know a random environment like a coin flip on the next
[00:13:21] environment like a coin flip on the next time and in this case actually a uniform
[00:13:24] time and in this case actually a uniform random policy is optimal it's a Nash
[00:13:27] random policy is optimal it's a Nash equilibrium right
[00:13:29] equilibrium right okay so that's one case where having a
[00:13:32] okay so that's one case where having a stochastic policy would be really
[00:13:33] stochastic policy would be really helpful so you could just have a fixed
[00:13:35] helpful so you could just have a fixed stochastic policy and it would be
[00:13:37] stochastic policy and it would be optimal but you couldn't necessarily write
[00:13:39] optimal but you couldn't necessarily write this down easily as a q function and
[00:13:41] this down easily as a q function and just take the argmax there's not a
[00:13:42] just take the argmax there's not a deterministic policy for this
[00:13:44] deterministic policy for this environment that is optimal and so it's
[00:13:47] environment that is optimal and so it's less clear how you would write that down
[00:13:49] less clear how you would write that down directly in terms of a q function in
[00:13:51] directly in terms of a q function in part because the system is not Markov
[00:13:53] part because the system is not Markov so here's another example where we
[00:13:56] so here's another example where we might want to have stochastic policies
[00:13:57] might want to have stochastic policies and it's where we have aliasing or partial
[00:14:00] and it's where we have aliasing or partial observability so imagine this case where
[00:14:02] observability so imagine this case where like you have a robot that's walking
[00:14:03] like you have a robot that's walking along and you know maybe they have
[00:14:05] along and you know maybe they have sensors so they can tell how far they
[00:14:07] sensors so they can tell how far they are from the walls but under Those
[00:14:09] are from the walls but under Those sensors these two gray boxes look
[00:14:12] sensors these two gray boxes look identical because like from the agents
[00:14:15] identical because like from the agents point of view if they have only
[00:14:17] point of view if they have only immediate sensors both of those places
[00:14:18] immediate sensors both of those places will look identical and so they can't
[00:14:21] will look identical and so they can't distinguish those gray States and
[00:14:23] distinguish those gray States and imagine that um you know you just
[00:14:25] imagine that um you know you just because you have a feature
[00:14:26] because you have a feature representation that just tells you about
[00:14:28] representation that just tells you about whether you have a wall to the
[00:14:30] whether you have a wall to the north to the east to the south or to the
[00:14:31] north to the east to the south or to the west so those two gray states would look
[00:14:33] west so those two gray states would look identical if that was your feature
[00:14:35] identical if that was your feature representation so you could have a value
[00:14:38] representation so you could have a value based reinforcement learning
[00:14:39] based reinforcement learning representation where you use an approximate
[00:14:41] representation where you use an approximate value function where you take in this as
[00:14:44] value function where you take in this as the state
[00:14:45] the state representation or you could have a
[00:14:46] representation or you could have a policy based one that takes in
[00:14:50] policy based one that takes in those so the challenge here is that if
[00:14:53] those so the challenge here is that if you're value based you have to do the
[00:14:55] you're value based you have to do the same thing in those two gra States
[00:14:58] same thing in those two gra States because you you can't distinguish them
[00:15:00] because you you can't distinguish them so from your perspective it's like
[00:15:02] so from your perspective it's like you're in the same place no matter which
[00:15:03] you're in the same place no matter which of those two places you're in so if
[00:15:05] of those two places you're in so if you're going to do a value function
[00:15:07] you're going to do a value function based and then extract a deterministic
[00:15:09] based and then extract a deterministic policy you would either always have to
[00:15:12] policy you would either always have to go say to the left in those cases or
[00:15:14] go say to the left in those cases or always go to the
[00:15:16] always go to the right and neither of those would always
[00:15:20] right and neither of those would always be
[00:15:22] be good
[00:15:24] good okay so under alosine meaning that we
[00:15:27] Okay, so under aliasing, meaning that we don't know which of the two gray states we're in when we're in one of them, an optimal deterministic policy will always move west in both states or east in both states, and either way it might get stuck and never be able to reach the money. And that's what's going to happen if we do a value-based reinforcement learning approach, so that's not great: you're going to traverse this for a long time, and you're not going to be getting high reward. What could you do if you wanted to have a stochastic policy? That allows you to act randomly, or stochastically, in any state. What do you think would be the right thing to do in the gray states if you could have a stochastic policy? [Student: with just some probability you go either east or west.] Yeah, exactly. So you could just randomize it: an optimal stochastic policy will randomly move east or west in the gray states, because it doesn't know which one it's in, and half the time that'll be the right thing to do. So that means much more of the time it'll go the right way, and it generally will reach the goal state pretty quickly.
[00:16:36] quickly so this is another case where the system is not markof this is notk
[00:16:42] the system is not markof this is notk off and the state
[00:16:44] off and the state features so because we have alosine
[00:16:47] features so because we have alosine meaning the system is partially
[00:16:48] meaning the system is partially observable is not a markup system one
[00:16:51] observable is not a markup system one way to handle that is to treat it as a
[00:16:53] way to handle that is to treat it as a partially observable markof decision
[00:16:54] partially observable markof decision process Michael Cocker talks a lot about
[00:16:56] process Michael Cocker talks a lot about those in his classes um but an
[00:16:59] those in his classes um but an alternative is to use a stochastic
[00:17:01] alternative is to use a stochastic policy and you can also act very well
[00:17:04] policy and you can also act very well here so those are two examples of sort
[00:17:07] here so those are two examples of sort of the type of thing that might be able
[00:17:09] of the type of thing that might be able to EAS be easy to handle with policy
[00:17:11] to EAS be easy to handle with policy gradient methods or stochastic policies
[00:17:13] gradient methods or stochastic policies that might be hard to um tackle with the
[00:17:16] that might be hard to um tackle with the type of methods we've seen so far okay
[00:17:18] type of methods we've seen so far okay so now we have to think about if we have
[00:17:20] so now we have to think about if we have policies and in general we're going to
[00:17:21] policies and in general we're going to want them to be stochastic how are we
[00:17:23] want them to be stochastic how are we going to learn what are good policies
[00:17:25] going to learn what are good policies like we have this you know now we have a
[00:17:27] like we have this you know now we have a function space over policy and we want
[00:17:29] function space over policy and we want to learn which of them have good
[00:17:32] to learn which of them have good values so if we're in an episodic
[00:17:35] values so if we're in an episodic environment we can use the policy value
[00:17:37] environment we can use the policy value at the start state so we can just say
[00:17:39] at the start state so we can just say I'm going to similar to the Monte Carlo
[00:17:41] I'm going to similar to the Monte Carlo methods if I start in this state I run
[00:17:42] methods if I start in this state I run this policy what would be my expected
[00:17:44] this policy what would be my expected reward be until the end of the episode
[00:17:47] reward be until the end of the episode we're going to mostly focus on the
[00:17:48] we're going to mostly focus on the episodic case today um but you can
[00:17:51] episodic case today um but you can extend these to sort of more of an
[00:17:52] extend these to sort of more of an infinite Horizon
[00:17:55] infinite Horizon case all right so once we think of it in
[00:17:57] case all right so once we think of it in this way we can really think okay this
[00:17:59] this way we can really think okay this sounds like an optimization problem so
[00:18:02] sounds like an optimization problem so we really just want to find um the
[00:18:05] we really just want to find um the parameters that maximize the value so
[00:18:07] parameters that maximize the value so you could
[00:18:08] you could say here you can think of this as being
[00:18:10] say here you can think of this as being your thetas so I'm just draw it one
[00:18:12] your thetas so I'm just draw it one dimensional but in general this could be
[00:18:14] dimensional but in general this could be you know all the parameters in a deep
[00:18:15] you know all the parameters in a deep neural network and then this is V of
[00:18:18] neural network and then this is V of theta or of a particular starting state
[00:18:22] theta or of a particular starting state it might look like
[00:18:24] it might look like this and what your goal would be is to
[00:18:27] this and what your goal would be is to find the parameters of your policy that
[00:18:31] find the parameters of your policy that maximize the value
[00:18:33] maximize the value function and so this is an optimization
[00:18:36] function and so this is an optimization problem but it's a hard optimization
[00:18:38] problem but it's a hard optimization problem because we don't have that
[00:18:39] problem because we don't have that function you can only estimate it
[00:18:41] function you can only estimate it through
[00:18:42] through data and so you can imagine like you
[00:18:44] data and so you can imagine like you start off you have no idea how how Theta
[00:18:47] start off you have no idea how how Theta maps to be then you have to learn that
[00:18:48] maps to be then you have to learn that over
[00:18:53] time so once we think of it as an
[00:18:55] time so once we think of it as an optimization problem where we don't know
[00:18:57] optimization problem where we don't know what the function is there are a lot of
[00:18:59] what the function is there are a lot of different methods we could think about
[00:19:00] different methods we could think about to try to solve this problem and what
[00:19:02] to try to solve this problem and what we're going to focus on today mostly is
[00:19:04] we're going to focus on today mostly is ones that are going to exploit something
[00:19:06] ones that are going to exploit something about the structure of sequential
[00:19:08] about the structure of sequential decision
[00:19:09] decision processes but there are methods that
[00:19:11] processes but there are methods that completely ignore kind of all of this um
[00:19:15] completely ignore kind of all of this um and in particular you can even use
[00:19:17] and in particular you can even use things that completely ignore gradients
[00:19:18] things that completely ignore gradients so you can do things like hill climbing
[00:19:20] so you can do things like hill climbing or genetic algorithms or cross entropy
[00:19:23] or genetic algorithms or cross entropy methods where you may not think of sort
[00:19:25] methods where you may not think of sort of any of the type of structure of the
[00:19:26] of any of the type of structure of the parameter space and in some cases that
[00:19:29] parameter space and in some cases that can work really
[00:19:31] can work really well so there's a really nice example um
[00:19:34] well so there's a really nice example um by my colleague uh step Collins who's
[00:19:37] by my colleague uh step Collins who's over in the mechanical engineering
[00:19:38] over in the mechanical engineering department he does some really
[00:19:40] department he does some really interesting work on
[00:19:50] [Student question, off mic.] Yes, but you can make it a distribution over actions: you can output something between zero and one for each action, and so then you would have a stochastic policy out of the network. You would compute that for all the actions, and then you'd have to pick a random number and use that to select an action. Yeah, good question.
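A sketch of the mechanics of that answer, assuming the network emits one score (logit) per action: softmax the scores into probabilities, then use one random number to pick the action.

```python
import math, random

# Turn per-action scores into a stochastic policy: softmax, then
# inverse-CDF sampling with a single uniform random number.
def sample_action(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]     # shift by max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    r, cum = random.random(), 0.0
    for a, p in enumerate(probs):
        cum += p
        if r < cum:
            return a                             # first action whose CDF exceeds r
    return len(probs) - 1                        # guard against rounding
```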
[00:20:16] question so my colleagu uh Stephen Collings over in mechanical engineering
[00:20:18] Collings over in mechanical engineering does a lot of work on exoskeletons there
[00:20:20] does a lot of work on exoskeletons there are lots of reasons exoskeletons could
[00:20:21] are lots of reasons exoskeletons could be really helpful particularly for
[00:20:23] be really helpful particularly for people that have motor impairments but
[00:20:25] people that have motor impairments but one of the challenges of them is that uh
[00:20:27] one of the challenges of them is that uh you have to them actually help people
[00:20:29] you have to them actually help people walk so um if you clamp something on to
[00:20:32] walk so um if you clamp something on to say your leg then the way my physical
[00:20:35] say your leg then the way my physical configuration is may not be the same as
[00:20:37] configuration is may not be the same as your physical configuration I'm pretty
[00:20:38] your physical configuration I'm pretty tall and so you'd really like to make
[00:20:40] tall and so you'd really like to make sure that this helps each individual in
[00:20:42] sure that this helps each individual in the best way possible but you don't want
[00:20:44] the best way possible but you don't want it to have to learn over the course of
[00:20:46] it to have to learn over the course of like you know two years how to best
[00:20:48] like you know two years how to best optimize to someone because they're not
[00:20:50] optimize to someone because they're not going to wait that long to use it so
[00:20:52] going to wait that long to use it so what they did is they use policy methods
[00:20:54] what they did is they use policy methods policy search methods to quickly
[00:20:57] policy search methods to quickly personalize the parameters of an
[00:20:59] personalize the parameters of an exoskeleton um and they called this
[00:21:00] exoskeleton um and they called this humanin the loop exoskeleton
[00:21:02] humanin the loop exoskeleton optimization so the idea in this case is
[00:21:04] optimization so the idea in this case is what they're trying to figure out is
[00:21:06] what they're trying to figure out is what is the parameters of their
[00:21:08] what is the parameters of their exoskeleton and what they're looking at
[00:21:10] exoskeleton and what they're looking at is essentially how much it helps you
[00:21:12] is essentially how much it helps you walk so how much it reduces the effort
[00:21:14] walk so how much it reduces the effort needed to walk and so what they could do
[00:21:16] needed to walk and so what they could do in this case they're not using like a
[00:21:18] in this case they're not using like a gradient bath method they're using just
[00:21:20] gradient bath method they're using just CES which is so they train is of
[00:21:23] CES which is so they train is of continuous optimization is they'd have
[00:21:25] continuous optimization is they'd have people walk under sort of a few
[00:21:26] people walk under sort of a few different control parameters
[00:21:29] different control parameters they would see which of those seem to be
[00:21:30] they would see which of those seem to be most effective and then they would move
[00:21:32] most effective and then they would move the policies they' try in that direction
[00:21:34] the policies they' try in that direction with some
[00:21:35] with some stochasticity and I think it was within
[00:21:37] stochasticity and I think it was within maybe two or 3 hours using this they
[00:21:40] maybe two or 3 hours using this they could find substantially better policies
[00:21:42] could find substantially better policies I think it increased metabolic
[00:21:44] I think it increased metabolic efficiency like maybe by 20% or 30% it
[00:21:47] efficiency like maybe by 20% or 30% it was pretty remarkable and so this was
[00:21:49] was pretty remarkable and so this was published in science about um seven
[00:21:50] published in science about um seven eight years ago but that's another
[00:21:52] eight years ago but that's another example of a place where you can do this
[00:21:54] example of a place where you can do this sort of like online optimization you
[00:21:56] sort of like online optimization you don't necess have to think about the
[00:21:57] don't necess have to think about the temporal structure of the
[00:22:01] policy
[00:22:03] [Student question:] Do you start with a default policy and then try to improve that? Great question. Yeah, so in all of these cases we're going to have to assume that we initialize our policy parameterization in some way, just like how we initialized our value function to zero to start, or, for the deep Q-network, whatever your neural network parameters were initialized to. Yeah, great question.
[00:22:29] Now, it's just useful to know about these because they often work pretty well. I think sometimes we like to leverage the structure specific to, say, our Markov decision process, but in some cases just using these methods, which may not use very much structure at all, can actually do really well. So it's just good to keep in mind that there are a lot of ways to do online optimization. All right, so this is often a great baseline to try. The great thing about it is that it can work with any policy parameterization, even if it's not differentiable, because it's not using gradients, so the policy doesn't need to be differentiable. And it's also often very easy to parallelize. So in CMA-ES, for those of you who haven't seen it before, you'll have a number of different policies that you try in parallel, and then you'll use that to update and shift to another set of policies; that's what Professor Collins did. And in a lot of cases the problems we think about are places where you'll have many customers, or many, many robot arms, so you can parallelize things. One of the limitations is that it's often less data efficient, because it's ignoring the temporal structure; so if you have temporal structure, or you have gradient information, it may be more effective to use that.
[00:23:44] that all right so what we're going to focus on in this class is differentiable
[00:23:47] focus on in this class is differentiable methods so we're going to focus on
[00:23:48] methods so we're going to focus on places where we can do stochastic
[00:23:51] places where we can do stochastic gradient descent including on the policy
[00:23:54] gradient descent including on the policy parameterization so if we have like our
[00:23:56] parameterization so if we have like our policy parameterized by a deep neural
[00:23:58] policy parameterized by a deep neural network we can propagate through that
[00:24:00] network we can propagate through that and update it update those
[00:24:03] and update it update those parameters so we're going to um we're
[00:24:06] parameters so we're going to um we're going to focus here mostly on methods
[00:24:08] going to focus here mostly on methods that do use gradient descent and that
[00:24:11] that do use gradient descent and that often Leverage The sequential structure
[00:24:12] often Leverage The sequential structure of that we're making a sequence of
[00:24:14] of that we're making a sequence of decisions and we want to optimize to
[00:24:16] decisions and we want to optimize to make those sequence of
[00:24:19] decisions so to do that we're going to
[00:24:21] decisions so to do that we're going to explicitly Define the gradient and we're
[00:24:23] explicitly Define the gradient and we're going to write down the value function
[00:24:25] going to write down the value function in terms of as a function of the policy
[00:24:27] in terms of as a function of the policy parameters so that we can be
[00:24:29] parameters so that we can be clear that this value function um you
[00:24:33] clear that this value function um you know relies on those policies and we're
[00:24:36] know relies on those policies and we're going to focus today on episodic markof
[00:24:38] going to focus today on episodic markof decision processes where we go for a
[00:24:40] decision processes where we go for a single episode stop reset and keep
[00:24:43] single episode stop reset and keep going
[00:24:45] going right so now what we're going to do is
[00:24:48] right so now what we're going to do is we're only going to be trying to get in
[00:24:49] we're only going to be trying to get in general to a local maximum now it's
[00:24:53] general to a local maximum now it's possible you're lucky and you're sort of
[00:24:55] possible you're lucky and you're sort of convex in the in the space of the the
[00:24:58] convex in the in the space of the the value it's bace of the policy parameters
[00:25:00] value it's bace of the policy parameters but in general we're not going to assume
[00:25:02] but in general we're not going to assume convexity so at best we're going to hope
[00:25:04] convexity so at best we're going to hope to just get to some sort of local Maxima
[00:25:06] to just get to some sort of local Maxima in our
[00:25:07] in our space so if we have again if we only had
[00:25:11] space so if we have again if we only had one parameter and we have something like
[00:25:14] one parameter and we have something like this where we might get to here we might
[00:25:17] this where we might get to here we might get to here in general we're not going
[00:25:18] get to here in general we're not going to make Global optimality guarantees
[00:25:21] to make Global optimality guarantees this is in big cont been contrast to the
[00:25:24] this is in big cont been contrast to the tabular cases we saw before we were
[00:25:25] tabular cases we saw before we were guaranteed to get to the optimal Q
[00:25:27] guaranteed to get to the optimal Q function optimal value function now
[00:25:30] function optimal value function now we're just going to hope to given our
[00:25:32] we're just going to hope to given our policy parameterization let's try to get
[00:25:35] policy parameterization let's try to get to What's um a local Optima in that
[00:25:38] to What's um a local Optima in that policy
[00:25:40] policy parameterization so it's sort of a
[00:25:41] parameterization so it's sort of a policy specific policy class specific
[00:25:44] policy specific policy class specific guarantee and it's only a local Optima
[00:25:47] guarantee and it's only a local Optima and what we'll be doing is we're just
[00:25:48] and what we'll be doing is we're just going to be trying to sort of take the
[00:25:49] going to be trying to sort of take the gradient of the policy with respect to
[00:25:51] gradient of the policy with respect to the
[00:25:52] the parameters okay and as usual we're going
[00:25:54] parameters okay and as usual we're going to have a step size parameter so we're
[00:25:56] to have a step size parameter so we're going to take the gradient of the value
[00:25:58] going to take the gradient of the value fun with respect to the parameters and
[00:25:59] fun with respect to the parameters and take a small step and the key thing is
[00:26:02] take a small step and the key thing is going to be thinking about places where
[00:26:03] going to be thinking about places where we can do this All directly using sort
[00:26:06] we can do this All directly using sort of smooth
[00:26:10] functions okay now one way you could do
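A minimal sketch of that update rule, theta <- theta + alpha * dV/dtheta, on a made-up one-parameter value surface; in practice the gradient must be estimated from sampled episodes rather than known in closed form.

```python
# Gradient ascent on the policy-value objective with step size alpha.
def ascend(grad, theta=0.0, alpha=0.1, steps=100):
    for _ in range(steps):
        theta += alpha * grad(theta)    # theta <- theta + alpha * dV/dtheta
    return theta

V = lambda theta: -(theta - 1.0) ** 2   # toy value surface, maximized at theta = 1
dV = lambda theta: -2.0 * (theta - 1.0)
```

Starting from theta = 0 this climbs to the maximizer theta = 1; on a non-concave surface it would stop at whichever local maximum the initialization feeds into, matching the local-optimum caveat above.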
[00:26:13] functions okay now one way you could do this of course when you see this now you
[00:26:15] this of course when you see this now you probably immediately think of Auto diff
[00:26:16] probably immediately think of Auto diff methods and think about we can just back
[00:26:18] methods and think about we can just back parate Etc but it's worth noting that
[00:26:20] parate Etc but it's worth noting that when these methods began to start to get
[00:26:23] when these methods began to start to get popular um they didn't necessarily have
[00:26:25] popular um they didn't necessarily have Auto diff yet uh I know they there was
[00:26:28] Auto diff yet uh I know they there was research then still um and what one of
[00:26:31] research then still um and what one of the things people started thinking about
[00:26:32] the things people started thinking about for this is how you could use this for
[00:26:34] for this is how you could use this for robotics so this is a nice paper from
[00:26:36] robotics so this is a nice paper from 2004 so 20 years ago by Peter Stones
[00:26:39] 2004 so 20 years ago by Peter Stones group it was right around then I think
[00:26:41] group it was right around then I think maybe RoboCop was maybe 6 years old then
[00:26:43] maybe RoboCop was maybe 6 years old then or something um I think they started
[00:26:45] or something um I think they started back in 1998 or so so they're these
[00:26:47] back in 1998 or so so they're these little quadruped robots and the goal was
[00:26:49] little quadruped robots and the goal was to think about having getting robotics
[00:26:51] to think about having getting robotics to the stage where you could have robots
[00:26:54] to the stage where you could have robots play human players I think that was the
[00:26:56] play human players I think that was the goal by either 2030 or 2050 I forget but
[00:26:59] goal by either 2030 or 2050 I forget but what they was going to start with it was
[00:27:00] what they was going to start with it was quadrupeds so one of the big challenges
[00:27:03] quadrupeds so one of the big challenges at the beginning because you know
[00:27:04] at the beginning because you know everywhere you start with the beginning
[00:27:05] everywhere you start with the beginning challenges and you go from there is just
[00:27:07] challenges and you go from there is just getting them to walk fast enough so if
[00:27:09] getting them to walk fast enough so if they're going to score goals and they're
[00:27:11] they're going to score goals and they're going to compete you need them to walk
[00:27:12] going to compete you need them to walk quickly and so there's this question of
[00:27:14] quickly and so there's this question of just how do you learn fast walks you
[00:27:17] just how do you learn fast walks you know so that they can sort of trying to
[00:27:19] know so that they can sort of trying to teach robots to run and what they found
[00:27:21] teach robots to run and what they found here is that they could use policy
[00:27:23] here is that they could use policy methods and policy search methods just
[00:27:25] methods and policy search methods just to learn a faster way for it to walk and
[00:27:28] to learn a faster way for it to walk and so they parameterized sort of the curve
[00:27:31] so they parameterized sort of the curve of how the the foot moves as a set of
[00:27:33] of how the the foot moves as a set of parameters and that defined the policy
[00:27:35] parameters and that defined the policy for moving those joints and then what
[00:27:37] for moving those joints and then what they did is they just had these walk
[00:27:39] they did is they just had these walk back and forth many many times and what
[00:27:41] back and forth many many times and what they would do is they'd have them walk
[00:27:43] they would do is they'd have them walk um you know with some particular policy
[00:27:46] um you know with some particular policy parameters they would see how fast they
[00:27:48] parameters they would see how fast they walked they would do finite different
[00:27:50] walked they would do finite different methods so they weren't trying to
[00:27:51] methods so they weren't trying to explicitly do auto dip or anything there
[00:27:54] explicitly do auto dip or anything there and then they would slightly change the
[00:27:56] and then they would slightly change the policy parameters and repeat
[00:27:59] policy parameters and repeat and they substantially they learned a
[00:28:01] and they substantially they learned a substantially faster walk during that
[00:28:03] substantially faster walk during that and I think it took maybe around four
[00:28:05] and I think it took maybe around four hours or so but they just had to replace
[00:28:07] hours or so but they just had to replace the batteries a couple
[00:28:09] the batteries a couple time so just an example to say like you
[00:28:11] time so just an example to say like you know it's lovely to have autoi you can
[00:28:13] know it's lovely to have autoi you can do really complicated things now but
[00:28:14] do really complicated things now but these methods can work even in really
[00:28:16] these methods can work even in really basic settings particularly where you
[00:28:18] basic settings particularly where you think you have pretty bad models of like
[00:28:20] think you have pretty bad models of like how the world works and so now you can
[00:28:22] how the world works and so now you can just be directly data driven and why is
[00:28:24] just be directly data driven and why is this as a hard problem for those of you
[00:28:26] this as a hard problem for those of you that haven't done robotics it involves a
[00:28:28] that haven't done robotics it involves a whole bunch of contact forces um you
[00:28:31] whole bunch of contact forces um you know the ground may be uh well there
[00:28:34] know the ground may be uh well there they have to learn on this particular
[00:28:35] they have to learn on this particular ground you may not know because it's
[00:28:37] ground you may not know because it's commercial Hardware you may not know
[00:28:39] commercial Hardware you may not know exactly all the parameters that the
[00:28:40] exactly all the parameters that the designers put in so you can just be data
[00:28:43] designers put in so you can just be data driven okay as opposed to maybe having
[00:28:45] driven okay as opposed to maybe having like a physics
[00:28:48] like a physics simulator all right so just to summarize
[00:28:50] simulator all right so just to summarize so far the benefits of policy based RL
[00:28:52] so far the benefits of policy based RL is that um we're going to have often
[00:28:56] is that um we're going to have often better convergence properties often
[00:28:58] better convergence properties often going to be able to guarantee that we
[00:28:59] going to be able to guarantee that we get to a local Optima whereas we didn't
[00:29:01] get to a local Optima whereas we didn't have that for deep Q learning um they're
[00:29:03] have that for deep Q learning um they're often really effective in high
[00:29:04] often really effective in high dimensional or continuous action spaces
[00:29:07] dimensional or continuous action spaces and you can learn stochastic
[00:29:09] and you can learn stochastic policies but the methods we've seen so
[00:29:11] policies but the methods we've seen so far might be more inefficient and higher
[00:29:13] far might be more inefficient and higher variance and we often only get to
[00:29:15] variance and we often only get to something of a local
[00:29:17] something of a local Optima and we'll see some things to help
[00:29:20] Optima and we'll see some things to help with kind of the inefficiency in a
[00:29:23] with kind of the inefficiency in a second all right so now what we're going
[00:29:25] second all right so now what we're going to dive into is how do we do this when
[00:29:27] to dive into is how do we do this when we are willing to have differentiable
[00:29:33] policies so the hope is that we can
[00:29:36] policies so the hope is that we can actually compute the policy gradient
[00:29:38] actually compute the policy gradient analytically so we don't have to do it
[00:29:40] analytically so we don't have to do it with finite
[00:29:42] with finite differences and we're going to focus on
[00:29:44] differences and we're going to focus on policies where it's differentiable as
[00:29:46] policies where it's differentiable as long as it's non zero so we're going to
[00:29:49] long as it's non zero so we're going to assume that we can always compute the
[00:29:50] assume that we can always compute the gradient of the policy parameters
[00:29:53] gradient of the policy parameters themselves and there are a number of
[00:29:54] themselves and there are a number of different classes we can do this for um
[00:29:56] different classes we can do this for um and there are many popular classes uh
[00:29:58] and there are many popular classes uh including of course deep neural
[00:30:01] including of course deep neural networks so popular ones are often
[00:30:05] networks so popular ones are often softmax softmax is used all the time
[00:30:07] softmax is used all the time I'll explain what it is in a second
[00:30:12] gaussian um and neural networks and again just to be clear here what I mean
[00:30:13] again just to be clear here what I mean by a policy class is what is the
[00:30:16] by a policy class is what is the functional form we are using to give us
[00:30:18] functional form we are using to give us a
[00:30:21] probability of an action given a
[00:30:24] probability of an action given a state so are we having something like
[00:30:26] state so are we having something like well I guess we can just see on this
[00:30:28] well I guess we can just see on this next slide what what these will look
[00:30:31] next slide what what these will look like okay so we're going to assume I'm
[00:30:34] like okay so we're going to assume I'm going to give you some examples of those
[00:30:35] going to give you some examples of those of what you know what softmax um and G
[00:30:38] of what you know what softmax um and G neural networks look like in a second in
[00:30:40] neural networks look like in a second in terms of how how we differentiate them
[00:30:43] terms of how how we differentiate them but these are just different ways for us
[00:30:44] but these are just different ways for us to parameterize what is the probability
[00:30:46] to parameterize what is the probability of an action given a
[00:30:47] of an action given a state um actually I guess I'll give a
[00:30:49] state um actually I guess I'll give a quick example with Gan for Gan you could
[00:30:52] quick example with Gan for Gan you could imagine let's imagine I have a robot and
[00:30:54] imagine let's imagine I have a robot and I'm trying to figure out say um how much
[00:30:57] I'm trying to figure out say um how much speed to apply then you might have a
[00:31:00] speed to apply then you might have a policy class that
[00:31:01] policy class that says the action I take is equal to a
[00:31:05] says the action I take is equal to a Galan centered around
[00:31:08] Galan centered around 0.5 with some standard deviation so it
[00:31:11] 0.5 with some standard deviation so it would be a stochastic policy um and it
[00:31:15] would be a stochastic policy um and it would say the average amount of speed
[00:31:17] would say the average amount of speed you're going to apply is 0.5 but you're
[00:31:18] you're going to apply is 0.5 but you're going to have some variability around
[00:31:20] going to have some variability around that that would give you some stochastic
[00:31:23] that that would give you some stochastic Behavior sometimes the robot would go
[00:31:25] Behavior sometimes the robot would go really slowly sometimes it would go fast
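That stochastic Gaussian speed policy can be sketched in a few lines. The mean of 0.5 is from the example; the standard deviation of 0.3 is my own illustrative choice, and in practice the mean (and often the standard deviation) would be a learned function of the state:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_policy_action(mean=0.5, std=0.3):
    """Stochastic Gaussian policy over a 1-D continuous action (e.g. speed):
    a ~ N(mean, std^2). Hypothetical constants for illustration only."""
    return rng.normal(mean, std)

# draw many actions: the average speed is near 0.5, but with variability,
# and occasionally the sampled action is even negative
actions = np.array([gaussian_policy_action() for _ in range(10_000)])
```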
[00:31:27] really slowly sometimes it would go fast so it would go in a negative
[00:31:30] so it would go in a negative Direction okay so let's keep assuming
[00:31:32] Direction okay so let's keep assuming that the policy is differentiable
[00:31:34] that the policy is differentiable whenever it's non zero we know the
[00:31:35] whenever it's non zero we know the gradient that still doesn't tell us how
[00:31:37] gradient that still doesn't tell us how to solve policy gradient methods yet
[00:31:40] to solve policy gradient methods yet because what we want to do is take
[00:31:41] because what we want to do is take derivatives of the value function so we
[00:31:44] derivatives of the value function so we want to say I want to find the maximum
[00:31:47] want to say I want to find the maximum the policy that has the best value
[00:31:48] the policy that has the best value function which means I'm going to need
[00:31:49] function which means I'm going to need to take the derivative of the value
[00:31:51] to take the derivative of the value function with respect to the policy
[00:31:53] function with respect to the policy parameters okay so remember that the
[00:31:56] parameters okay so remember that the policy value the value of the initial
[00:31:58] policy value the value of the initial starting State under a policy is going
[00:32:00] starting State under a policy is going to be the expected dis expected sum of
[00:32:03] to be the expected dis expected sum of rewards we don't have to use discounting
[00:32:06] rewards we don't have to use discounting for most of today if we assume it's
[00:32:07] for most of today if we assume it's finite let just say we're assume we're
[00:32:11] finite let just say we're assume we're in the episodic
[00:32:13] in the episodic case and so this is finite so no
[00:32:16] case and so this is finite so no discount counting for
[00:32:21] now so we don't need discounting for now
[00:32:23] now so we don't need discounting for now because it's always a finite length so
[00:32:24] because it's always a finite length so we're never going to have infinite
[00:32:26] we're never going to have infinite reward so the policy value is just the
[00:32:28] reward so the policy value is just the expected sum of discounted rewards when
[00:32:30] expected sum of discounted rewards when we follow the policy parameterized by
[00:32:32] we follow the policy parameterized by Theta till the end of the episode
[00:32:35] Theta till the end of the episode starting from the state
[00:32:38] starting from the state s0 and there are lots of different ways
[00:32:40] s0 and there are lots of different ways for us to write this
[00:32:42] for us to write this down so one ways for us to write down B
[00:32:45] down so one ways for us to write down B of Sr is equal to well it's equal to the
[00:32:48] of Sr is equal to well it's equal to the state action value averaged over the
[00:32:51] state action value averaged over the probability of us taking each of those
[00:32:53] probability of us taking each of those actions under our
[00:32:55] actions under our policy so this here just says what is
[00:32:58] policy so this here just says what is the probability of me taking this action
[00:33:00] the probability of me taking this action starting state s0 if I have policy
[00:33:02] starting state s0 if I have policy parameterized by Theta times what is my
[00:33:05] parameterized by Theta times what is my Q value of starting in that state taking
[00:33:08] Q value of starting in that state taking that particular action and then
[00:33:09] that particular action and then following that policy for the rest of
[00:33:12] following that policy for the rest of it okay so this is one way to write it
[00:33:15] it okay so this is one way to write it but we can also think of a quite
[00:33:16] but we can also think of a quite different way which is let's think about
[00:33:19] different way which is let's think about trajectories I'll write down what these
[00:33:21] trajectories I'll write down what these are in a
[00:33:23] are in a second okay so this is a
[00:33:26] trajectory so what's a trajectory it's going to be s0 then an action a0 then s1 and so on, a sequence of states and actions sampled from following pi Theta, so tau = (s0, a0, s1, a1, ...), and R of tau is going to be the reward for that trajectory, the sum of the rewards along it, which is what we've called the return G before so another way we can think of the value is we say well let's just sum over
[00:34:15] value is we say well let's just sum over all possible trajectories we could reach
[00:34:17] all possible trajectories we could reach under this policy and what would be the
[00:34:19] under this policy and what would be the reward for each of those trajectories
[00:34:21] reward for each of those trajectories and I'm just going to take a weighted
[00:34:22] and I'm just going to take a weighted sum now of course you might be thinking
[00:34:24] sum now of course you might be thinking that's totally intractable and yes in
[00:34:26] that's totally intractable and yes in general um if H you have a really long
[00:34:29] general um if H you have a really long trajectory then it's going to be inct
[00:34:32] trajectory then it's going to be inct and you have a really large State space
[00:34:33] and you have a really large State space and you could reach many states in
[00:34:34] and you could reach many states in general it's not going to be possible to
[00:34:35] general it's not going to be possible to actually enumerate this but this is
[00:34:38] actually enumerate this but this is mathematically well defined this is just
[00:34:40] mathematically well defined this is just an expectation over the reward of
[00:34:43] an expectation over the reward of trajectories and we know whenever we see
[00:34:45] trajectories and we know whenever we see expectations that we can approximate
[00:34:47] expectations that we can approximate those with finite samples you can think
[00:34:49] those with finite samples you can think of just taking n samples just like what
[00:34:51] of just taking n samples just like what we saw with Monte Carlo methods and
[00:34:53] we saw with Monte Carlo methods and using that to approximate the expectation so in general this is intractable but we can approximate by
[00:35:13] sampling all right so this is one way we
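The sampling idea can be checked on a toy two-step episodic problem where the exact expectation is easy to compute by hand (the transition probabilities and rewards below are invented purely for illustration):

```python
import random

random.seed(0)

def sample_trajectory_return():
    """Sample one trajectory from a hypothetical two-step episodic problem
    and return R(tau), the sum of rewards along that trajectory."""
    r = 1.0 if random.random() < 0.7 else 0.0   # step 1: reward 1 w.p. 0.7
    r += 2.0 if random.random() < 0.5 else 0.0  # step 2: reward 2 w.p. 0.5
    return r

# V(theta) = E[R(tau)], approximated by averaging over n sampled trajectories;
# the exact value here is 0.7*1.0 + 0.5*2.0 = 1.7
n = 100_000
v_estimate = sum(sample_trajectory_return() for _ in range(n)) / n
```

Enumerating all trajectories is hopeless in general, but drawing samples by running the policy is cheap, which is exactly the point being made here.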
[00:35:17] sampling all right so this is one way we could write down so this is another also
[00:35:19] could write down so this is another also valid way to write down what is the
[00:35:21] valid way to write down what is the value starting the state and following
[00:35:22] value starting the state and following the
[00:35:25] the policy okay so I've written that down more neatly here P(tau; Theta) is the probability over trajectories when you execute the policy starting from state s0 and R(tau) is the sum of the rewards for a
[00:35:36] sum of the words for a trajectory in this class we're going to
[00:35:38] trajectory in this class we're going to focus on this latter definition but
[00:35:41] focus on this latter definition but inside of set and Bardo they have a nice
[00:35:43] inside of set and Bardo they have a nice way to think about policy gradient
[00:35:44] way to think about policy gradient methods that starts from the other
[00:35:46] methods that starts from the other definition so you can always look at
[00:35:47] definition so you can always look at that but both are totally valid
[00:35:53] definitions all right so now we're going
[00:35:55] definitions all right so now we're going to focus on thinking about likely Hood
[00:35:57] to focus on thinking about likely Hood ratio policies so we're going to be
[00:35:59] ratio policies so we're going to be thinking about this case where we have a
[00:36:01] thinking about this case where we have a distribution over trajectories and then
[00:36:04] distribution over trajectories and then what is the sum of rewards for each of
[00:36:05] what is the sum of rewards for each of those
[00:36:06] those trajectories so we have our value
[00:36:08] trajectories so we have our value function and now what we want to do is
[00:36:10] function and now what we want to do is find the
[00:36:12] find the argmax so that we
[00:36:14] argmax over Theta so that we maximize the probability of getting trajectories with high
[00:36:19] getting trajectories with high reward so that's nice so instead of just
[00:36:21] reward so that's nice so instead of just thinking about the value function we now
[00:36:23] thinking about the value function we now can think of it as okay I want to have
[00:36:25] can think of it as okay I want to have policies that induce trajectories
[00:36:26] policies that induce trajectories through the state space through this
[00:36:28] through the state space through this state in action space they give me high
[00:36:32] reward so what we're going to need to be
[00:36:33] reward so what we're going to need to be able to do is to take a um a gradient
[00:36:37] able to do is to take a um a gradient through the right hand
[00:36:39] through the right hand side okay so that's what we're going to
[00:36:41] side okay so that's what we're going to do
[00:36:44] now okay so we're going to take the gradient of this because once we have
[00:36:50] gradient of this because once we have the gradient of the value function with
[00:36:51] the gradient of the value function with respect to the policy parameters we can
[00:36:53] respect to the policy parameters we can update our policy parameters to increase
[00:36:56] update our policy parameters to increase hopefully the value of the policy that
[00:36:58] hopefully the value of the policy that we're at okay so what we're going to do
[00:37:00] we're at okay so what we're going to do is we're going to say we're going to
[00:37:01] is we're going to say we're going to take the gradient with respect to the
[00:37:02] take the gradient with respect to the right hand side we can rewrite this by
[00:37:06] right hand side we can rewrite this by pushing in the
[00:37:13] gradient okay now R of tau doesn't depend
[00:37:16] gradient okay now R of to doesn't depend on the policy parameters that's just
[00:37:18] on the policy parameters that's just what is the reward once you've told me
[00:37:20] what is the reward once you've told me what a trajectory is so we can put that
[00:37:22] what a trajectory is so we can put that on the other side
[00:37:27] this is the only part that depends on
[00:37:29] this is the only part that depends on the policy
[00:37:30] the policy parameters and now I'm going to play um
[00:37:33] parameters and now I'm going to play um a trick I'm going to
[00:37:39] note that this is going to be equal to
[00:37:44] note that this is going to be equal to well I'm going to do something that's
[00:37:44] well I'm going to do something that's going to seem not very helpful for a
[00:37:46] going to seem not very helpful for a second and then we'll see why it's
[00:37:48] second and then we'll see why it's helpful just going to multiply and
[00:37:50] helpful just going to multiply and divide by the probability of a
[00:37:55] trajectory I haven't done anything I've
[00:37:57] trajectory I haven't done anything I've just multiplied by one and I've happened
[00:37:59] just multiplied by one and I've happened to multiply by the top and bottom by the
[00:38:02] to multiply by the top and bottom by the probability of that trajectory but then
[00:38:04] probability of that trajectory but then I'm going to note
[00:38:07] I'm going to note that the derivative with respect to log
[00:38:10] that the derivative with respect to log of the trajectory in Theta is just equal
[00:38:14] of the trajectory in Theta is just equal to one over the probability
[00:38:17] OFA time the derivative with respect to
[00:38:22] OFA time the derivative with respect to the trajectory in
[00:38:24] the trajectory in Theta okay because the derivative of log
[00:38:26] Theta okay because the derivative of log is just equal to one over the value
[00:38:28] is just equal to one over the value times the derivative of the thing inside
[00:38:30] times the derivative of the thing inside the log so that looks exactly like this
[00:38:35] the log so that looks exactly like this okay so that's the trick that we're
[00:38:38] okay so that's the trick that we're playing here and so we can rewrite this
[00:38:40] playing here and so we can rewrite this then as the probability I'll tell you
[00:38:43] then as the probability I'll tell you why we do this in a
[00:38:52] second
[00:38:53] second okay all right and let me just rewrite
[00:38:55] okay all right and let me just rewrite it one more time so it's easier to
[00:39:09] see why did we do this okay the reason
[00:39:12] see why did we do this okay the reason we did this is that in general it's
[00:39:14] we did this is that in general it's going to be hard for us to think about
[00:39:15] going to be hard for us to think about or it might be tricky for us to think
[00:39:17] or it might be tricky for us to think about how do we propagate our derivative
[00:39:19] about how do we propagate our derivative through something that's an expectation
[00:39:21] through something that's an expectation we had an expectation over all the
[00:39:22] we had an expectation over all the trajectories waited by the reward of
[00:39:24] trajectories waited by the reward of those trajectories we now want to want
[00:39:26] those trajectories we now want to want to take a gradient with with respect to
[00:39:27] to take a gradient with with respect to it we want to end up with something that
[00:39:30] it we want to end up with something that is computable from samples um because
[00:39:32] is computable from samples um because it's easy for us to get samples we can
[00:39:34] it's easy for us to get samples we can actually run our policy in the
[00:39:35] actually run our policy in the environment so by playing this trick
[00:39:38] environment so by playing this trick what we now have is something that we
[00:39:39] what we now have is something that we can also sample because this is now an
[00:39:41] can also sample because this is now an expectation over trajectories of the
[00:39:44] expectation over trajectories of the reward of the trajectory weighted by the
[00:39:46] reward of the trajectory weighted by the gradient of the log of the probability
[00:39:48] gradient of the log of the probability of that trajectory okay and we'll talk
[00:39:50] of that trajectory okay and we'll talk soon about how you compute this part but
[00:39:53] soon about how you compute this part but this expectation can be sampled
[00:39:57] this expectation can be sampled okay because this is just a probability
[00:39:59] okay because this is just a probability over trajectories and so we could sample
[00:40:01] over trajectories and so we could sample say 100 of them and approximate that outer
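Here's a sketch of sampling that expectation on a one-step toy problem where the true gradient is known exactly, so the likelihood-ratio estimator can be checked. The Bernoulli "policy" and reward are my own invention: a trajectory is a single action a ~ Bernoulli(theta) with reward R(a) = a, so J(theta) = theta and the true gradient is exactly 1:

```python
import random

random.seed(0)

def likelihood_ratio_gradient(theta, n=200_000):
    """Score-function (likelihood ratio) gradient estimate:
    (1/n) * sum over sampled tau of R(tau) * d/dtheta log p(tau; theta).
    Here tau is one action a ~ Bernoulli(theta) and R(a) = a."""
    total = 0.0
    for _ in range(n):
        a = 1 if random.random() < theta else 0
        # d/dtheta log p(a; theta): 1/theta if a = 1, -1/(1-theta) if a = 0
        score = 1.0 / theta if a == 1 else -1.0 / (1.0 - theta)
        total += a * score                     # R(tau) * grad log p(tau; theta)
    return total / n

grad_estimate = likelihood_ratio_gradient(0.3)  # should be close to 1.0
```

Nothing about the environment's inner workings was needed, only the ability to sample and to differentiate the log-probability of what we sampled.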
[00:40:05] outer expectation so that's one of the reasons why we do this I'm just writing
[00:40:08] reasons why this is I'm just writing this out more neatly here on the next
[00:40:09] this out more neatly here on the next slide this is called the likelihood
[00:40:13] slide this is called the likelihood ratio this term
[00:40:15] ratio this term here and so that's one of the benefits
[00:40:17] here and so that's one of the benefits to doing this is that we want to end up
[00:40:18] to doing this is that we want to end up with something that is computable we
[00:40:20] with something that is computable we want to be able to get this gradient
[00:40:22] want to be able to get this gradient with respect to the value function for
[00:40:23] with respect to the value function for the policy parameters and so this is
[00:40:25] the policy parameters and so this is going to give us something that we can
[00:40:27] going to give us something that we can app approximate with samples and we can
[00:40:29] app approximate with samples and we can compute okay all right now you still
[00:40:33] compute okay all right now you still might be a little bit concerned because
[00:40:35] might be a little bit concerned because all right maybe you think yeah I can
[00:40:38] all right maybe you think yeah I can maybe compute this by like writing
[00:40:40] maybe compute this by like writing things out in the environment but I'm
[00:40:41] things out in the environment but I'm still going to have to take this
[00:40:43] still going to have to take this derivative and how am I going to do that
[00:40:45] derivative and how am I going to do that and what does it end up depending on so
[00:40:48] and what does it end up depending on so let's do a Next
[00:40:50] let's do a Next Step so as I said what we're going to do
[00:40:52] Step so as I said what we're going to do here this is an expectation this is an
[00:40:55] here this is an expectation this is an expectation
[00:40:58] and we're going to approximate that
[00:41:00] and we're going to approximate that expectation with an empirical
[00:41:04] expectation with an imperical estimate okay so we're just going to
[00:41:05] estimate okay so we're just going to instead of actually taking you know all
[00:41:07] instead of actually taking you know all possible trajectories particularly in
[00:41:09] possible trajectories particularly in the case of vision um input you could
[00:41:11] the case of vision um input you could imagine that would be completely insane
[00:41:13] imagine that would be completely insane so we're just going to approximate it by
[00:41:14] so we're just going to approximate it by taking n
[00:41:16] taking n samples but we still have to handle this
[00:41:19] samples but we still have to handle this okay so that's what we're going to do
[00:41:20] okay so that's what we're going to do next so this first part should all seem
[00:41:23] next so this first part should all seem clear the second part should Le
[00:41:25] clear the second part should Le certainly for most of us would not be
[00:41:27] certainly for most of us would not be clear yet about how we do that second
[00:41:30] clear yet about how we do that second part okay so what do we do with that
[00:41:33] part okay so what do we do with that what we're going to do now is we're
[00:41:34] what we're going to do now is we're going to decompose that latter part into
[00:41:37] going to decompose that latter part into States and
[00:41:39] States and actions so remember that what this means
[00:41:42] actions so remember that what this means here is this is going to be a particular
[00:41:45] here is this is going to be a particular trajectory we get by following a policy
[00:41:47] trajectory we get by following a policy for teps or until the end of the episode
[00:41:51] for teps or until the end of the episode okay so let me just write remind
[00:41:52] okay so let me just write remind ourselves what T is going to look like
[00:41:54] ourselves what T is going to look like here
[00:42:00] so this is going to be like time step
[00:42:02] so this is going to be like time step here I'm using the subscript As Time
[00:42:05] here I'm using the subscript As Time step okay so let's just write out what a
[00:42:08] step okay so let's just write out what a trajectory is and what those
[00:42:09] trajectory is and what those probabilities
[00:42:13] are approximating the probability of the trajectory just to be one over n [student question] good question yes for this part we're assuming that for each of the trajectories we're using like a Monte Carlo estimate we're just weighting each sampled trajectory by 1 over n but you know if some trajectories are more likely than others they'll appear more often in that set of n it's a good question okay so let's now
[00:42:38] M it's good question okay so let's now try to express what the probability is
[00:42:39] try to express what the probability is of a
[00:42:40] of a trajectory okay so the probability of a
[00:42:44] trajectory okay so the probability of a trajectory we can write out as
[00:42:46] trajectory we can write out as follows so we're going to still have
[00:42:49] follows so we're going to still have that
[00:42:51] outside log we're going to do the
[00:42:54] outside log we're going to do the following okay going to say mu of s not
[00:42:57] following okay going to say mu of s not is equal to the probability of s okay
[00:43:00] is equal to the probability of s okay that's just like what is our probability
[00:43:02] that's just like what is our probability distribution over our starting State
[00:43:04] distribution over our starting State okay so that's me and then what we're
[00:43:06] okay so that's me and then what we're going to have is the
[00:43:08] going to have is the following t = 0 to T minus
[00:43:13] following t = 0 to T minus one we going to have our
[00:43:18] policy so this is going to say what is
[00:43:21] policy so this is going to say what is the probability that I pick the action I
[00:43:23] the probability that I pick the action I picked given the current state I'm
[00:43:26] picked given the current state I'm in times the
[00:43:29] in times the probability of St + 1
[00:43:33] probability of St + 1 given s0 to t a z to
[00:43:40] T so what I've done is I've just written
[00:43:43] T so what I've done is I've just written out what is happening in my trajectory
[00:43:46] out what is happening in my trajectory here is I start I have some distribution
[00:43:48] here is I start I have some distribution over C in this initial State under my
[00:43:51] over C in this initial State under my policy I have some probability of taking
[00:43:53] policy I have some probability of taking a z and that's here then um I'm going to
[00:43:57] a z and that's here then um I'm going to assume for a second that rewards are
[00:43:58] assume for a second that rewards are deterministic but you could add in a
[00:44:00] deterministic but you could add in a reward term here and then I'm going to
[00:44:02] reward term here and then I'm going to say well what's the chance that I get to
[00:44:04] say well what's the chance that I get to State S1 given my history given the
[00:44:07] State S1 given my history given the previous States and the
[00:44:09] previous States and the actions so I've just written this out as
[00:44:12] actions so I've just written this out as um a joint probability okay and now what
[00:44:16] um a joint probability okay and now what I can do is I can use the fact that log
[00:44:17] I can do is I can use the fact that log of a * B is equal to log of a plus log
[00:44:19] of a * B is equal to log of a plus log of B so I'm just going to decompose all
[00:44:21] of B so I'm just going to decompose all these terms okay so I'm not applying my
[00:44:23] these terms okay so I'm not applying my gradient yet I'm just going to have log
[00:44:27] gradient yet I'm just going to have log mu of
[00:44:28] mu of S Plus su/ t = 0 tus
[00:44:38] 1 t = t
[00:44:48] 1 let put the L
[00:44:59] sorry it's a bit messy I'll make sure to
[00:45:01] sorry it's a bit messy I'll make sure to add a clean
[00:45:04] add a clean version um so what I've done is I've
[00:45:07] version um so what I've done is I've just decomposed my log but now this is
[00:45:09] just decomposed my log but now this is really nice
[00:45:11] really nice because this term is not a function of
[00:45:14] because this term is not a function of theta this is just my initial starting
[00:45:16] theta this is just my initial starting State distribution it has nothing to do
[00:45:18] State distribution it has nothing to do with my policy so this drops out does
[00:45:21] with my policy so this drops out does this part depend on my
[00:45:23] this part depend on my policy yes does this part depend on my
[00:45:26] policy yes does this part depend on my policy
[00:45:27] no so when we take the derivative of it it disappears so that is beautiful because
[00:45:33] disappears so that is beautiful because now it means we don't have to know about
[00:45:34] now it means we don't have to know about our Dynamics model okay so the only term
[00:45:38] our Dynamics model okay so the only term that is still around after
[00:45:45] this is the policy term, the sum over t of the gradient of log pi_Theta(a_t | s_t)
[00:46:00] all right so this is great because now we
[00:46:03] right so this is great because now we don't depend on our Dynamics
[00:46:05] don't depend on our Dynamics model we have written down what this
[00:46:09] model we have written down what this term is as a function of um so we're
[00:46:14] term is as a function of um so we're just doing this term right now as just
[00:46:16] just doing this term right now as just the sum of the derivative of the log of
[00:46:18] the sum of the derivative of the log of the policy at that particular point so
[00:46:21] the policy at that particular point so we're sort of summing up for each of the
[00:46:23] we're sort of summing up for each of the different actions we took along the way
[00:46:26] different actions we took along the way what was the log of their probability
[00:46:27] what was the log of their probability and taking the derivative of that whole
[00:46:33] term all
[00:46:35] term all right so we don't need any Dynamics
[00:46:37] right so we don't need any Dynamics model which is
[00:46:39] model which is great and I'm just going to say here I'm
[00:46:43] great and I'm just going to say here I'm going to make sure that um something is
[00:46:44] going to make sure that um something is consistent here oh
[00:46:49] yeah with all the meth
[00:46:54] uhuh the PSS t+ one for the um part why
[00:47:00] uhuh the PSS t+ one for the um part why do we look at the entire history and not
[00:47:02] do we look at the entire history and not just the Past St in action great
[00:47:05] just the Past St in action great question so what I've written out as so
[00:47:08] question so what I've written out as so this question is a good one I wrote down
[00:47:10] this question is a good one I wrote down here the Dynamics in a really general
[00:47:12] here the Dynamics in a really general form I am writing them down and I'm not
[00:47:14] form I am writing them down and I'm not making the markup assumption we could
[00:47:16] making the markup assumption we could make the markup assumption but what I
[00:47:18] make the markup assumption but what I wanted to point out here is that you
[00:47:19] wanted to point out here is that you don't have to make the markup assumption
[00:47:21] don't have to make the markup assumption does not matter so because the Dynamics
[00:47:26] does not matter so because the Dynamics model are independent of your policy
[00:47:27] model are independent of your policy when you take the derivative they
[00:47:29] when you take the derivative they completely drop out whether they are
[00:47:31] completely drop out whether they are markof whether they are non-markov Etc
[00:47:33] markof whether they are non-markov Etc and so that's really nice it shows that
[00:47:35] and so that's really nice it shows that in this case it's not making the Markoff
[00:47:39] assumption now I did make the Markoff
[00:47:41] assumption now I did make the Markoff assumption somewhere I made it here
[00:47:43] assumption somewhere I made it here because I assume that um I made the
[00:47:44] because I assume that um I made the Markoff assumption in this sense I
[00:47:45] Markoff assumption in this sense I assume my policy was Markoff my policy
[00:47:48] assume my policy was Markoff my policy is only depending on the current state
[00:47:50] is only depending on the current state um but your policy also could depend on
[00:47:52] um but your policy also could depend on a history of States you know you could
[00:47:54] a history of States you know you could have like a current neur uh neural
[00:47:56] have like a current neur uh neural network work or any of the other
[00:47:58] network work or any of the other representations you might want to choose
[00:48:00] representations you might want to choose there um and you would still then this
[00:48:02] there um and you would still then this would just depend on your
[00:48:05] history good question all right so I
[00:48:08] All right, so I just want to make sure that I wrote it down neatly in the most general form — that's why I'm skipping those steps right now. One thing to note in terms of notation is that people often call this quantity a score function: this derivative of the log of the policy with respect to its parameters is often called the score function.
[00:48:30] Okay. In general, the nice thing is that it's usually not very hard to compute the score function: if you have a differentiable policy, we can compute it pretty easily in many cases. Let me just make this a bit smaller.
[00:48:47] Okay, so let's see what that might look like for a couple of different policy classes. One pretty popular thing to do is a softmax policy. The idea is to take a linear combination of features, φ(s,a) · θ, and say the probability of your action is proportional to the exponentiated weight: you take the exponent of that dot product between the features and the parameters, and then you normalize it. That generally gives you a stochastic policy; you can also put a temperature parameter in there if you want.
[00:49:26] The nice thing about this is that we can take the derivative of it very easily. Let's just do that quickly here, to illustrate that it is often very feasible to take the derivative with respect to the policy parameterization. This is just the derivative of the log of e^(φ(s,a)·θ) over the normalizing sum, which we can rewrite, and it's going to be equal to φ(s,a) minus the derivative of the log of the normalizer.
[00:51:06] So what I've done here is take the derivative of this function with respect to θ, and then notice that this term here is exactly equal to π_θ(a|s) — it's like I'm getting a weighting over the features. (It's written more neatly on the next slide.)
[00:51:28] Okay, so the score function for the softmax policy is just equal to the features φ(s,a) minus the expected value, under the policy, of the features: ∇_θ log π_θ(a|s) = φ(s,a) − Σ_{a'} π_θ(a'|s) φ(s,a').
[00:51:46] Student: [inaudible question about what the features φ could be]
Oh, great question. For example, if you have a large neural network doing some representation, it could be the last layer — the second-to-last layer's output — and then you just do a linear dot product on top of that. Or, in a case like customers, it could be a whole bunch of different hand-designed features, and then you have different weights there. All right.
[00:52:13] So this is also possible to do for other functional forms. For Gaussians — which we often want for continuous action spaces, really useful in robotics where you might have continuous torques or continuous accelerations, etc. — you can think of there being a mean which is a linear combination of some state features; your variance might be fixed or could also be parameterized; and then your policy is a Gaussian, so you're sampling a particular action depending on your state, with some variance. And again you can directly compute what the score function would be, in closed form.
[00:52:48] But in general you're probably going to be using this with deep neural networks, and then you can just use automatic differentiation to do this. This is just to illustrate that there are a number of different functional forms where you can compute this analytically.
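The Gaussian case can be checked the same way. A minimal sketch, assuming a fixed standard deviation σ and a mean that is the linear combination φ(s)·θ (all numbers made up): the score works out to (a − φ(s)·θ) φ(s) / σ².

```python
import numpy as np

def gaussian_score(phi_s, theta, a, sigma):
    """Score for a Gaussian policy N(phi(s).theta, sigma^2):
    d/dtheta log pi(a|s) = (a - phi(s).theta) * phi(s) / sigma^2."""
    return (a - phi_s @ theta) * phi_s / sigma**2

def log_prob(phi_s, theta, a, sigma):
    mu = phi_s @ theta
    return -0.5 * np.log(2 * np.pi * sigma**2) - (a - mu) ** 2 / (2 * sigma**2)

# sanity check against a finite-difference gradient
rng = np.random.default_rng(1)
phi_s = rng.normal(size=4)        # made-up state features
theta = rng.normal(size=4)
a, sigma = 0.7, 0.5               # a sampled action and a fixed std
analytic = gaussian_score(phi_s, theta, a, sigma)
eps = 1e-6
numeric = np.array([
    (log_prob(phi_s, theta + eps * e, a, sigma)
     - log_prob(phi_s, theta - eps * e, a, sigma)) / (2 * eps)
    for e in np.eye(4)
])
assert np.allclose(analytic, numeric, atol=1e-5)
```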
[00:53:02] Okay, all right. So just to recap what we've shown so far: we can have policy methods where we have a direct parameterization of the policy; we can write down the value function as a weighted sum, over the trajectories generated by that policy, of their rewards; and it turns out that when we want to take the derivative of that, we can re-express it so that we don't need the dynamics model — we're just weighting these score functions.
[00:53:27] So now let's just do a small check-your-understanding about likelihood-ratio / score-function policy gradients. What I'd like you to do is answer: (a) does it require that your reward function is differentiable? (b) can you only use it with Markov decision processes? (c) is it mostly useful for infinite-horizon tasks? Then: a and b; a, b, and c; none of the above; or not sure. Let's just take a second to do that.
[00:54:00] All right, we have a good split at the end — nobody is "not sure" — but there is a lot of spread, so why don't you talk to your neighbor and see if we can come to more consensus.
[00:54:28] All right, sorry to interrupt some good discussions, but I want to make sure we get through REINFORCE today. So this is a little bit of a tricky one — in fact, when I gave it to one of my TAs I forgot to put in the "none of the above" option, and he was like, "wait, what?" So: it's none of the above.
[00:54:44] The first one is part of the genuinely elegant aspect of policy gradients. As you can see here, you need the policy to be differentiable, but the reward function does not have to be — the reward function is not a function of the policy in the way that we've written it here. That's pretty elegant, and it has motivated people in a really wide range of areas, where you might have very complicated reward functions, to be interested in what we're going to see soon — REINFORCE — which is based on this idea, because you just need the policy parameterization to be differentiable. So that's really cool.
[00:55:18] (b) It doesn't have to be Markov, because as we saw, the dynamics model drops out — it doesn't appear at all — so it doesn't need to be Markov, and you don't need differentiability there either.
[00:55:30] (c) We are assuming that it's finite horizon — episodic — so that we can actually get M episodes, more than one; if it were infinite horizon we'd only get M = 1. So all three of these are false. Let me just make sure I circle that.
[00:55:50] Okay, just to give brief intuitions — because I want to make sure we get to REINFORCE — you can think of a generic way of writing this down: we have some function f times the derivative of the log of some probability function. You can think of the first part as measuring how good a sample is, and the idea is that with the derivative, you're trying to move up the log probability of samples that have high reward, because you generally want policies that visit parts of the state-action space where you get high reward. That's the intuition, and the nice thing is that f doesn't have to be differentiable — it could be discontinuous, it could even be unknown, as long as you can get samples from it. So it's extremely flexible as to what that reward or objective function is.
[00:56:36] that reward or objective function so I put a couple slides here
[00:56:40] function so I put a couple slides here um I believe remember if it was John
[00:56:43] um I believe remember if it was John Schulman who originally had these ones
[00:56:45] Schulman who originally had these ones um I put some credits at the front but
[00:56:47] um I put some credits at the front but you can think of sort of com taking a
[00:56:48] you can think of sort of com taking a combination between what the probability
[00:56:50] combination between what the probability is of your input of your X as well as
[00:56:52] is of your input of your X as well as your function so in our case that's
[00:56:53] your function so in our case that's going to be the reward function this is
[00:56:56] going to be the reward function this is is generally going to be the reward
[00:56:57] is generally going to be the reward function over trajectories and this is
[00:56:59] function over trajectories and this is going to be our
[00:57:01] going to be our policy it gives us probabilities of the
[00:57:04] policy it gives us probabilities of the trajectories and so you can think of
[00:57:05] trajectories and so you can think of combining between these two to actually
[00:57:08] combining between these two to actually change your parameter
[00:57:11] change your parameter space just to give a little bit of
[00:57:13] space just to give a little bit of intuition over what the what this sort
[00:57:15] intuition over what the what this sort of um gradient estimation is
[00:57:18] of um gradient estimation is doing
[00:57:20] Okay. In general we can also write down a policy gradient theorem, which says that whether we use something like episodic reward, or average reward per time step, or average value, in all of these cases we end up writing something that looks really similar to the equation I showed you before: the derivative with respect to these value functions (or something like a value function) looks like the expected value, over the trajectories you're going to get, of this score function — the derivative of the log of the policy — times the Q function, or the return, for that particular state-action pair under the policy.
[00:58:09] There's a nice derivation of this in Sutton and Barto. At a high level, I think the useful thing to know here is just that we can extend it beyond this sample-of-return view: we can think of these as being Q functions.
[00:58:22] can think of they being Q functions all right now what I've shown you so far uh
[00:58:25] right now what I've shown you so far uh is something that is correct and we can
[00:58:27] is something that is correct and we can turn it into an algorithm but it does
[00:58:29] turn it into an algorithm but it does not leverage much of the temporal
[00:58:31] not leverage much of the temporal structure so what do I mean by that so
[00:58:34] structure so what do I mean by that so what we've written down here is a valid
[00:58:35] what we've written down here is a valid gradient it's unbiased um but it can be
[00:58:38] gradient it's unbiased um but it can be very noisy so we're estimating this by
[00:58:41] very noisy so we're estimating this by Monte Carlo method because we have these
[00:58:43] Monte Carlo method because we have these M samples and as we know from Monte
[00:58:45] M samples and as we know from Monte Carlo methods before they are unbiased
[00:58:47] Carlo methods before they are unbiased but they can be very high variant and so
[00:58:50] but they can be very high variant and so some of the ways to make this more
[00:58:51] some of the ways to make this more practical and what I mean by that is the
[00:58:53] practical and what I mean by that is the better estimate of the gradient um and
[00:58:55] better estimate of the gradient um and hopefully with less data because
[00:58:57] hopefully with less data because ultimately we're going to have to be
[00:58:59] ultimately we're going to have to be using this information to update our
[00:59:00] using this information to update our weights to try to get to a good policy
[00:59:03] weights to try to get to a good policy so we want this to be data efficient is
[00:59:05] so we want this to be data efficient is we can try to leverage the temporal
[00:59:06] we can try to leverage the temporal structure and we can also include
[00:59:08] structure and we can also include baselines right so let's first see the
[00:59:11] What we've done before is sum up all the rewards from a whole trajectory and multiply that by the sum of the score functions for the whole trajectory. We can instead think about the gradient estimator for a single reward term: for one time step, we have that single reward times the score functions — not for the remaining time steps, but for the time steps up to that point.
[00:59:46] Okay, so it's like we just think of the partial trajectory until we got that reward. For the reward we got at that time point, instead of this whole sum, we just think of the trajectory we got up to that point and all of its score functions. Does that make sense? Any questions about that part?
[01:00:14] Okay, so that's for a single time step t'. Now we can sum this over all time steps: instead of the sum over all rewards times the full score sum, we know that for one time step it equals the expected value of the reward at that step times the score functions up to that point, so let's just rewrite it like that, now summing over the rewards we got at all time steps.
[01:00:47] All right, so now we can do a slight rearrangement. Notice that for each of the points — think of t = 0, 1, 2, 3 — we have these score functions at each time point, and the score function at time step zero is going to appear with r0, r1, r2, r3, and so on. That's what I've done here: this first term appears together with the rewards of all of the subsequent time points, because that decision happened, and then we got a reward, and then we got a whole bunch of rewards later. The decision on the first time step can affect the reward you get at time step one and all the rewards after that. So this score function — the one at time step one — can influence time step one all the way out to the end of the episode; the one at time step two can influence time step two all the way out to the end of the episode.
[01:02:06] So essentially, this is saying that my reward at time step three cannot be impacted by decisions I make at time step four — time only flows one way. If we think about what those score functions were, and about the trajectories that were generated, there's a temporal structure, and it means that if we change the policy parameters such that decisions in the future change, that can't affect rewards at earlier time steps. So this is leveraging the temporal structure.
[01:02:36] This just allows us to rewrite the equation so that now we have, for each of the different score functions, exactly which rewards it influences. The reason this is important is that before, you could see we were multiplying each of the score functions by all of the rewards; now we're only going to multiply them by the rewards they influence, and in general that's going to be far fewer than the full set of rewards. So this is going to reduce the variance of our estimator without causing any bias, just by leveraging the fact that decisions in the future can't affect your rewards in the past.
[01:03:17] past all right so that is one of the
[01:03:20] past all right so that is one of the first things that we're going to um do
[01:03:22] first things that we're going to um do in this
[01:03:23] in this case so we're going to write um so
[01:03:26] case so we're going to write um so remember in this case that if we sum up
[01:03:28] remember in this case that if we sum up all the rewards from the current time
[01:03:30] all the rewards from the current time step to the end we've just called that
[01:03:31] step to the end we've just called that the return we've seen that before from
[01:03:33] the return we've seen that before from Monte
[01:03:34] Monte Carlo so we can just rewrite this
[01:03:36] Carlo so we can just rewrite this expression like
[01:03:38] expression like that and that gives us the REINFORCE
[01:03:41] that and that gives us the REINFORCE algorithm so this is the REINFORCE
[01:03:43] algorithm so this is the REINFORCE algorithm that has been incredibly
[01:03:44] algorithm that has been incredibly influential um in NLP and Robotics and
[01:03:47] influential um in NLP and Robotics and many many areas okay and so what this
[01:03:50] many many areas okay and so what this says here is that the way we change our
[01:03:53] says here is that the way we change our parameter is just our learning rate
[01:03:55] parameter is just our learning rate times our score function times the
[01:03:58] times our score function times the return we got from that time step till
[01:04:00] return we got from that time step till the end of the episode so we still have
[01:04:02] the end of the episode so we still have to wait till the end of the episode to
[01:04:04] to wait till the end of the episode to update anything but what happens is we
[01:04:06] update anything but what happens is we run a full episode with our current
[01:04:07] run a full episode with our current policy and then for each time step we
[01:04:10] policy and then for each time step we slightly change our policy parameters by
[01:04:12] slightly change our policy parameters by using a learning rate the score function
[01:04:14] using a learning rate the score function for that time step times the return we
[01:04:17] for that time step times the return we got from that time step till the end of
[01:04:18] got from that time step till the end of the
[01:04:19] the episode and then we just step through
[01:04:21] episode and then we just step through that for the whole episode and that's
[01:04:23] that for the whole episode and that's given us T different updates to our
[01:04:25] given us T different updates to our policy
[01:04:27] policy parameterization and then we just repeat
[01:04:29] parameterization and then we just repeat over and over and over again and what
[01:04:32] over and over and over again and what that guarantees to us is that eventually
[01:04:34] that guarantees to us is that eventually we will land in a local Optima of um the
[01:04:37] we will land in a local Optima of um the value function for the policy
[01:04:43] parameterization so this is called Monte
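The loop being described — run an episode with the current policy, then for each time step move theta by the learning rate times the score function times the return from that step to the end — can be sketched as follows. This is a generic illustration, not the lecture's code; `grad_log_pi` and the episode format are hypothetical stand-ins for a concrete policy class and environment:

```python
def reinforce_update(theta, episode, grad_log_pi, alpha=0.01):
    """One Monte Carlo policy gradient (REINFORCE) pass over a finished episode.

    theta:        list of policy parameters
    episode:      list of (state, action, reward) tuples from one rollout
    grad_log_pi:  (theta, state, action) -> list of partials (the score function)
    """
    T = len(episode)
    # Return from each time step to the end of the episode (undiscounted here).
    returns, g = [0.0] * T, 0.0
    for t in reversed(range(T)):
        g += episode[t][2]
        returns[t] = g
    # theta <- theta + alpha * score(t) * G_t : one update per time step,
    # so a length-T episode yields T updates to the parameterization.
    for t, (s, a, _) in enumerate(episode):
        score = grad_log_pi(theta, s, a)
        theta = [th + alpha * d * returns[t] for th, d in zip(theta, score)]
    return theta
```

Repeating this over many episodes is what converges to a local optimum of the value of the policy parameterization.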
[01:04:45] parameterization so this is called Monte Carlo policy gradient or known as the
[01:04:46] Carlo policy gradient or known as the REINFORCE algorithm I believe this was in roughly
[01:04:50] REINFORCE algorithm I believe this was in roughly 1992 it's about 30 years ago there's
[01:04:53] 1992 it's about 30 years ago there's been many many policy gradient
[01:04:56] been many many policy gradient algorithms built on this
[01:04:59] idea okay now when you're looking at
[01:05:02] idea okay now when you're looking at this you might still be concerned that
[01:05:04] this you might still be concerned that if from remembering back from the Monte
[01:05:06] if from remembering back from the Monte Carlo methods we've covered that this
[01:05:09] Carlo methods we've covered that this estimate G can often be pretty high
[01:05:11] estimate G can often be pretty high variance so in general if you're just
[01:05:14] variance so in general if you're just directly kind of averaging over sample
[01:05:16] directly kind of averaging over sample returns that might be high
[01:05:17] returns that might be high variance so one of the next fixes we can
[01:05:20] variance so one of the next fixes we can do and we'll get to this more on um
[01:05:22] do and we'll get to this more on um Wednesday is to introduce a Baseline
[01:05:27] and see I'll just say so the the goals
[01:05:30] and see I'll just say so the the goals here is that we're going to hopefully
[01:05:31] here is that we're going to hopefully try to converge as quickly as possible
[01:05:32] try to converge as quickly as possible to local Optima so we want to reduce the
[01:05:36] to local Optima so we want to reduce the um the variance over our gradient
[01:05:38] um the variance over our gradient estimate and so the Baseline is going to
[01:05:41] estimate and so the Baseline is going to allow us to hopefully reduce well in
[01:05:43] allow us to hopefully reduce well in general yes reduce the the variance over
[01:05:45] general yes reduce the the variance over this estimation
[01:05:47] this estimation process and kind of the we'll see two
[01:05:51] process and kind of the we'll see two ideas next which is introducing a
[01:05:54] ideas next which is introducing a Baseline and and then thinking about an
[01:05:56] Baseline and and then thinking about an alternative to the Monte Carlo
[01:05:58] alternative to the Monte Carlo returns so those are the ideas that
[01:06:01] returns so those are the ideas that we're going to go through next I guess
[01:06:02] we're going to go through next I guess I'll just do one more thing and we'll we
[01:06:05] I'll just do one more thing and we'll we we'll go through the proof of it next
[01:06:06] we'll go through the proof of it next time so I'll just introduce the concept
[01:06:07] time so I'll just introduce the concept of Baseline and then we'll we'll prove
[01:06:09] of Baseline and then we'll we'll prove it next time so the idea in this case is
[01:06:11] it next time so the idea in this case is that we're just going to subtract
[01:06:12] that we're just going to subtract something off and we're going to
[01:06:14] something off and we're going to subtract something off that only depends
[01:06:16] subtract something off that only depends on the
[01:06:17] on the state this
[01:06:19] state this only depends on the
[01:06:24] only depends on the state okay this is not a function of
[01:06:26] state okay this is not a function of your policy only depends on the state
[01:06:29] your policy only depends on the state and it will turn out and we'll prove
[01:06:31] and it will turn out and we'll prove this next time it's pretty elegant that
[01:06:34] this next time it's pretty elegant that for any choice of something that only
[01:06:35] for any choice of something that only depends on your state the gradient
[01:06:37] depends on your state the gradient estimator is still
[01:06:39] estimator is still unbiased so you could subtract off
[01:06:41] unbiased so you could subtract off anything there that is only a function
[01:06:43] anything there that is only a function of your state and you didn't change the
[01:06:45] of your state and you didn't change the bias of your estimator which is kind of
[01:06:48] bias of your estimator which is kind of wild um and we'll prove that next
[01:06:51] wild um and we'll prove that next time and but the goal is that we can
[01:06:53] time and but the goal is that we can hopefully reduce the variance of our
[01:06:55] hopefully reduce the variance of our estimated gradient by subtracting off
[01:06:57] estimated gradient by subtracting off the right thing okay and just
[01:07:00] the right thing okay and just intuitively the way to think about the
[01:07:02] intuitively the way to think about the Baseline is that you don't necessarily just
[01:07:05] Baseline is that you don't necessarily just care about whether or not the gradient
[01:07:07] care about whether or not the gradient is positive or negative and whether
[01:07:09] is positive or negative and whether returns were um good or bad you might
[01:07:12] returns were um good or bad you might care about like well how much better or
[01:07:14] care about like well how much better or worse are these returns compared to
[01:07:16] worse are these returns compared to something else I could have done like I
[01:07:18] something else I could have done like I want to know whether this policy a is
[01:07:19] want to know whether this policy a is better than policy B and maybe both of
[01:07:21] better than policy B and maybe both of them give you positive returns one of
[01:07:23] them give you positive returns one of them gives you 100 and one of them gives
[01:07:24] them gives you 100 and one of them gives you 90
[01:07:26] you 90 but you'd really like the one with 100
[01:07:27] but you'd really like the one with 100 so you'd really like to move your policy
[01:07:29] so you'd really like to move your policy parameters in the direction of stuff
[01:07:31] parameters in the direction of stuff that is better than other Alternatives
[01:07:34] that is better than other Alternatives and that's kind of the idea of a
[01:07:35] and that's kind of the idea of a baseline is to say like well maybe I
[01:07:37] baseline is to say like well maybe I know that I could probably always get
[01:07:38] know that I could probably always get like 90 for this particular State how
[01:07:41] like 90 for this particular State how much better is this policy for this
[01:07:42] much better is this policy for this state compared to something I could do
[01:07:43] state compared to something I could do on
[01:07:44] on average and so we're going to sort of
[01:07:46] average and so we're going to sort of intuitively increase the log probability
[01:07:48] intuitively increase the log probability of an action proportionally to how much
[01:07:50] of an action proportionally to how much its returns were better than expected
[01:07:53] its returns were better than expected where kind of the Baseline is giving you
[01:07:55] where kind of the Baseline is giving you that expected
[01:07:56] that expected value and we'll see formally on
[01:07:59] value and we'll see formally on Wednesday how that by doing this with
[01:08:01] Wednesday how that by doing this with the Baseline it doesn't introduce any
[01:08:02] the Baseline it doesn't introduce any bias so this is going to be one of the
[01:08:04] bias so this is going to be one of the ways that we're going to get better uh
[01:08:06] ways that we're going to get better uh better gradients the other thing that
[01:08:08] better gradients the other thing that we're going to do on Wednesday is we're
[01:08:09] we're going to do on Wednesday is we're at least going to start talking about PPO
[01:08:11] at least going to start talking about PPO which is part of
[01:08:13] which is part of your homework two bless you um which is
[01:08:16] your homework two bless you um which is going to involve more ways to kind of be
[01:08:18] going to involve more ways to kind of be more efficient and effective in the
[01:08:19] more efficient and effective in the policy that we do I'll see you then
[01:08:21] policy that we do I'll see you then thanks
Lecture 006
Stanford CS234 Reinforcement Learning I Policy Search 2 I 2024 I Lecture 6
Source: https://www.youtube.com/watch?v=8PwvNQ5WS-o
---
Transcript
[00:00:05] hi everybody welcome back um we're going
[00:00:07] hi everybody welcome back um we're going to be talking more about policy gradient
[00:00:09] to be talking more about policy gradient methods today and we're going to start
[00:00:10] methods today and we're going to start off with a quick refresh your
[00:00:28] understanding for
[00:01:02] all right let's go ahead and go through
[00:01:03] all right let's go ahead and go through these so everybody said the last thing
[00:01:06] these so everybody said the last thing was false which is correct they're not
[00:01:08] was false which is correct they're not guaranteed to converge they're not
[00:01:10] guaranteed to converge they're not guaranteed to converge to a global
[00:01:11] guaranteed to converge to a global Optima they're just guaranteed to
[00:01:12] Optima they're just guaranteed to converge to a local Optima of the policy
[00:01:14] converge to a local Optima of the policy gradient space um the first one is
[00:01:21] true there are different ways to write
[00:01:23] true there are different ways to write this down um but in general what
[00:01:25] this down um but in general what we're doing is we're going to be trying
[00:01:26] we're doing is we're going to be trying to take steps in the policy
[00:01:29] to take steps in the policy parameterization space we're
[00:01:30] parameterization space we're parameterizing our policies by Theta um
[00:01:33] parameterizing our policies by Theta um so that we're going to be trying to move
[00:01:36] so that we're going to be trying to move in the direction of the log of the
[00:01:39] in the direction of the log of the policy parameters times um their value
[00:01:43] policy parameters times um their value the return you get from them um the
[00:01:46] the return you get from them um the second one is false there's a bit of
[00:01:49] second one is false there's a bit of disagreement over this so Theta because
[00:01:52] disagreement over this so Theta because you can see from this first derivative
[00:01:54] you can see from this first derivative we are going to look at the direction of
[00:01:57] we are going to look at the direction of um the derivative with respect to Theta
[00:01:59] um the derivative with respect to Theta of the log of the policy parameters but
[00:02:01] of the log of the policy parameters but it's weighted by the return or weighted by
[00:02:04] it's weighted by the return or weighted by the Q function so whether we push it up
[00:02:07] the Q function so whether we push it up or not will depend um whether or not
[00:02:09] or not will depend um whether or not we're getting high rewards when we go in
[00:02:11] we're getting high rewards when we go in that direction so this one is
[00:02:14] that direction so this one is false and this one is also
[00:02:18] false and this one is also true that in general what we're trying
[00:02:20] true that in general what we're trying to do is we're trying to find parts of
[00:02:22] to do is we're trying to find parts of the policy space such that when we
[00:02:26] the policy space such that when we follow that policy we visit states and
[00:02:27] follow that policy we visit states and actions which have um higher estimated Q
[00:02:31] actions which have um higher estimated Q function or higher estimated
[00:02:33] function or higher estimated rewards do you have a
[00:02:35] rewards do you have a question stretching
[00:02:39] okay
[00:02:42] okay great all
[00:02:46] right okay so last time we started
[00:02:49] right okay so last time we started talking about policy search which was
[00:02:50] talking about policy search which was this idea of saying we're going to be
[00:02:52] this idea of saying we're going to be directly trying to search in the policy
[00:02:54] directly trying to search in the policy parameterized by some Theta this could
[00:02:57] parameterized by some Theta this could be a gaussian policy class this could be
[00:02:59] be a gaussian policy class this could be softmax or this could be as it will
[00:03:01] softmax or this could be as it will often be a deep neural network and what
[00:03:04] often be a deep neural network and what we're going to talk about today is we're
[00:03:05] we're going to talk about today is we're going to um
[00:03:07] going to um finish off that part and then talk about
[00:03:09] finish off that part and then talk about more advanced policy gradient methods
[00:03:11] more advanced policy gradient methods and in particular today we're going to
[00:03:13] and in particular today we're going to cover at least the majority of PPO so
[00:03:16] cover at least the majority of PPO so this should be enough for you to be
[00:03:17] this should be enough for you to be making significant progress on homework
[00:03:22] two so in particular what we're going to
[00:03:24] two so in particular what we're going to be covering is we've talked last time a
[00:03:26] be covering is we've talked last time a lot about likelihood Ratio or score
[00:03:28] lot about likelihood Ratio or score function policy gradients we're going to
[00:03:30] function policy gradients we're going to talk more about the notion of a Baseline
[00:03:32] talk more about the notion of a Baseline and why introducing that um is not going
[00:03:34] and why introducing that um is not going to incur any bias in the estimate of our
[00:03:37] to incur any bias in the estimate of our gradient we'll talk about alternative
[00:03:39] gradient we'll talk about alternative Target uh targets and then we're going
[00:03:41] Target uh targets and then we're going to talk about PPO and again just to
[00:03:43] to talk about PPO and again just to remind ourselves PPO is what they used in
[00:03:45] remind ourselves PPO is what they used in ChatGPT and a huge number of other
[00:03:47] ChatGPT and a huge number of other application areas as well so it's a
[00:03:48] application areas as well so it's a really really useful
[00:03:50] really really useful technique all right so let's just remind
[00:03:53] technique all right so let's just remind ourselves we talked about how we can
[00:03:55] ourselves we talked about how we can take a derivative with respect to the
[00:03:56] take a derivative with respect to the value of a particular policy so
[00:04:00] value of a particular policy so this was the policy
[00:04:03] parameters and we showed that it could
[00:04:06] parameters and we showed that it could look like
[00:04:08] look like this and this was an unbiased estimate
[00:04:11] this and this was an unbiased estimate of the gradient but it could be very
[00:04:13] of the gradient but it could be very noisy um in part because it looks
[00:04:15] noisy um in part because it looks something like our sort of Monte Carlo
[00:04:17] something like our sort of Monte Carlo estimates that we saw before because
[00:04:19] estimates that we saw before because we're looking at these returns um and so
[00:04:22] we're looking at these returns um and so we talked about a couple different fixes
[00:04:24] we talked about a couple different fixes or we started to talk about fixes to
[00:04:25] or we started to talk about fixes to make it tractable so one was to leverage
[00:04:27] make it tractable so one was to leverage the temporal structure meaning that your
[00:04:29] the temporal structure meaning that your reward on time step three can't depend
[00:04:32] reward on time step three can't depend on your actions after time step three
[00:04:35] on your actions after time step three and so we could use that to reduce the
[00:04:36] and so we could use that to reduce the variance of our
[00:04:37] variance of our estimator and now the next thing we're
[00:04:39] estimator and now the next thing we're going to talk about we started talking
[00:04:40] going to talk about we started talking about this last time is
[00:04:43] Baseline so as we talk about this I
[00:04:46] Baseline so as we talk about this I think it's just useful to keep in mind
[00:04:47] think it's just useful to keep in mind throughout this that we're always trying
[00:04:49] throughout this that we're always trying to sort of converge as quickly as
[00:04:50] to sort of converge as quickly as possible so we want these estimates to
[00:04:52] possible so we want these estimates to be of the gradient to be as low variance
[00:04:55] be of the gradient to be as low variance as possible so we can try to be taking
[00:04:57] as possible so we can try to be taking better steps in our uh policy space
[00:05:03] all right so let's look at the Baseline
[00:05:05] all right so let's look at the Baseline started talking about this before and we
[00:05:06] started talking about this before and we said well when we are thinking about how
[00:05:09] said well when we are thinking about how to move our policy inside of the policy
[00:05:12] to move our policy inside of the policy space we want to think about not just
[00:05:14] space we want to think about not just what how much reward we're getting but
[00:05:16] what how much reward we're getting but really maybe how much reward we're
[00:05:17] really maybe how much reward we're getting relative to other things we
[00:05:19] getting relative to other things we could be doing so we want to know how
[00:05:20] could be doing so we want to know how much better this policy is compared to
[00:05:22] much better this policy is compared to other stuff and I said you could
[00:05:24] other stuff and I said you could introduce this Baseline B of St um which
[00:05:29] introduce this Baseline B of St um which was only a function of the state
[00:05:33] was only a function of the state so note
[00:05:35] so note not a function of
[00:05:37] not a function of a or of theta
[00:05:41] a or of theta now there's been other work including
[00:05:43] a now there's been other work including from my lab thinking about whether we
[00:05:45] from my lab thinking about whether we can introduce baselines that may be a
[00:05:47] can introduce baselines that may be a function of something beyond the state
[00:05:48] function of something beyond the state but for today we're just going to assume
[00:05:50] but for today we're just going to assume that it's only a function of the state
[00:05:52] that it's only a function of the state and what we're going to prove now is
[00:05:53] and what we're going to prove now is that for any choice of the Baseline as
[00:05:56] that for any choice of the Baseline as long as it's only a function of the
[00:05:57] long as it's only a function of the state this gradient estimator is
[00:05:59] state this gradient estimator is unbiased
[00:06:00] unbiased which means that we could introduce this
[00:06:02] which means that we could introduce this here and we're not changing on average
[00:06:05] here and we're not changing on average um what the the gradient estimator is
[00:06:08] um what the the gradient estimator is and a near optimal choice is going to be
[00:06:10] and a near optimal choice is going to be the expected
[00:06:12] the expected return okay so now we're going to again
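Since a near-optimal baseline is the expected return from the state, one concrete toy implementation is a tabular running average of the Monte Carlo returns observed from each state. This sketch is mine, not the lecture's; in practice (and later in the course) the baseline is usually a learned state-value estimate instead:

```python
class RunningBaseline:
    """Tracks b(s) as the running mean of Monte Carlo returns seen from s.

    Note b(s) depends only on the state, never on the action or on theta,
    which is the condition needed for the unbiasedness result.
    """
    def __init__(self):
        self.total = {}   # state -> sum of observed returns
        self.count = {}   # state -> number of returns seen

    def update(self, state, ret):
        self.total[state] = self.total.get(state, 0.0) + ret
        self.count[state] = self.count.get(state, 0) + 1

    def value(self, state):
        # b(s) = average return from s; 0 before any data (an arbitrary choice)
        n = self.count.get(state, 0)
        return self.total[state] / n if n else 0.0
```

Each score function is then weighted by the centered return G_t - b(s_t) rather than G_t itself.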
[00:06:14] return okay so now we're going to again just be trying to think about how much
[00:06:15] just be trying to think about how much better is um taking the actions under
[00:06:18] better is um taking the actions under this current policy compared to other
[00:06:20] this current policy compared to other things we could
[00:06:21] things we could do so let's see we're going to step
[00:06:24] do so let's see we're going to step through why adding a baseline does not
[00:06:27] through why adding a baseline does not incur any bias in our estimate of the gradient
[00:06:31] so what we're going to do to do this is
[00:06:33] so what we're going to do to do this is we're going to think about how our
[00:06:34] we're going to think about how our gradient comes
[00:06:35] gradient comes together put
[00:06:38] together put this good okay so we can remember to
[00:06:42] this good okay so we can remember to here's our trajectories and then this is
[00:06:45] here's our trajectories and then this is our gradient and what we want to so the
[00:06:48] our gradient and what we want to so the goal is to
[00:06:51] goal is to show this is equal to
[00:06:53] show this is equal to zero why is that because if we think
[00:06:56] zero why is that because if we think about what this term was this first term
[00:07:00] about what this term was this first term was an estimate unbiased estimate of the
[00:07:02] was an estimate unbiased estimate of the gradient and now we've subtracted off
[00:07:04] gradient and now we've subtracted off this term times this term and we want to
[00:07:06] this term times this term and we want to show that in expectation subtracting off
[00:07:09] show that in expectation subtracting off that term is zero which means that we
[00:07:12] that term is zero which means that we didn't introduce any
[00:07:13] didn't introduce any bias okay so let's just step through how
[00:07:15] bias okay so let's just step through how that works so the goal is to show this
[00:07:16] that works so the goal is to show this is zero and we're just going to step
[00:07:18] is zero and we're just going to step through this so our expectation is over
[00:07:20] through this so our expectation is over taus our trajectories so let's just
[00:07:22] taus our trajectories so let's just write it out you can write it out as the
[00:07:25] write it out you can write it out as the states we've seen up to a Time step
[00:07:28] states we've seen up to a Time step t plus the states that we see from that
[00:07:31] t plus the states that we see from that time step onwards okay because that's
[00:07:33] time step onwards okay because that's like we're just writing out our full
[00:07:35] like we're just writing out our full trajectory so we can just think of our
[00:07:36] trajectory so we can just think of our trajectory here as being you know s0 to
[00:07:40] trajectory here as being you know s0 to T and a0 to T minus 1 and that's just the
[00:07:45] T and a0 to T minus 1 and that's just the full trajectory so that's tau and so
[00:07:48] full trajectory so that's TOA and so we're just going to decompose this
[00:07:49] we're just going to decompose this expectation okay so we break this up and
[00:07:52] expectation okay so we break this up and after we've done that we can
[00:07:56] after we've done that we can notice I just make that
[00:08:00] notice I just make that we can notice that we can pull one of
[00:08:01] we can notice that we can pull one of the terms out
[00:08:06] okay I'm going to pull this out because
[00:08:08] okay I'm going to pull this out because this is not a function of this future
[00:08:20] expectation okay and I could pull that
[00:08:22] expectation okay and I could pull that out there because this is just a
[00:08:24] out there because this is just a function of the current state and it
[00:08:25] function of the current state and it doesn't depend on the future States and
[00:08:27] doesn't depend on the future States and the future actions that I take all right
[00:08:30] the future actions that I take all right so that's what I did and then next I'm
[00:08:31] so that's what I did and then next I'm going to notice that well this term here
[00:08:33] going to notice that well this term here is only a function of the state and the
[00:08:35] is only a function of the state and the action it's not again a function of all
[00:08:37] action it's not again a function of all the future actions and future States so
[00:08:39] the future actions and future States so we can just rewrite that
[00:08:41] we can just rewrite that as
[00:08:44] as expectation
[00:08:46] expectation over the action t
[00:09:02] okay all right so what are we going to
[00:09:04] okay all right so what are we going to do next the next thing we're going to do
[00:09:06] do next the next thing we're going to do is I'm just going to write this out more
[00:09:09] is I'm just going to write this out more fully so I'll repeat
[00:09:14] this so we've got our Baseline here St
[00:09:18] this so we've got our Baseline here St and we're going to write out what this
[00:09:19] and we're going to write out what this expectation is okay this is an
[00:09:21] expectation is okay this is an expectation over the actions what
[00:09:23] expectation over the actions what actions are we taking we're exactly
[00:09:25] actions are we taking we're exactly taking the actions according to our
[00:09:26] taking the actions according to our policy
[00:09:36] okay okay so I've just Rewritten what
[00:09:38] okay okay so I've just Rewritten what that expectation is the expectation
[00:09:40] that expectation is the expectation we're taking over actions is exactly the
[00:09:42] we're taking over actions is exactly the probability we take each action
[00:09:44] probability we take each action according to our current
[00:09:46] according to our current policy okay but once we have that we can
[00:09:49] policy okay but once we have that we can apply the likelihood Ratio or we can
[00:09:51] apply the likelihood Ratio or we can think of you know what is this
[00:09:53] think of you know what is this derivative this derivative is equal to
[00:09:56] derivative this derivative is equal to the derivative of Pi of a given St over Pi of a given St
[00:10:02] the derivative of log is just going to
[00:10:04] the derivative of log is just going to
[00:10:06] be the derivative of the things
[00:10:10] be the derivative of the things inside divided
[00:10:14] by I should have
[00:10:16] by I should have added let me put a Theta in here to make
[00:10:19] added let me put a Theta in here to make it clear that all this is a function of
[00:10:21] it clear that all this is a function of my current Theta okay so I just took the
[00:10:24] my current Theta okay so I just took the derivative with respect to the log but
[00:10:26] derivative with respect to the log but when we see that we realize we can cross
[00:10:27] when we see that we realize we can cross those out so we can cross off this we
[00:10:30] those out so we can cross off this we can cross off this because that's the
[00:10:33] can cross off this because that's the same so now what do we have we have the
[00:10:35] same so now what do we have we have the expectation of our
[00:10:39] states B of St sum over a
[00:10:44] states B of St sum over a derivative remember taking the derivative with
[00:10:46] derivative remember taking the derivative with respect to Theta of Pi of a given St
[00:10:52] respect to Theta of Pi of a given St so this is what this looks like so
[00:10:54] thet so this is what this looks like so far and now what I'm going to do is I'm
[00:10:57] far and now what I'm going to do is I'm going to switch the derivative and the
[00:10:58] going to switch the derivative and the sum
[00:11:00] sum okay so I'm going to say this is B of St
[00:11:04] okay so I'm going to say this is B of St times the derivative of the sum over
[00:11:10] a okay why was that important because
[00:11:13] a okay why was that important because now and let me
[00:11:18] just that we know here that the sum over
[00:11:22] just that we know here that the sum over all actions we could take at this time
[00:11:23] all actions we could take at this time step has to sum to
[00:11:25] step has to sum to one because that's true for any policy
[00:11:28] one because that's true for any policy so this
[00:11:32] has to equal one so which means we have
[00:11:35] has to equal one so which means we have B of St times the derivative with respect to theta of 1
[00:11:39] B of St times the derivative with respect to theta of 1 which is equal to zero because that's a
[00:11:44] constant let me just write it here
[00:11:46] constant let me just write it here because it's more neat okay so there's
[00:11:50] because it's more neat okay so there's two sort of I guess there's two main
[00:11:52] two sort of I guess there's two main insights for how we did this proof the
[00:11:54] insights for how we did this proof the first was we thought about our
[00:11:55] first was we thought about our expectation over all trajectories and we
[00:11:57] expectation over all trajectories and we broke it up to the the part of the
[00:11:59] broke it up to the the part of the trajectory that happened before sort of
[00:12:01] trajectory that happened before sort of the state of Interest the St and the
[00:12:03] the state of Interest the St and the part that happened afterwards after we
[00:12:05] part that happened afterwards after we did that we showed that we could rewrite
[00:12:08] did that we showed that we could rewrite that expectation just in terms of a
[00:12:11] that expectation just in terms of a because we didn't care about all the
[00:12:12] because we didn't care about all the future stuff this only depends on at and
[00:12:14] future stuff this only depends on at and St and then we take the
[00:12:17] St and then we take the derivative and then we can see that we
[00:12:19] derivative and then we can see that we can switch these and then that just
[00:12:21] can switch these and then that just becomes one and it's a constant that
[00:12:22] becomes one and it's a constant that doesn't depend on Theta and so the Der
[00:12:25] doesn't depend on Theta and so the Der respect to Theta of it is
[00:12:27] respect to Theta of it is zero and so that is why introducing a
[00:12:31] zero and so that is why introducing a baseline that only depends on the state
[00:12:33] baseline that only depends on the state does not introduce any bias because an
[00:12:35] does not introduce any bias because an expectation its value is
[00:12:37] expectation its value is zero so that allows us to dve what we
[00:12:37] So that allows us to derive what we often call a vanilla policy gradient method, which incorporates both the temporal structure and a baseline. The idea in this case is that we're going to take our current policy, which is parameterized by theta, and roll it out a number of times. Then, for each time step t in each trajectory, we're going to compute the return (our Monte Carlo estimate), and then we can compute the advantage estimate, which is that return from that state till the end of the episode minus our baseline.
[00:13:11] And then (someone asked me about this last time) generally we're going to be refitting the baseline each time, so we can re-estimate the baseline. Again, it doesn't matter what we pick for the baseline, it will always be unbiased, but there will be better or worse choices. You can imagine that if the baseline is zero it will never make any difference; the goal is to have a baseline that's pretty informative and has a value close to the value of your policy.
[00:13:38] And so then we'll update the policy using our policy gradient estimate, which is a sum of these terms: we're going to use all of these terms where we've got that derivative with respect to theta of the log of the policy times our advantage, and then repeat. This is like a vanilla policy gradient algorithm.
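Those steps (roll out, compute Monte Carlo returns, subtract a refitted baseline, and step along grad log pi times the advantage) can be sketched on a toy problem. This is an illustration, not the exact algorithm from the slides: the two-action, one-step "environment", the learning rates, and the moving-average baseline refit are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

theta = np.zeros(2)                   # policy parameters: one logit per action
mean_reward = np.array([1.0, 0.2])    # assumed rewards; action 0 is better
baseline = 0.0                        # b, re-estimated between iterations

for _ in range(500):
    grads, returns = [], []
    for _ in range(5):                # roll out the current policy a few times
        pi = softmax(theta)
        a = rng.choice(2, p=pi)
        G = mean_reward[a]            # episodes have length 1, so return = reward
        advantage = G - baseline      # advantage estimate: return minus baseline
        grads.append((np.eye(2)[a] - pi) * advantage)  # grad log pi(a) * advantage
        returns.append(G)
    baseline = 0.9 * baseline + 0.1 * np.mean(returns)  # refit the baseline
    theta += 0.1 * np.sum(grads, axis=0)                # gradient ascent step

print(softmax(theta))                 # most probability mass should be on action 0
```

Note there is no discount factor here, matching the episodic setting discussed next.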
[00:14:01] Yeah? There is no discount factor in the return; is that intentional?
[00:14:08] Yeah, good question. So the question was that there's no discount factor. There's no discount factor right now because we're assuming we're in the fully episodic case, so we don't have to have one; you could certainly include one if you want to. So for right now, here, we don't have a discount factor.
[00:14:23] Now, one thing that you might think about when you're starting to look at this is to say, well, a lot of this feels like the Monte Carlo estimation that we did earlier in the class. We've been using these G estimators, this return, to estimate the performance of the policy from this time step to the end of the episode. But as you might imagine, in this case that generally is a pretty noisy estimate. Okay, so then the question is going to be: could we maybe do better?
[00:14:58] So there are two things here that we could imagine plugging in other choices for: what is the return of the current policy from a particular action till the end of the episode, and what is my general estimate of the performance in that state.
[00:15:14] So one thing you can imagine here is that, if we think back to Q functions and value functions, maybe we could plug those in instead of using the return and a generic baseline. So instead of saying what the return is from this state and action to the end of the episode, you can imagine plugging in the Q value of the current policy from the state and action to the end of the episode, and we can either make gamma equal to one or not. We're going to generally assume for now that we're episodic, so you can set gamma equal to one. And the state value function could be a good baseline.
[00:16:00] And just to remember here, on this slide you can think of G as kind of being like a Q function and b as being a value function. So this would be an alternative we could do.
[00:16:09] alternative we could do okay so let's think about how we
[00:16:12] do okay so let's think about how we generally could sort of reduce variants
[00:16:15] generally could sort of reduce variants so what we've seen so far is we're
[00:16:16] so what we've seen so far is we're mostly using Monti Carlo like returns
[00:16:18] mostly using Monti Carlo like returns now let's see if we can do something
[00:16:21] better so one thing we can do now is
[00:16:24] better so one thing we can do now is we're going to try to plug in and use
[00:16:27] we're going to try to plug in and use things like State action values and this
[00:16:28] things like State action values and this this is where the idea of actor critic
[00:16:30] this is where the idea of actor critic methods come in which are also really
[00:16:32] methods come in which are also really popular in reinforcement
[00:16:35] popular in reinforcement learning so the idea here is that we
[00:16:38] learning so the idea here is that we could do we could reduce the variance at
[00:16:40] could do we could reduce the variance at this estimate of the value function at a
[00:16:42] this estimate of the value function at a single St from a single roll out by
[00:16:45] single St from a single roll out by bootstrapping or doing function
[00:16:47] bootstrapping or doing function approximation at that point so you could
[00:16:50] approximation at that point so you could think back to like deep Q learning or
[00:16:52] think back to like deep Q learning or something like that as a way for us to
[00:16:53] something like that as a way for us to approximate what the value might be or
[00:16:56] approximate what the value might be or just general sort of deep learning for
[00:16:58] just general sort of deep learning for the for the value
[00:17:00] the for the value function so when we do this we end up
[00:17:03] function so when we do this we end up with what is called actor critic methods
[00:17:05] with what is called actor critic methods the idea is that the actor is the policy
[00:17:11] the idea is that the actor is the policy so the
[00:17:13] so the actor is the policy Often parameterized
[00:17:17] actor is the policy Often parameterized by Theta and the value function or the
[00:17:19] by Theta and the value function or the sttion value function is the
[00:17:22] sttion value function is the critic and it's representing a v or a q
[00:17:26] critic and it's representing a v or a q function so that's why they're called
[00:17:28] function so that's why they're called actor critic factor is our policy
[00:17:30] actor critic factor is our policy parameterization critic is our state
[00:17:32] parameterization critic is our state action value and the great thing is that
[00:17:35] action value and the great thing is that we can use both of those inside of of a
[00:17:38] we can use both of those inside of of a policy grading algorithm so you are
[00:17:40] policy grading algorithm so you are constantly updating an estimate of the
[00:17:42] constantly updating an estimate of the state action value as well as having an
[00:17:45] state action value as well as having an explicit policy parameterization and you
[00:17:47] explicit policy parameterization and you use them together to hopefully um
[00:17:49] use them together to hopefully um increase the rate at which we learn to
[00:17:50] increase the rate at which we learn to get a good policy now in this case
[00:17:53] get a good policy now in this case normally what we're doing here is we're
[00:17:54] normally what we're doing here is we're Gathering data using the policy and then
[00:17:57] Gathering data using the policy and then we're using that data to fit a
[00:18:00] we're using that data to fit a Critic okay and the reason we call it a
[00:18:02] Critic okay and the reason we call it a Critic is because the critic is sort of
[00:18:04] Critic is because the critic is sort of trying to estimate the performance of
[00:18:06] trying to estimate the performance of the an explicit um representation of the
[00:18:09] the an explicit um representation of the performance of the policy so the actor
[00:18:12] performance of the policy so the actor makes decisions and the critic says
[00:18:14] makes decisions and the critic says that's how good it was that's why it's
[00:18:15] that's how good it was that's why it's called actor critic a3c um is a pretty
[00:18:19] called actor critic a3c um is a pretty popular actor critic method there's
[00:18:20] popular actor critic method there's quite a lot of others so many of the
[00:18:22] quite a lot of others so many of the reinforcement learning algorithms will
[00:18:24] reinforcement learning algorithms will end up being essentially actor critic
[00:18:25] end up being essentially actor critic algorithms and so it'll be useful to
[00:18:27] algorithms and so it'll be useful to have both representations so if you
[00:18:29] have both representations so if you think of it you'd have a sort of a deep
[00:18:30] think of it you'd have a sort of a deep neural network to represent your policy
[00:18:32] neural network to represent your policy and you'd have a deep neural network a
[00:18:33] and you'd have a deep neural network a separate one could be could you you
[00:18:35] separate one could be could you you could share parameters but you don't
[00:18:36] could share parameters but you don't have to to represent your value
[00:18:39] have to to represent your value function all right once we do that we
[00:18:43] function all right once we do that we can um think of rewriting our policy
[00:18:45] can um think of rewriting our policy gradient formulas so this was what we
[00:18:48] gradient formulas so this was what we had before we could approximate this now
[00:18:50] had before we could approximate this now as saying well what if we just plugged
[00:18:53] as saying well what if we just plugged in instead of that return G which is
[00:18:55] in instead of that return G which is that sum over the rewards we plugged in
[00:18:57] that sum over the rewards we plugged in a q function and we plugged in a
[00:18:59] a q function and we plugged in a parameterized q function with a
[00:19:01] parameterized q function with a parameter W so these were our
[00:19:04] parameter W so these were our weights so now just to highlight here
[00:19:06] weights so now just to highlight here we're going to have these two sets of
[00:19:07] we're going to have these two sets of parameters W and and Theta Theta for the
[00:19:11] parameters W and and Theta Theta for the policy W for the the value
[00:19:13] policy W for the the value function and um if we let the Baseline
[00:19:16] function and um if we let the Baseline be an estimate of the V then we can just
[00:19:19] be an estimate of the V then we can just directly write down sort of a state
[00:19:20] directly write down sort of a state action Advantage
[00:19:23] action Advantage function where we look at the difference
[00:19:25] function where we look at the difference between the Q and The
[00:19:26] between the Q and The V and so now V is serving as our our
[00:19:29] V and so now V is serving as our our Baseline and I'll just highlight here
[00:19:31] Baseline and I'll just highlight here that using the advantage function was
[00:19:32] that using the advantage function was one of the first things that I think
[00:19:34] one of the first things that I think there was um got best paper maybe in
[00:19:37] there was um got best paper maybe in 2016 right after deep Q learning started
[00:19:40] 2016 right after deep Q learning started coming out and people thought about
[00:19:41] coming out and people thought about these different um adaptations one of
[00:19:44] these different um adaptations one of the things that was proposed is to think
[00:19:45] the things that was proposed is to think about trying to maximize with respect to
[00:19:48] about trying to maximize with respect to advantages but here we're going to be
[00:19:50] advantages but here we're going to be using that within a policy gradient
[00:19:53] using that within a policy gradient approach okay now one of the things you
[00:19:56] approach okay now one of the things you might wonder here is like okay well
[00:19:57] might wonder here is like okay well we've got these extremes on the one hand
[00:19:59] we've got these extremes on the one hand you could have this Monte Carlo return
[00:20:01] you could have this Monte Carlo return of what is the value of the state in
[00:20:03] of what is the value of the state in action that you get from starting that
[00:20:05] action that you get from starting that state in action and rolling out to the
[00:20:06] state in action and rolling out to the end of the episode and the other is you
[00:20:08] end of the episode and the other is you could plug in a q function now there
[00:20:11] could plug in a q function now there might be some sort of blending between
[00:20:12] might be some sort of blending between these
[00:20:14] these two so these are known as endep
[00:20:17] two so these are known as endep estimators so a Critic in general
[00:20:19] estimators so a Critic in general doesn't have to pick sort of a temporal
[00:20:21] doesn't have to pick sort of a temporal difference temporal difference in the
[00:20:23] difference temporal difference in the way that we've seen it so far is
[00:20:24] way that we've seen it so far is normally so this is I'll just write down
[00:20:25] normally so this is I'll just write down here it's often we've seen it as td0
[00:20:30] here it's often we've seen it as td0 which means we have the immediate reward
[00:20:33] which means we have the immediate reward plus gamma * V of S Prime so TD Z you
[00:20:37] plus gamma * V of S Prime so TD Z you take you sing of your immediate reward
[00:20:39] take you sing of your immediate reward and then you plug in or bootstrap
[00:20:40] and then you plug in or bootstrap immediately when you say like and I saw
[00:20:42] immediately when you say like and I saw the next state and I'm going to plug in
[00:20:44] the next state and I'm going to plug in my value for that next state so if you
[00:20:46] my value for that next state so if you think back to our tree representation
[00:20:48] think back to our tree representation it's like you see one actual observe
[00:20:50] it's like you see one actual observe return and then you plug in your
[00:20:52] return and then you plug in your estimate but in general you could trade
[00:20:54] estimate but in general you could trade off between taking one step or taking
[00:20:56] off between taking one step or taking two steps or three steps and then play
[00:20:58] two steps or three steps and then play plugging in your estimate of the
[00:21:01] plugging in your estimate of the return so in particular here's a number
[00:21:05] return so in particular here's a number of different types of um estimators you
[00:21:08] of different types of um estimators you could look at so you could have let's
[00:21:11] could look at so you could have let's call this R hat one which is you get
[00:21:14] call this R hat one which is you get your met reward this is what we've seen
[00:21:15] your met reward this is what we've seen before plus gamma then you bootstrap a
[00:21:18] before plus gamma then you bootstrap a second one would be you take your next
[00:21:20] second one would be you take your next two again your gamma can be one or not
[00:21:23] two again your gamma can be one or not um and then you plug in it and sort of R
[00:21:27] um and then you plug in it and sort of R hat Infinity would be your normal Carlo
[00:21:29] hat Infinity would be your normal Carlo return which is you don't do any
[00:21:30] return which is you don't do any bootstrapping you just sum up all your
[00:21:33] bootstrapping you just sum up all your rewards and you can think of each of
[00:21:35] rewards and you can think of each of these as being estimates of your Q
[00:21:37] these as being estimates of your Q function and then you could just get
[00:21:39] function and then you could just get Advantage estimators where you subtract
[00:21:41] Advantage estimators where you subtract off the V of your current state in each
[00:21:43] off the V of your current state in each of those
[00:21:45] of those settings so those are all things you
[00:21:47] settings so those are all things you could do they're called nstep estimators
[00:21:49] could do they're called nstep estimators where n is sort of the the number of
[00:21:50] where n is sort of the the number of time steps until you
[00:21:52] time steps until you bootstrap and one of the important
[00:21:54] bootstrap and one of the important things to think about is where you might
[00:21:55] things to think about is where you might want to tradeoff between these ones so
[00:21:58] want to tradeoff between these ones so we'll do just a check your understanding
[00:22:00] we'll do just a check your understanding now if we think about introducing these
[00:22:02] now if we think about introducing these type of advant um Blended estimators how
[00:22:05] type of advant um Blended estimators how does bias and variance
[00:22:09] does bias and variance tradeoff so but we go ahead and um do
[00:22:11] tradeoff so but we go ahead and um do that
[00:22:27] now for
[00:23:13] can you like more than one you should be
[00:23:15] can you like more than one you should be able to that work okay good
[00:23:32] I'll give you one more minute to think
[00:23:33] I'll give you one more minute to think about it and put in your answer and then
[00:23:36] about it and put in your answer and then there's a lot of um variability in what
[00:23:38] there's a lot of um variability in what people are saying and so why don't you
[00:23:42] people are saying and so why don't you talk to your neighbor in a
[00:23:50] second all right turn to neighbor
[00:23:52] second all right turn to neighbor compare what we got
[00:24:25] what do you think about it
[00:24:34] Yeah, you are subtracting the b function, but if you think back to Monte Carlo and TD methods, they have the same bias?
[00:24:51] Well, that is the question: do these both have the same bias? If you ignore for a second the first part of the estimate, do you think the first one has higher bias, or the second one?
[00:25:05] Yeah, I don't know why.
[00:25:08] If you think back, the first one should be like the temporal difference method and the second one should be like Monte Carlo. And remember, I guess it should be clear that these V's are all estimates, so they're not converged. So then the top one...
[00:25:30] Exactly. Exactly, yeah. And then what do you think that means for the variance?
[00:25:37] Um, I guess higher variance also?
[00:25:46] Close. So for the first one, you're totally right that it's got higher bias, because you're immediately bootstrapping, but in general it will have lower variance.
[00:25:54] Okay, is that just generally a trade-off?
[00:25:56] Yeah, normally the Monte Carlo methods, because you're summing things up, are totally unbiased, but they have lots of terms, so you can think of there being lots of stochasticity; and it's normally the other way around for the bootstrapped one.
[00:26:09] Oh okay, that makes sense.
[00:26:20] Cool. All right, I'm going to ask: did anybody change their mind after talking to someone? Okay, at least a few. Yep.
[00:26:32] All right, so one of the things I wanted to clarify here is that I find it easiest to think about this without subtracting the V, because the V is the same in both of these. So to understand which of them has high variance or high bias, you can just focus on the first parts, and when you look at just the first parts, it should remind you of Monte Carlo methods versus temporal difference methods. So the first one, which looks kind of like a TD(0) update, has low variance and high bias. A1: low variance, high bias. Does somebody want to share why that is?
[00:27:24] Sorry, which is it? Wasn't it just...
[00:27:30] Just to make sure, I'm going to explain what each of them has, but yes. Do you want to say, in the back?
[00:27:37] Yeah, I'm not entirely sure, but I think the intuition, at least when I was thinking about it, was that using the actual values of, for example, r_{t+1} is more accurate, versus if you bootstrap very early, that's more of an estimate; it's not as accurate.
[00:27:54] Yeah, that's exactly right. It's the same as in temporal difference methods. In general (it's a little misleading to look at this; maybe I should put hats over all of these next year) all of these V's are just estimates: given finite amounts of data and however many backups we've done, V is an approximation. So this is an estimate, it isn't exact, and if it's not exact it's probably biased; in general V will not be an unbiased estimator. And this means that we're only using one reward sampled from the true policy and then immediately bootstrapping, so in general this is going to be high bias, or higher bias, but it's generally going to be pretty low variance. The way to think about this is that each of the rewards in general comes from a stochastic process, because you're taking a series of state steps and at each one you sample a reward; so here you only have kind of one really random thing, and then one thing that is fixed. It might be wrong, but it's fixed.
[00:28:58] In contrast, this one, which looks like a Monte Carlo estimate, is going to have high variance, because it's got all of these different rewards that are all being sampled from a stochastic process.
[00:29:10] Think of it this way: imagine your robot can walk anywhere over the room, and under your policy it can go in all of these different directions. But when you actually execute one trajectory, you're just going to get one of those, and so its variance generally is enormous. That might be true even if on average you have a trajectory like this.
[00:29:36] So in general this one is going to be really high variance, but it's generally going to be low or zero bias. Why is it low or zero bias? Because this actually is a return from the policy that you're executing, so in expectation it really is equal to the value of that policy. Generally low or zero bias, but it can be really high variance, and that should maybe give some intuition.
[00:30:03] We're just discussing this in general: it's unfortunately going to be a trade-off between low variance and low bias. So often you're going to want things in terms of n, the n-step return, and if you want to minimize your mean squared error, you'll often end up wanting to do something like a couple of steps and then bootstrap, to get a nice trade-off between bias and variance. I think we'll probably talk a little bit more about that next week.
[00:30:31] Okay, so I'll just highlight this here: this one has low bias and high variance, this one has low variance and high bias, and the other one is the opposite. All right, cool.
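The one-step versus Monte Carlo contrast above can be checked numerically. This is a toy sketch (the five-step chain, the reward noise, and the deliberately wrong bootstrap value are all made up for illustration, not from the lecture): the Monte Carlo target sums several sampled rewards, so its variance grows with the horizon, while the one-step bootstrapped target has only one sampled reward but inherits bias from a wrong value estimate.

```python
import random

# Toy episodic chain: 5 steps, each reward r ~ N(1, 1), gamma = 1.
# True value of the start state is therefore 5.
GAMMA = 1.0
HORIZON = 5

def rollout():
    """Sample one trajectory's rewards from the made-up environment."""
    return [random.gauss(1.0, 1.0) for _ in range(HORIZON)]

def mc_target(rewards):
    """Monte Carlo target: the full sampled return (unbiased, high variance)."""
    return sum(GAMMA ** t * r for t, r in enumerate(rewards))

def td_target(rewards, v_next):
    """One-step target: one sampled reward plus a FIXED bootstrap estimate
    v_next. Low variance; biased whenever v_next is wrong."""
    return rewards[0] + GAMMA * v_next

random.seed(0)
mc = [mc_target(rollout()) for _ in range(10_000)]
# Bootstrap with a deliberately wrong estimate of the next state's value
# (true remaining value is 4.0, we use 3.5), so the target is biased.
td = [td_target(rollout(), v_next=3.5) for _ in range(10_000)]

def var(xs):
    """Population variance of a list of samples."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

print(var(mc))  # ~5: one unit of noise per reward in the return
print(var(td))  # ~1: only the single sampled reward is random
```

The mean of the one-step targets comes out near 4.5 rather than the true 5, showing the bias, while its variance is roughly a fifth of the Monte Carlo target's, which is the trade-off the lecture describes.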
[00:30:51] Okay, so just to think about this: when we're thinking about these targets, we can go between these different ones. (Oops, sorry, somehow these slides got copied.) These are all things that you can plug in; you can make different choices over whether you plug in these n-step methods or others, and then you can use all of this as part of your actor-critic policy gradient method.
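As a concrete illustration of where such a target plugs in, here is a minimal, made-up actor-critic update on a single transition (the two-state problem, the sigmoid actor, and all the numbers are assumptions for this sketch, not the course's implementation): the critic's one-step target produces a TD error, which is then used to weight the actor's policy gradient step.

```python
import math

theta = 0.0           # actor parameter: P(a=1 | s) = sigmoid(theta)
w = {0: 0.0, 1: 0.0}  # tabular critic V(s) for a toy 2-state problem
alpha_actor, alpha_critic, gamma = 0.1, 0.5, 0.99

def pi1(theta):
    """Probability of taking action 1 under the sigmoid policy."""
    return 1.0 / (1.0 + math.exp(-theta))

# Suppose we observed one transition (s=0, a=1, r=1.0, s'=1).
s, a, r, s_next = 0, 1, 1.0, 1

# Critic: one-step (bootstrapped) target, the low-variance choice above.
target = r + gamma * w[s_next]
delta = target - w[s]          # TD error, used as the advantage estimate
w[s] += alpha_critic * delta

# Actor: gradient of log pi for the sigmoid policy is (1 - pi1) for a=1
# and -pi1 for a=0; step in that direction weighted by the TD error.
grad_log_pi = (1 - pi1(theta)) if a == 1 else -pi1(theta)
theta += alpha_actor * delta * grad_log_pi

print(theta, w[0])  # theta = 0.05, w[0] = 0.5 for this single update
```

Swapping the one-step critic target for an n-step or Monte Carlo return changes only the `target` line, which is the "different choices you can plug in" point.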
[00:31:14] So now what we're going to do, and I'll just delete those slides, is go into more advanced policy gradient methods.
[00:31:21] Okay, so those are the basic ones, and they're kind of the backbone of all the algorithms that we use now. But there's been a lot of interest in these types of methods, in how we scale them up and how we make them better, and we'll talk about what we mean by better here.
[00:31:37] We'll probably talk about some of this next week, because I wanted to make sure that we got through the algorithm today, so that you guys have all the knowledge you need to start doing the implementation; then we'll do more on the theory next week.
[00:31:52] Okay, so why? So far we've been talking about policy gradients being great, and we know that they're used in some really important applications. Why do we have to go beyond the methods that we just saw? Well, there are a couple of different limitations.
[00:32:05] One is that the sample efficiency is generally poor: in general you have to do many rollouts with the same policy in order to get a good estimate of the gradient, because otherwise when you take a step you might get somewhere that's not as good, or you don't get to the place you want as quickly.
[00:32:30] And the other, and we're going to see an example of this shortly, is that distance in the parameter space doesn't necessarily equal distance in the policy space.
[00:32:38] This is a little bit of a weird idea, but if you have a policy parameterization, whether it's a deep neural network or something else, there are some parameters in there, and when you change your theta, you're going to get a different policy out. But is that really smooth? Say theta is a scalar, maybe theta is 0.7, and I change it to 0.75: does that smoothly change how much more I take a particular action? Or might it be really discontinuous; might it suddenly say, you were taking this action with 20% probability, I changed theta a little bit, and now I'm taking that action with 90% probability?
[00:33:15] The main idea here, and we'll see an example of this in a second, is that this mapping may not be smooth, which means that even really small changes in your theta, in your policy parameterization, might actually lead to really big differences in how you're making decisions. You could imagine your robot picks up things one way, then you change your theta a little bit, and suddenly it drives off a cliff. Okay, not quite that extreme, but it's not clear that as we smoothly take gradient steps in theta, that's going to smoothly change our policy decisions. Okay, so let's look at both of those.
[00:33:52] All right, sample efficiency. What we've been seeing so far is that we take our policy, we roll it out one or more times, then we take one gradient step, and then we roll out our policy again. In general it would be really nice to be able to take multiple gradient steps, but so far we have not seen that; it's a sort of on-policy expectation.
[00:34:20] This is similar to SARSA and other methods we've seen before, where you're learning about a policy and its value by actually executing it. So the problem is, let me just go back here: when we think about doing these gradients, we've assumed that we've gotten trajectories from the policy and the theta that we're at right now, and then we use that data to take a step. We're at some point, here's our theta, and we estimate the gradient from that point: we estimate it from trajectories that are generated under theta, and then we take a step.
[00:35:01] Now the problem is, we might now be here, and what we would like to do is take another step before getting more data. Let's call this theta prime. What we have is data from theta; we don't have any data from theta prime. So a priori it isn't clear that we could take more than one step. We can take one step because we have data about theta, and we estimate the gradient at that point. We would like to be able to continue to take gradient steps before we actually go out in the real world and gather more data, but it isn't obvious how to do that yet.
[00:35:35] So when we talk about policy gradient right now, we've been talking about on-policy methods, where we just try to estimate the gradient for the policy we just executed, similar to SARSA.
[00:35:48] So what we've been doing so far: we collect sample trajectories from the policy, then we form a sample estimate, which is pretty stable; we get our gradient, take a step, rinse and repeat.
[00:36:02] Another thing we could do, thinking about Q-learning or others, is: what if we could use that old data to estimate the gradient at some other theta, some other policy? This is known as off-policy estimation. It generally can start to be pretty unstable, and we're going to think about different ways we could do it, but we really would like to be able to use our old data to take multiple gradient steps before we actually have to gather more data.
[00:36:28] You could imagine that might end up allowing us to be much more data efficient, so that the total number of times we have to gather more data is much less. So we're going to think about a way today, and we'll talk more about this over the next few weeks as well, of how to use our old data to essentially move faster in our parameter space.
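One standard way of reusing old data, which this discussion is building toward, is importance sampling: reweight each sample collected under the old policy by the ratio of the new policy's probability to the old policy's probability. A minimal sketch for a two-action bandit (the sigmoid policy, the reward scheme, and the parameter values are made-up assumptions for illustration):

```python
import math
import random

def pi(theta, action):
    """Hypothetical two-action policy: P(a=1) = sigmoid(theta)."""
    p1 = 1.0 / (1.0 + math.exp(-theta))
    return p1 if action == 1 else 1.0 - p1

random.seed(0)
theta_old = 0.0   # policy the data was collected under
theta_new = 0.5   # policy we want to evaluate WITHOUT new rollouts

# Collect data under theta_old: sample actions, observe made-up rewards.
data = []
for _ in range(100_000):
    a = 1 if random.random() < pi(theta_old, 1) else 0
    r = 1.0 if a == 1 else 0.0   # action 1 pays 1, action 0 pays 0
    data.append((a, r))

# Off-policy estimate of theta_new's expected reward from theta_old's data:
# weight each sample by pi_new(a) / pi_old(a).
est = sum(r * pi(theta_new, a) / pi(theta_old, a) for a, r in data) / len(data)

# In this toy bandit the true value is computable exactly, for comparison.
true_value = pi(theta_new, 1) * 1.0
print(est, true_value)  # the two should be close
```

The catch, which motivates the "unstable" remark in the lecture, is that these ratios blow up the variance as the two policies move apart, which is exactly why methods that constrain the policy change become attractive.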
[00:36:49] Okay, here's the second big challenge. In general we're going to be doing stochastic gradient ascent with some step size; we've repeatedly thought of there being some sort of learning rate or step size. One of the challenges, and this was important for deep Q-learning, and we thought about it even for TD learning: what is the step size, how much do we update our estimate every time we get new data? It turns out it's much harder here.
[00:37:13] We saw before that under some pretty loose requirements on the learning rate we could guarantee convergence, at least in tabular cases. Policy gradient methods are a little bit different: here the step size really matters, and if we take a step size that is really quite bad, our performance can collapse. Does somebody have an idea of why that happens, why we can suddenly collapse?
[00:37:53] (Student, partly inaudible: something about going past the optimal target.) Yeah, that's great.
[00:38:01] So let's look at an example. Remember, the way we're getting our data is from our policy. If you overstep, say we're trying to get to this point and you have a big learning rate, so we took big steps, you might now get to a part of the space that is really bad: really, really bad policies, and by really bad policies I mean they have really bad value functions.
[00:38:27] If you have really bad value functions, and trajectories in which you're visiting states and actions that all have really bad reward, it's really hard to estimate a good gradient of where to go. You might be in a really long plateau, so the gradient here might be really hard to estimate, of how do I get back to the local optimum. In general it's not going to be impossible unless it's completely flat, but it might be really close to completely flat.
[00:38:52] So that's a big problem: you don't necessarily know how large your step size should be. On the other hand, if you use really small step sizes, that's bad too, because each time, your step size is basically determining how much you change your policy before you get new data. So you would like to take as big a step size as possible that doesn't overstep, and that allows you to quickly get to the local optimum. Now, things like Adam-style optimizers help, but they won't necessarily solve the problem.
[00:39:26] And one of the challenges here is that we're only getting information about the states and actions that are visited under our policy, and you just might get to regions where there's very little information to estimate those gradients.
[00:39:40] Here's another challenge, and this relates to the fact that we're taking steps in the theta space, not directly in terms of the actions we take when we update our policy. So let's think about a pretty simple parameterization: a logistic function, something like pi_theta(a1) = 1 / (1 + e^(-theta)).
[00:40:13] In this case, this is the parameterization. It's kind of like a softmax: you have some probability of going to one action, the rest of the probability goes to the other, and you've parameterized it with theta. So if theta equals 0, it's 50/50 whether you take a1 or a2; if theta equals 2, you suddenly take a2 much less; and if theta equals 4, you basically never take a2.
[00:40:36] And that's just because of how the relationship goes from theta to pi(a): in this parameterization it's pretty extreme. As we make relatively small changes to theta, I didn't make theta a million or anything, even going from 0 to 2 to 4, I've radically shifted how much of my probability mass goes onto a1.
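The numbers in this example follow directly from the logistic parameterization (assuming, as the numbers suggest, that the slide uses pi_theta(a1) = 1 / (1 + e^(-theta))):

```python
import math

def p_a1(theta):
    """Probability of action a1 under the logistic (sigmoid) policy."""
    return 1.0 / (1.0 + math.exp(-theta))

for theta in (0, 2, 4):
    print(theta, round(p_a1(theta), 3))
# theta = 0 -> 0.5    (50/50 between a1 and a2)
# theta = 2 -> 0.881  (a2 taken only ~12% of the time)
# theta = 4 -> 0.982  (a2 almost never taken)
```

A change of 2 in parameter space moved almost 40 percentage points of probability mass, which is the "small step in theta, big step in policy space" problem.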
[00:41:06] That's just to illustrate this issue of smoothness: I make what might be considered relatively small changes in my theta space, and that might make my policy near-deterministic, and we know that if our policy is deterministic, we can't learn anything about other actions. (Let me just make this a little smaller so you can see the question.)
[00:41:25] So the challenge in this case is that step size can matter a lot in terms of efficiency; we don't necessarily know what the right step size is, and it may be hard for us to know how small changes in our parameter space relate to changes in the action distributions we actually follow. What we'd really like to do here is come up with an update rule that doesn't over-change the policy too quickly, but still allows us to make rapid progress.
[00:41:52] So we'd like to move as far as we can, in a way that ideally is going to directly increase the value of our policy, and we'll see a bit more of that. Also, ideally we would like this all to be monotonic.
[00:42:07] Think back to the policy improvement algorithms that we've seen before: policy improvement for the tabular case, where we knew how the world worked, had this great property that every time we updated our policy, we got a better policy or we were done. Now the world is much more complicated; we've got these continuous parameterizations, and we're not guaranteed to get to the optimal policy. But it would be really cool if we could still guarantee that we're going to get monotonic improvement unless we reach a local optimum.
[00:42:36] The things that I've shown you so far don't necessarily have that property: they'll still converge to a local optimum, but you might overstep, like we see here; you might be having monotonic improvement and then crash, and then you have to go back and forth. So we have not guaranteed monotonic improvement so far, but that would be really nice, and it could be important in a lot of real-world domains. You could imagine that if you're using this for healthcare applications, you would really like to have monotonic improvement and not suddenly have performance crash.
[00:43:08] performance crash all right so let's think about how
[00:43:12] crash all right so let's think about how we might be able to get here and you can
[00:43:14] we might be able to get here and you can think of a lot of this lecture as
[00:43:15] think of a lot of this lecture as motivating the things that you're going
[00:43:17] motivating the things that you're going to be doing in homework too all right
[00:43:20] to be doing in homework too all right including the theory so in general what
[00:43:22] including the theory so in general what we'd like to have is we'd like to have
[00:43:23] we'd like to have is we'd like to have an update step that uses all the data
[00:43:25] an update step that uses all the data that we just got as efficiently as
[00:43:27] that we just got as efficiently as possible and that takes steps that sort
[00:43:30] possible and that take steps that sort of respect this distance in the policy
[00:43:32] of respect this distance in the policy like the decision space um as opposed to
[00:43:35] like the decision space um as opposed to just smoothness in the parameter
[00:43:37] just smoothness in the parameter space and and in order to do that we
[00:43:39] space and and in order to do that we need to sort of understand how does the
[00:43:42] need to sort of understand how does the performance of two policies relate so we
[00:43:45] performance of two policies relate so we have data from one policy and we're
[00:43:47] have data from one policy and we're considering trying to move to a new
[00:43:48] considering trying to move to a new policy and we'd really like to know okay
[00:43:51] policy and we'd really like to know okay given the data that I have from policy
[00:43:52] given the data that I have from policy one what does it tell me about how good
[00:43:54] one what does it tell me about how good policy 2 might be because ideally it would
[00:43:57] policy 2 might be because ideally it would allow policy 1's data to tell
[00:43:59] allow policy 1's data to tell us which policy 2 I should
[00:44:01] us which policy 2 I should move to next right so this is what
[00:44:04] move to next right so this is what you're proving in homework 2 you're
[00:44:06] you're proving in homework 2 you're going to prove the performance
[00:44:07] going to prove the performance difference
[00:44:08] difference lemma and in the performance difference
[00:44:10] lemma it allows us to relate the
[00:44:12] lemma it allows us to relate the performance of one policy um to the
[00:44:15] performance of one policy um to the
[00:44:18] performance of another policy given data
[00:44:20] performance of another policy given data from one of the policies let me just
[00:44:22] from one of the policies let me just State this out so what does this say
[00:44:25] State this out so what does this say this is you know the value of one
[00:44:28] this is you know the value of one policy policy one policy Pi Prime I'm
[00:44:33] policy policy one policy Pi Prime I'm just using J here um but this is the value
[00:44:37] just using J here um but this is the value you can think of this as just V
[00:44:39] you can think of this as just V so what is that equal to okay that is
[00:44:41] so what is that equal to okay that is equal to the expectation over
[00:44:44] equal to the expectation over trajectories that are sampled using pi
[00:44:48] trajectories that are sampled using pi Prime of and again you know if it's a if
[00:44:51] Prime of and again you know if it's a if this is if
[00:44:53] this is if it's finite you know we can use gamma equal to 1 okay
[00:44:59] you know we can use gamma equal to 1 okay we sum
[00:45:00] we sum over the distribution of trajectories
[00:45:02] over the distribution of trajectories you could get if you followed policy Pi
[00:45:05] you could get if you followed policy Pi Pi 1 times the advantage under policy
[00:45:10] Pi 1 times the advantage under policy Pi okay so this part here is just equal
[00:45:17] Pi okay so this part here is just equal to the difference between if you took an
[00:45:22] action according to Pi Prime minus what you would have gotten following Pi
[00:45:30] okay but note here because this
[00:45:32] okay but note here because this expectation here is over Pi Prime the
[00:45:35] expectation here is over Pi Prime the way we're selecting these actions just
[00:45:38] way we're selecting these actions just write it out a little bit
[00:45:41] more imagine we have deterministic
[00:45:45] more imagine we have deterministic policies so it's like we're thinking
[00:45:47] policies so it's like we're thinking about the Q value if we first take an
[00:45:49] about the Q value if we first take an action according to policy Pi Prime and
[00:45:52] action according to policy Pi Prime and then follow policy Pi for the rest of
[00:45:54] then follow policy Pi for the rest of Time Versus what we would have gotten if
[00:45:56] Time Versus what we would have gotten if we just followed policy Pi the whole
[00:45:57] we just followed policy Pi the whole time so you can think of it as sort of
[00:45:59] time so you can think of it as sort of like breaking down the difference in the
[00:46:01] like breaking down the difference in the value between two policies into a series
[00:46:03] value between two policies into a series of small differences of like well how
[00:46:05] of small differences of like well how much gain would I have gotten at this
[00:46:06] much gain would I have gotten at this step if I had taken the Pi Prime action
[00:46:09] step if I had taken the Pi Prime action instead of the one I actually took okay
[00:46:11] instead of the one I actually took okay what about here and then you kind of
[00:46:12] what about here and then you kind of like want to sum up all those additions
[00:46:14] like want to sum up all those additions like every day I'm happier because I
[00:46:16] like every day I'm happier because I went to Stanford instead of Harvard and
[00:46:18] went to Stanford instead of Harvard and I just add up all of those and that
[00:46:19] I just add up all of those and that tells me over the course of my whole
[00:46:21] tells me over the course of my whole career how much happier I will have
[00:46:23] career how much happier I will have been
[00:46:25] been hypothetically Okay so
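[Editor's note] Written out, the performance difference lemma described above takes the following standard form (the homework's exact notation may differ slightly):

```latex
J(\pi') - J(\pi)
  = \mathbb{E}_{\tau \sim \pi'}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, A^{\pi}(s_t, a_t)\right],
\qquad
A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s)
```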
[00:46:27] hypothetically Okay so this here is over trajectories now we're
[00:46:29] this here is over trajectories now we're going to make a transformation and move
[00:46:31] going to make a transformation and move it into State action
[00:46:33] it into State action distributions okay cuz
[00:46:35] distributions okay cuz um what what this is going to be here so
[00:46:38] um what what this is going to be here so now this was over trajectories we're
[00:46:40] now this was over trajectories we're going to rewrite this just in terms of
[00:46:42] going to rewrite this just in terms of State action distributions okay what
[00:46:46] State action distributions okay what we're going to say is all right as we
[00:46:47] we're going to say is all right as we think about adding up all these
[00:46:49] think about adding up all these advantages what I'm going to do is I'm
[00:46:51] advantages what I'm going to do is I'm going to Cluster together all the
[00:46:52] going to Cluster together all the advantages that have to do with the same
[00:46:54] advantages that have to do with the same state so I think of there as being a
[00:46:56] state so I think of there as being a distribution
[00:46:57] distribution Over States I might reach and actions I
[00:47:00] Over States I might reach and actions I might take under policy Pi Prime so if I
[00:47:03] might take under policy Pi Prime so if I just follow this policy I'm going to
[00:47:04] just follow this policy I'm going to visit some states and I'm just going to
[00:47:06] visit some states and I'm just going to think about what is the advantage in
[00:47:08] think about what is the advantage in each of those States um weighed by how
[00:47:11] each of those States um weighed by how frequently I visit them so we're
[00:47:14] frequently I visit them so we're transforming things from thinking of it
[00:47:15] transforming things from thinking of it as being like trajectories and thinking
[00:47:17] as being like trajectories and thinking about weighting over time steps to weighting
[00:47:19] about weighting over time steps to weighting over a finite set or you know a space
[00:47:22] over a finite set or you know a a space of states and actions and so we'll have
[00:47:25] of states and actions and so we'll have a distribution you know it might be like
[00:47:26] a distribution you know it might be like I visit State 1 half the time I visit
[00:47:29] I visit State 1 half the time I visit state 7 only you know one in 10,000
[00:47:32] state 7 only you know one in 10,000 times so this allows us to reweight the
[00:47:36] times so this allows us to reweight the advantages now what does this
[00:47:38] advantages how what does this distribution look like this looks like
[00:47:40] distribution look like this looks like the following essentially you just think
[00:47:42] the following essentially you just think of what is my so here we're allowing us
[00:47:44] of what is my so here we're allowing us to have discount factors because we're
[00:47:46] to have discount factors because we're looking for the infinite case but you
[00:47:48] looking for the infinite case but you can adjust this what this is just saying
[00:47:50] can adjust this what this is just saying is that well my sort of weighted
[00:47:52] is that well my sort of weighted distribution for State s is equal to
[00:47:55] distribution for State s is equal to well How likely was I to be in state s
[00:47:57] well How likely was I to be in state s on time step one under that policy well
[00:48:00] on time step one under that policy well How likely was I to be in State uh s in
[00:48:03] How likely was I to be in State uh s in time step two under that policy what if
[00:48:05] time step two under that policy what if I was in time step three and so you just
[00:48:07] I was in time step three and so you just sum up all of those and as you could
[00:48:09] sum up all of those and as you could imagine in the infinite Horizon case
[00:48:12] imagine in the infinite Horizon case you're going to you know those could
[00:48:13] you're going to you know those could easily go to Infinity particularly if
[00:48:14] easily go to Infinity particularly if you have States you can visit a lot um
[00:48:16] you have States you can visit a lot um and so the discount Factor here makes
[00:48:18] and so the discount Factor here makes sure this becomes a distribution so we
[00:48:20] sure this becomes a distribution so we still want it to be
[00:48:22] still want it to be normalized and then similarly this is
[00:48:24] normalized and then similarly this is also with respect to sampling under Pi
[00:48:28] also with respect to sampling under Pi Prime so you'll be you'll be proving
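[Editor's note] The weighted distribution being described here is the discounted future state distribution; in its standard form the 1 minus gamma factor is exactly what normalizes the geometric series so it sums to one:

```latex
d^{\pi}(s) = (1-\gamma)\sum_{t=0}^{\infty} \gamma^{t}\, \Pr(s_t = s \mid \pi)
```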
[00:48:30] Prime so you'll be you'll be proving this in the homework um but we'll see
[00:48:32] this in the homework um but we'll see how this can be helpful okay so why why
[00:48:35] how this can be helpful okay so why why are we making you do
[00:48:37] are we making you do this so the nice thing is is it's going
[00:48:40] this so the nice thing is is it's going to define the performance of Pi Prime in
[00:48:42] to define the performance of Pi Prime in terms of advantages from PI okay so that
[00:48:46] terms of advantages from PI okay so that seems good because we're like well we
[00:48:47] seems good because we're like well we have um we have an existing policy but the
[00:48:52] have um we have an existing policy but the problem is this still requires
[00:48:53] problem is this still requires trajectories sampled from PI Prime so you
[00:48:55] trajectory sampled from PI Prime so you could think of pi as potential being the
[00:48:57] could think of pi as potential being the policy we have right now and maybe we
[00:48:59] policy we have right now and maybe we can estimate the Q function for it we
[00:49:00] can estimate the Q function for it we can estimate its
[00:49:01] can estimate its advantages and now we want to figure out
[00:49:04] advantages and now we want to figure out how good would be this new policy we
[00:49:06] how good would be this new policy we might take but the problem is we don't
[00:49:08] might take but the problem is we don't have any trajectories from the new
[00:49:09] have any trajectories from the new policy we only have data from the old
[00:49:11] policy we only have data from the old policy okay so we really want to get to
[00:49:13] policy okay so we really want to get to something where we can estimate how good
[00:49:15] something where we can estimate how good is pi prime using data only from PI
[00:49:18] is pi prime using data only from PI that's our goal so like our our
[00:49:24] goal estimate
[00:49:28] goal estimate J Pi Prime
[00:49:31] J Pi Prime only from data from
[00:49:36] only from data from PI so you think of Pi Prime as the
[00:49:38] PI so you think of Pi Prime as the new policy so what we really want to do
[00:49:40] new policy so what we really want to do is sort of take a step so that like our
[00:49:42] is sort of take a step so that like our new policy is the best one we could get
[00:49:44] new policy is the best one we could get to in terms of its improvement over
[00:49:47] to in terms of its improvement over the previous policy and we want to be
[00:49:49] the previous policy and we want to be able to do that by only using an
[00:49:50] able to do that by only using an estimate from our old data and it
[00:49:53] estimate from our old data and it shouldn't be clear yet how we could do
[00:49:54] shouldn't be clear yet how we could do that like this is looking promising but
[00:49:56] that like this is looking promising but still seems like we need data so this
[00:49:59] still seems like we need data so this is
[00:50:01] is still data for the new
[00:50:06] policy so let's look at it from a
[00:50:07] policy so let's look at it from a different angle so this is this thing d
[00:50:11] different angle so this is this thing d d Pi of s is a distribution Over States
[00:50:13] d Pi of s is a distribution Over States it's a discounted future State
[00:50:16] it's a discounted future State distribution and we're going to use that
[00:50:18] distribution and we're going to use that to sort of rewrite the relative
[00:50:20] to sort of rewrite the relative policy performance
[00:50:22] relative policy performance identity so why is this relative it's
[00:50:24] identity so why is this relative it's because it's with respect to the per
[00:50:26] because it's with respect to the per performance of our current policy so
[00:50:28] performance of our current policy so that's why we have a subtraction
[00:50:31] that's why we have a subtraction there we're going to rewrite that
[00:50:35] there see if I okay yeah so I'm going to
[00:50:38] there see if I okay yeah so I'm going to step through this so what we can do at
[00:50:40] step through this so what we can do at this
[00:50:41] this point is we can rewrite it as follows so
[00:50:44] point is we can rewrite it as follows so right now remember this is in kind of
[00:50:46] right now remember this is in kind of like the time or the trajectory
[00:50:51] notation okay so I'm going to again
[00:50:54] notation okay so I'm going to again rewrite this so that I'm going to move
[00:50:56] rewrite this so that I'm going to move it into the state action representation
[00:50:58] it into the state action representation so instead of thinking about as
[00:50:59] so instead of thinking about as trajectories I'm going to think about it
[00:51:01] trajectories I'm going to think about it as what's the distribution Over States
[00:51:03] as what's the distribution Over States and actions that I'm visiting okay so
[00:51:05] and actions that I'm visiting okay so I'm going to write it as
[00:51:07] I'm going to write it as follows one 1 gamma
[00:51:11] follows 1 over 1 minus gamma that sum I can write as the
[00:51:16] expectation over S Prime
[00:51:19] expectation over S Prime sampled according to
[00:51:30] okay under let me actually write it this
[00:51:39] way okay what have I done there so this
[00:51:42] way okay what have I done there so this should look pretty similar to the
[00:51:43] should look pretty similar to the previous slide what I've said is okay
[00:51:45] previous slide what I've said is okay this is the discounted future State
[00:51:47] this is the discounted future State distribution I saw on the previous
[00:51:52] slide that I could
[00:51:54] slide that I could rewrite this expression as 1/ 1 - gamma
[00:51:58] rewrite this expression as 1 over 1 minus gamma E over s from this d and a so that's what I'm
[00:52:03] E over s from this d and a so that's what I'm doing
[00:52:05] doing now so I've Rewritten it in terms of
[00:52:07] now so I've Rewritten it in terms of this state action distribution okay and
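[Editor's note] In symbols, the state-action form of the identity being written on the board is the following standard expression:

```latex
J(\pi') - J(\pi)
  = \frac{1}{1-\gamma}\,
    \mathbb{E}_{s \sim d^{\pi'}}\,
    \mathbb{E}_{a \sim \pi'(\cdot \mid s)}\!\left[A^{\pi}(s,a)\right]
```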
[00:52:10] this state action distribution okay and this is where I'm in this problematic
[00:52:12] this is where I'm in this problematic case that I've got um the wrong let me
[00:52:14] case that I've got um the wrong let me make sure I put a quote there so this is
[00:52:17] make sure I put a quote there so this is just to be really
[00:52:18] just to be really clear this is over Pi Prime okay so this
[00:52:22] clear this is over Pi Prime okay so this is all with respect to the new policy so
[00:52:25] is all with respect to the new policy so now I'm going to note the following
[00:52:27] now I'm going to note the following okay so what does this look like this
[00:52:29] okay so what does this look like this looks like it's 1 over 1 minus
[00:52:31] looks like it's 1 over 1 minus gamma the expectation over S Prime
[00:52:35] gamma the expectation over S Prime sampled according to D Pi Prime and then
[00:52:38] sampled according to D Pi Prime and then what is this expectation so this is
[00:52:40] what is this expectation so this is going to be sum over a
[00:52:43] going to be sum over a pi of a given S
[00:52:48] pi of a given S Prime it's horrible notation let's try
[00:52:51] Prime it's horrible notation let's try this
[00:52:55] again okay that's what this means like
[00:52:57] again okay that's what this means like I'm taking an expectation for each of
[00:52:59] I'm taking an expectation for each of the states I'm taking an
[00:53:02] the states I'm taking an expectation um with respect here let me
[00:53:05] expectation um with respect here let me just try to make this
[00:53:07] just try to make this neat there we go okay so I'm saying
[00:53:12] neat there we go okay so I'm saying imagine I sample States from my D Pi
[00:53:14] imagine I sample States from my D Pi Prime distribution how do I do this
[00:53:17] Prime distribution how do I do this expectation over a sampled from PI Prime
[00:53:19] expectation over a sampled from PI Prime well I just sum of all the actions look
[00:53:21] well I just sum of all the actions look at the probability of me taking that
[00:53:22] at the probability of me taking that action for that State under Pi Prime
[00:53:24] action for that State under Pi Prime Times my advantage okay so now the key
[00:53:28] Times my advantage okay so now the key thing we're going to do is we're going
[00:53:29] thing we're going to do is we're going to try to change this so that we're
[00:53:32] to try to change this so that we're using um more we're going to try to get
[00:53:34] using um more we're going to try to get to a point where we only need data from
[00:53:36] to a point where we only need data from PI okay so the first thing we're going
[00:53:38] PI okay so the first thing we're going to do here is we're just going to
[00:53:40] to do here is we're just going to rewrite this and I'm going to multiply
[00:53:42] rewrite this and I'm going to multiply and divide by the same
[00:53:43] and divide by the same thing
[00:53:45] thing okay so I'm going to say this is pi
[00:53:48] okay so I'm going to say this is pi Prime of a given s over Pi of a given s times Pi of a
[00:53:53] Prime of a given s over Pi of a given s times Pi of a given s okay I've not done anything
[00:53:56] given s okay I've not done anything except that I've multiplied and divided
[00:53:57] except that I've multiplied and divided by the same thing why did I do that well
[00:54:00] by the same thing why did I do that well the good thing is I know how to get
[00:54:02] the good thing is is I know how to get samples from this I have samples from
[00:54:04] samples from this I have samples from this this is from my old data this is
[00:54:06] this this is from my old data this is from the actual policy that I took
[00:54:08] from the actual policy that I took before okay so what this says I can
[00:54:10] before okay so what this says I can write this as I've got 1 over 1us gamma
[00:54:13] write this as I've got 1 over 1us gamma e of s according to D Pi Prime and then
[00:54:17] e of s according to D Pi Prime and then I have this
[00:54:18] I have this expectation over a sampled according to
[00:54:22] expectation over a sampled according to Pi not Pi Prime of my reweighted
[00:54:29] advantages okay and what do I reweight them
[00:54:31] advantages okay and what do I reweight them by I reweight them exactly by the
[00:54:34] by I reweight them exactly by the probability I take that action under the
[00:54:35] probability I take that action under the new policy versus the old
[00:54:37] new policy versus the old policy and that's okay because I'm going
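[Editor's note] The multiply-and-divide step just described is, in one standard form, the importance-sampling identity over actions:

```latex
\mathbb{E}_{a \sim \pi'(\cdot \mid s)}\!\left[A^{\pi}(s,a)\right]
  = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\!\left[
      \frac{\pi'(a \mid s)}{\pi(a \mid s)}\, A^{\pi}(s,a)
    \right]
```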
[00:54:39] policy and that's okay because I'm going to assume that I have access to the
[00:54:41] to assume that I have access to the policy parameterization of the new
[00:54:42] policy parameterization of the new policy I'm just trying to figure out how
[00:54:44] policy I'm just trying to figure out how good it is I don't have samples from it
[00:54:47] good it is I don't have samples from it but if you tell me hey this is the
[00:54:49] but if you tell me hey this is the action you took in that state I can say
[00:54:50] action you took in that state I can say oh okay well that's how likely
[00:54:52] oh okay well that's how likely I would have taken that under my new
[00:54:53] I would have taken that under my new policy so I can do that reweighting and this
[00:54:56] policy so I can do that reweighting and this is an instance of something called
[00:54:58] is an instance of something called importance sampling and we're going to
[00:54:59] importance sampling and we're going to see a lot more about that soon but so
[00:55:01] see a lot more about that soon but so this is the first step I can do so this
[00:55:03] this is the first step I can do so this is great right because now I don't need
[00:55:04] is great right because now I don't need to have samples of from actions taken by
[00:55:08] to have samples of from actions taken by my new policy I can just reweight the
[00:55:10] my new policy I can just reweight the data I already have so that's super cool
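[Editor's note] To make that reweighting concrete, here is a minimal plain-Python sketch. Everything in it is a made-up toy example, not from the lecture: three states, two actions, hypothetical tabular policies `pi_old` and `pi_new`, hypothetical advantage estimates, and a uniform state distribution standing in for the true one:

```python
import random

random.seed(0)

# Hypothetical tabular policies: pi[s][a] = probability of action a in state s.
pi_old = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
pi_new = [[0.7, 0.3], [0.6, 0.4], [0.1, 0.9]]

# Hypothetical advantage estimates A^pi(s, a) for the OLD policy.
adv = [[1.0, -1.0], [0.5, -0.5], [-2.0, 2.0]]

def reweighted_advantage(n=200_000):
    """Estimate the expected advantage under pi_new using ONLY actions
    sampled from pi_old, reweighting each sample by pi_new(a|s)/pi_old(a|s)."""
    total = 0.0
    for _ in range(n):
        s = random.randrange(3)                         # toy uniform state dist
        a = 0 if random.random() < pi_old[s][0] else 1  # act with the OLD policy
        ratio = pi_new[s][a] / pi_old[s][a]             # importance weight
        total += ratio * adv[s][a]
    return total / n

# Exact value of E_s E_{a~pi_new}[A(s,a)] under the same state distribution.
exact = sum(pi_new[s][a] * adv[s][a] for s in range(3) for a in range(2)) / 3

print(reweighted_advantage(), exact)  # the estimate should be close to exact
```

The key point, as in the lecture, is that the loop only ever draws actions from the old policy; the new policy is consulted purely for its probabilities.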
[00:55:13] data I already have so that's super cool but there's a problem okay this is still
[00:55:16] but there's a problem okay this is still Pi Prime so I still don't have any data
[00:55:18] Pi Prime so I still don't have any data over States from my Pi Prime okay all
[00:55:23] over States from my Pi Prime okay all right so this is that just written out
[00:55:26] right so this is that just written out more neatly we'll see a lot more on that
[00:55:28] more neatly we'll see a lot more on that in the future okay but we still have
[00:55:30] in the future okay but we still have this big problem because we don't have
[00:55:32] this big problem because we don't have any States from PI Prime we have data
[00:55:36] any States from PI Prime we have data from
[00:55:37] from PI okay so what are we going to do about
[00:55:40] PI okay so what are we going to do about that well we're just going to ignore
[00:55:45] that well we're just going to ignore it always an option so um we're just
[00:55:50] it always an option so um we're just going to ignore that and proceed and
[00:55:53] going to ignore that and proceed and this is what is going to happen in in
[00:55:55] this is what is going to happen in in the rest of the class
[00:55:56] the rest of the class for the rest of this lecture we're just
[00:55:58] for the rest of this lecture we're just going to pretend that those states are
[00:56:00] going to pretend that those states are the
[00:56:02] the same now as you might imagine that is
[00:56:05] same now as you might imagine that is going to slightly induce some error in
[00:56:08] going to slightly induce some error in my
[00:56:09] my estimate when might that be bad well it
[00:56:11] estimate when might that be bad well it might be really bad right if the two
[00:56:13] might be really bad right if the two policies would actually visit totally
[00:56:14] policies would actually visit totally different parts of the state
[00:56:17] different parts of the state space but if they visit things that are
[00:56:19] space but if they visit things that are really close maybe it's not going to be
[00:56:21] really close maybe it's not going to be that bad okay so let's imagine that you
[00:56:23] that bad okay so let's imagine that you have you know policy one and it goes
[00:56:25] have you know policy one and it goes this is your real state space and it goes to most
[00:56:27] this is your real state space and it goes to most of this part of the space and then you
[00:56:29] of this part of the space and then you change your policy and maybe it also
[00:56:30] change your policy and maybe it also goes to that part of the space and it
[00:56:32] goes to that part of the space and it goes a little over here too but there's
[00:56:34] goes a little over here too but there's quite a bit of overlap okay the places
[00:56:37] quite a bit of overlap okay the places where it would be bad is something like
[00:56:39] where it would be bad is something like you know if your policy goes like this
[00:56:40] you know if your policy goes like this your new policy and then there's like no
[00:56:43] your new policy and then there's like no overlap in your state
[00:56:44] overlap in your state space so it's going to turn out that if
[00:56:47] space so it's going to turn out that if pi and Pi Prime are close and we're
[00:56:48] pi and Pi Prime are close and we're going to define what we mean by
[00:56:50] going to define what we mean by close then this is actually not a bad
[00:56:53] close then this is actually not a bad approximation it's not perfect it's not
[00:56:55] approximation it's not perfect it's not too bad
[00:56:57] too bad so in the paper they prove that we can
[00:57:01] so in the paper they prove that we can bound how bad this approximation is okay
[00:57:05] bound how bad this approximation is okay so in general so this is we're going to
[00:57:07] so in general so this is we're going to Define this to be L um math script L of
[00:57:11] Define this to be L um math script L of Pi with respect to Pi Prime if this was
[00:57:14] Pi with respect to Pi Prime if this was perfect this thing would be zero because
[00:57:18] perfect this thing would be zero because this would exactly equal this difference
[00:57:20] this would exactly equal this difference so this minus this would be the these
[00:57:24] so this minus this would be the these two things would exactly be zero but
[00:57:26] two things would exactly be zero but what it turns out is that um how far off
[00:57:29] what it turns out is that um how far off this appro excuse me how off how far off
[00:57:31] this appro excuse me how off how far off this approximation is depends bless you
[00:57:34] this approximation is depends bless you on the KL Divergence in the
[00:57:37] on the KL Divergence in the policies and and I'll Define what KL
[00:57:38] policies and and I'll Define what KL Divergence is in a second so in
[00:57:40] Divergence is in a second so in particular it depends on the KL
[00:57:42] particular it depends on the KL Divergence with respect to uh let me
[00:57:45] Divergence with respect to uh let me just undo that so I can just do this
[00:57:46] just undo that so I can just do this part with respect to States if you were
[00:57:49] part with respect to States if you were sampling them according to d Pi now d Pi
[00:57:52] sampling them according to d Pi now d Pi is good because d Pi is the actual
[00:57:55] is good because d Pi is the actual discounted states that we visit under
[00:57:57] discounted states that we visit under our current policy that means we
[00:57:59] our current policy that means we actually have data about it so I always
[00:58:01] actually have data about it so I always like to work with my in my lab we always
[00:58:03] like to work with my in my lab we always try to instantiate our theoretical
[00:58:05] try to instantiate our theoretical bounds because I feel like it's super
[00:58:06] bounds because I feel like it's super informative to be like oh is this like
[00:58:08] informative to be like oh is this like you know 10 to the 10 or is this like 0.5
[00:58:11] you know 10 to the 10 or is this like 0.5 um and one of the big questions
[00:58:13] um and one of the big questions often that comes up when we try to do
[00:58:14] often that comes up when we try to do this is that um sometimes you can't
[00:58:16] this is that um sometimes you can't instantiate your Bounds at all because
[00:58:18] instantiate your Bounds at all because it'll depend on constants you don't know
[00:58:19] it'll depend on constants you don't know so like it's a beautiful theorem but
[00:58:21] so like it's a beautiful theorem but like you can't even check how big it is
[00:58:23] like you can't even check how big it is the nice thing about this is like it's
[00:58:24] the nice thing about this is like it's checkable um at least this part is right
[00:58:27] checkable um at least this part is right because you can get you can look at your
[00:58:29] because you can get you can look at your actual trajectories from your current
[00:58:30] actual trajectories from your current policy and if you have a new policy Pi
[00:58:32] policy and if you have a new policy Pi Prime you can see and evaluate what your
[00:58:34] Prime you can see and evaluate what your KL Divergence will be so this is actually
[00:58:37] K Divergence will be so this is actually evaluatable so we'll see okay and I'll
[00:58:39] evaluatable so we'll see okay and I'll what c ah so C is a
[00:58:43] what c ah so C is a constant any information about its value
[00:58:46] constant any information about its value um no I will um I we're not going to
[00:58:48] um no I will um I we're not going to need it for now but um in the paper you
[00:58:51] need it for now but um in the paper you can read about sort of exactly what that
[00:58:52] can read about sort of exactly what that is yes good question though okay let's
[00:58:55] is yes good question though okay let's see what KL Divergence is so what this says is
[00:58:58] see what KL Divergence is so what this says is just at a high level is that this
[00:59:00] just at a high level is that this approximation is not totally insane if
[00:59:03] approximation is not totally insane if the policies are close um in fact this
[00:59:06] the policies are close um in fact this is going to be a pretty good
[00:59:07] is going to be a pretty good approximation so as we're going to see
[00:59:10] approximation so as we're going to see this is tight if the policies are
[00:59:12] this is tight if the policies are identical which is exactly where you'd
[00:59:13] identical which is exactly where you'd expect this to be tight so if your two
[00:59:16] expect this to be tight so if your two policies are identical their difference
[00:59:17] policies are identical their difference should be zero and this bound would tell
[00:59:19] should be zero and this bound would tell you it's zero all right what is KL
[00:59:21] you it's zero all right what is KL Divergence some of you guys might have
[00:59:22] Divergence some of you guys might have seen this before but in case you know
[00:59:24] seen this before but in case you know some people might be new to it so what KL
[00:59:26] some people might be new to it so what KL Divergence allows us to do is to compare
[00:59:29] Divergence allows us to do is to compare two probability distributions so in our
[00:59:32] two probability distributions so in our case what this is going to be is over
[00:59:35] case what this is going to be is over actions we will take so like you know Pi
actions we will take so like you know Pi of a given s versus Pi Prime of a given s so both of
of a given s versus Pi Prime of a given s so both of these are uh probability distributions
[00:59:44] these are uh probability distributions that sum to one for a particular State
[00:59:47] that sum to one for a particular State and so what what in our case what we
[00:59:49] and so what what in our case what we would be summing over here x would be a
would be summing over here x would be a so we'd be summing over all of these if you
so we'd be summing over all of these if you have the same um probability
have the same um probability distribution the KL divergence is zero
distribution the KL divergence is zero otherwise it's strictly positive it's
[01:00:01] otherwise it's strictly positive it's good to know it's not
[01:00:02] good to know it's not symmetric because we've made a choice
[01:00:05] symmetric because we've made a choice here in the ordering so just these are
[01:00:08] here in the ordering so just these are good properties to know about it comes
[01:00:09] good properties to know about it comes up all the time um in reinforcement
[01:00:12] up all the time um in reinforcement learning so in our case we can look at
learning so in our case we can look at for a particular State s what is the KL
for a particular State s what is the KL divergence in what the policies would do
divergence in what the policies would do for that particular state so that's what
[01:00:21] for that particular state so that's what we've got there so it says essentially
[01:00:23] we've got there so it says essentially like how different are the actions you
[01:00:24] like how different are the actions you would take now why is this good well
[01:00:27] would take now why is this good well we've been spending some time saying hey
[01:00:29] we've been spending some time saying hey we really don't want to think just about
[01:00:30] we really don't want to think just about what the how close we are in Theta space
[01:00:33] what the how close we are in Theta space we really want to get to thinking about
[01:00:35] we really want to get to thinking about how different are our actual policies
[01:00:36] how different are our actual policies like how different are the actual
[01:00:38] like how different are the actual decisions we make in particular States
[01:00:40] decisions we make in particular States and the nice thing is is what this bound
[01:00:42] and the nice thing is is what this bound says and is that the difference between
[01:00:46] says and is that the difference between two policies is not just about how close
[01:00:49] two policies is not just about how close you are in some parameters base it's
[01:00:51] you are in some parameters base it's really about how different are the
[01:00:53] really about how different are the decisions you'd make in all the states
[01:00:55] decisions you'd make in all the states you'd reach under your current state
[01:00:58] you'd reach under your current state distribution and so if you'd make all
[01:01:00] distribution and so if you'd make all the same decisions in the in the states
[01:01:01] the same decisions in the in the states you're already reaching your policy
[01:01:03] you're already reaching your policy value is going to be really similar if
[01:01:05] value is going to be really similar if you would make very different decisions
you would make very different decisions then your policy value might be
then your policy value might be really different because you sort of go
[01:01:10] really different because you sort of go off and explore really different parts
[01:01:11] off and explore really different parts of the state space okay so this is
[01:01:15] of the state space okay so this is really elegant um and now we have
[01:01:19] really elegant um and now we have something where we can just use our old
[01:01:21] something where we can just use our old data so we have our old data we can use
[01:01:24] data so we have our old data we can use it to estimate what what the performance
[01:01:27] it to estimate what what the performance Improvement would be if we try to get to
[01:01:29] Improvement would be if we try to get to a new
[01:01:30] a new policy and so what you might imagine in
policy and so what you might imagine in this case is you could use this to
this case is you could use this to search over or decide which new policy
[01:01:36] search over or decide which new policy to try Okay this allows you to compute
[01:01:39] to try Okay this allows you to compute it for lots of different Pi Prime
[01:01:41] it for lots of different Pi Prime doesn't just have to be the pi Prime for
[01:01:42] doesn't just have to be the pi Prime for one gradient step it says in general
[01:01:46] one gradient step it says in general even outside of policy gradient methods
[01:01:48] even outside of policy gradient methods you can
[01:01:49] you can evaluate the um value of of changing
[01:01:52] evaluate the um value of of changing your policy to Pi Prime with respect to
[01:01:53] your policy to Pi Prime with respect to your current performance by the
[01:01:56] your current performance by the expression and this will be more or less
[01:01:58] expression and this will be more or less tight depending on how different your
[01:01:59] tight depending on how different your policy is at making decisions in states
[01:02:02] policy is at making decisions in states that you would currently reach okay and
[01:02:05] that you would currently reach okay and we'll talk more about this we haven't
[01:02:06] we'll talk more about this we haven't talked about this yet so we will talk
[01:02:15] about
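The idea of scoring many candidate policies with one batch of old data can be sketched like this (the ratios and advantages below are hypothetical numbers, just to illustrate the computation):

```python
import numpy as np

def surrogate_objective(ratios, advantages):
    """Estimate of the policy-improvement surrogate from data collected
    under the current policy: the average of the probability ratio
    pi'(a|s) / pi(a|s) times the advantage estimate A(s, a)."""
    return float(np.mean(np.asarray(ratios) * np.asarray(advantages)))

# Hypothetical per-sample quantities from old-policy rollouts
advantages = np.array([0.5, -0.2, 0.1, 0.3])
ratios_candidate_1 = np.array([1.1, 0.9, 1.0, 1.2])  # pi'_1(a|s) / pi(a|s)
ratios_candidate_2 = np.array([0.8, 1.3, 1.0, 0.9])  # pi'_2(a|s) / pi(a|s)

# The same batch scores many candidate policies -- no new rollouts needed
print(surrogate_objective(ratios_candidate_1, advantages))
print(surrogate_objective(ratios_candidate_2, advantages))
```

The approximation is only trustworthy when the candidate stays close to the data-collecting policy, which is what the KL term in the bound accounts for.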
[01:02:17] about okay this also relates to some really
[01:02:20] okay this also relates to some really nice literatures from kind of the last
[01:02:22] nice literatures from kind of the last 20 years of thinking about how do we do
20 years of thinking about how do we do monotonic policy Improvement in
monotonic policy Improvement in sort of policy gradient methods and
[01:02:28] sort of policy gradient methods and policy search methods it also relates to
[01:02:30] policy search methods it also relates to the notion of a trust region which is
[01:02:32] the notion of a trust region which is the sort of this idea of how um when
[01:02:35] the sort of this idea of how um when you're changing your policy how far can
[01:02:37] you're changing your policy how far can you go and still sort of trust the
[01:02:38] you go and still sort of trust the performance of it and trust you can get
[01:02:39] performance of it and trust you can get Improvement so there's a bunch of
[01:02:41] Improvement so there's a bunch of different nice papers related to this
[01:02:43] different nice papers related to this okay let's talk about the
[01:02:45] okay let's talk about the algorithm proximal policy optimization
[01:02:47] algorithm proximal policy optimization is going to be inspired by all the
[01:02:49] is going to be inspired by all the things that we just talked about so what
[01:02:51] things that we just talked about so what we want to do is it wants to be able to
[01:02:53] we want to do is it wants to be able to take multiple gradient steps and it
[01:02:55] take multiple gradient steps and it wants to be able to do this in a way so
[01:02:57] wants to be able to do this in a way so that we don't overstep we try to focus
[01:02:59] that we don't overstep we try to focus on um policy
[01:03:01] on um policy parameterization uh in terms of the
[01:03:03] parameterization uh in terms of the actual decisions that we make so there
[01:03:05] actual decisions that we make so there are two different variants one is it
[01:03:09] are two different variants one is it solves an unconstrained optimization
[01:03:11] solves an unconstrained optimization problem where it uses this
[01:03:13] problem where it uses this approximation so that's the
[01:03:15] approximation so that's the approximation we had on the previous
[01:03:16] approximation we had on the previous slides I'll just write down from prior
[01:03:21] slides I'll just write down from prior slides where we use data from the
[01:03:23] slides where we use data from the current policy and we kind of add up
[01:03:25] current policy and we kind of add up these advant these weighted Advantage
[01:03:27] these advant these weighted Advantage functions so what it says is well the
[01:03:29] functions so what it says is well the thing you want to do is you want to pick
[01:03:30] thing you want to do is you want to pick the policy um that maximizes our
[01:03:33] the policy um that maximizes our estimated difference subject to a
estimated difference subject to a constraint on the KL divergence
constraint on the KL divergence because it's realizing that that L
because it's realizing that that L approximation is going to get worse and
approximation is going to get worse and worse as the KL divergence gets really
worse as the KL divergence gets really large so it's sort of directly
[01:03:51] incorporating this
[01:03:53] incorporating this bound so it's thinking like okay I want
[01:03:56] bound so it's thinking like okay I want to think about what this is but then I
[01:03:57] to think about what this is but then I also have to consider the fact that my
[01:03:59] also have to consider the fact that my estimate might be off by as much of this
[01:04:01] estimate might be off by as much of this sort of square root
[01:04:03] sort of square root K so you really want to sort of improve
[01:04:06] K so you really want to sort of improve with respect to something that considers
[01:04:08] with respect to something that considers both of those right so this is what this
both of those right so this is what this is one version of PPO this is not the
is one version of PPO this is not the way most people do PPO we'll see the
way most people do PPO we'll see the other really common one but it's it's a
[01:04:17] other really common one but it's it's a nice Baseline to know about and here
[01:04:20] nice Baseline to know about and here when we think about what that KL is as
[01:04:23] when we think about what that KL is as you might have noticed KL is defined for
[01:04:25] you might have noticed KL is defined for a single state so for a single state we
[01:04:28] a single state so for a single state we can say what is the distribution over
[01:04:30] can say what is the distribution over actions i' take in one policy versus
[01:04:32] actions i' take in one policy versus another but of course we have lots of
[01:04:34] another but of course we have lots of states and so what we can do here is we
[01:04:36] states and so what we can do here is we can take an expectation over the kale
[01:04:39] can take an expectation over the kale over all the states we'd visit and that
[01:04:41] over all the states we'd visit and that was part of the the theoretical bound
[01:04:43] was part of the the theoretical bound too another really important thing you
[01:04:45] too another really important thing you can see here
[01:04:46] can see here is this
is this weighting between trying to
weighting between trying to optimize kind of this policy Improvement
optimize kind of this policy Improvement while respecting this kind of KL penalty
while respecting this kind of KL penalty and you can change this over each
[01:04:59] and you can change this over each iteration to kind of approximately
iteration to kind of approximately satisfy the KL divergence
satisfy the KL divergence constraint so this does not guarantee
[01:05:05] constraint so this does not guarantee that you will this does not guarantee
[01:05:08] that you will this does not guarantee that you're going to get monotonic
[01:05:09] that you're going to get monotonic Improvement but it's trying to get
[01:05:10] Improvement but it's trying to get towards
[01:05:11] towards that so let's see how that works so this
[01:05:14] that so let's see how that works so this is the
[01:05:15] is the algorithm what it does is you can
[01:05:17] algorithm what it does is you can compute the advantages using any
compute the advantages using any advantage estimation algorithm you
advantage estimation algorithm you compute your policy update and you can
[01:05:22] compute your policy update and you can do K steps with that so the nice thing
[01:05:25] do K steps with that so the nice thing is is that
[01:05:26] is is that you know you can use your old data here
[01:05:28] you know you can use your old data here and you can take multiple gradient
[01:05:32] and you can take multiple gradient steps after you do this you can also
steps after you do this you can also check if your KL divergence for your new
check if your KL divergence for your new resulting policy is large if it is then
[01:05:40] resulting policy is large if it is then you may increase the penalty if it's
[01:05:43] you may increase the penalty if it's small um you can decrease the penalty
[01:05:46] small um you can decrease the penalty and that just allows us to sort of
and that just allows us to sort of trade-off between how much we're paying
trade-off between how much we're paying attention to this KL divergence
attention to this KL divergence constraint versus
[01:05:53] constraint versus not and as as I noted here you know you
not and as as I noted here you know you might violate the KL constraint but most
might violate the KL constraint but most of the time they don't
of the time they don't empirically okay this is one reasonable
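The adaptive penalty rule just described can be sketched as follows (the 1.5x thresholds and the factor of 2 are common illustrative choices, not values given in the lecture):

```python
def update_kl_penalty(beta, measured_kl, kl_target):
    """Adapt the KL penalty coefficient after a policy update:
    raise it when the new policy drifted too far from the old one,
    lower it when the policies stayed close."""
    if measured_kl > 1.5 * kl_target:
        return beta * 2.0   # KL too large: pay more attention to the constraint
    if measured_kl < kl_target / 1.5:
        return beta / 2.0   # KL small: allow bigger policy steps
    return beta             # roughly on target: leave the penalty alone

print(update_kl_penalty(1.0, measured_kl=0.05, kl_target=0.01))   # -> 2.0
print(update_kl_penalty(1.0, measured_kl=0.001, kl_target=0.01))  # -> 0.5
print(update_kl_penalty(1.0, measured_kl=0.01, kl_target=0.01))   # -> 1.0
```

As noted, this only approximately enforces the constraint: an individual update can still violate the KL target, and the penalty catches up on the next iteration.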
[01:06:02] empirically okay this is one reasonable thing to do based on everything we've
[01:06:03] thing to do based on everything we've seen now we're going to see something
[01:06:05] seen now we're going to see something else which is inspired by that it's a
[01:06:06] else which is inspired by that it's a much more common thing to do um which is
[01:06:10] much more common thing to do um which is uh well let me just highlight here
[01:06:11] uh well let me just highlight here multiple gradient steps is really good
[01:06:14] multiple gradient steps is really good so this is you know the one of the
[01:06:16] so this is you know the one of the benefits is that we're not just taking a
[01:06:17] benefits is that we're not just taking a single gradient step we're taking
[01:06:19] single gradient step we're taking multiple so just to really highlight
[01:06:20] multiple so just to really highlight that all right what is the other thing
[01:06:23] that all right what is the other thing we want to do here um we haven't talked
[01:06:25] we want to do here um we haven't talked about natural gradients um just but for
[01:06:28] about natural gradients um just but for any of you that are familiar with these
any of you that are familiar with these these are another way to try to
these are another way to try to think about taking gradient steps um
[01:06:33] think about taking gradient steps um we're not going to talk about that for
[01:06:35] we're not going to talk about that for now so the other thing we do is a
[01:06:37] now so the other thing we do is a clipped objective so what we're going to
[01:06:41] clipped objective so what we're going to look at in this case is remember how we
[01:06:43] look at in this case is remember how we talked about we had this kind of ratio
[01:06:44] talked about we had this kind of ratio between this is what we're using to
[01:06:46] between this is what we're using to weight our advantage function was the
[01:06:48] weight our advantage function was the the difference between How likely you
[01:06:50] the difference between How likely you were to take that action under our old
[01:06:52] were to take that action under our old policy versus our new and we're using it
policy versus our new and we're using it to weight our advantage
to weight our advantage function what the clipping does is it
[01:06:59] function what the clipping does is it says well I don't want this to get too
[01:07:02] says well I don't want this to get too high or too
[01:07:03] high or too low okay this could become really high
[01:07:06] low okay this could become really high or really low when my policies are
[01:07:08] or really low when my policies are really
[01:07:09] really different if my policy is really you can
[01:07:13] different if my policy is really you can imagine that um if my policy puts really
[01:07:17] imagine that um if my policy puts really low probability on something that the
[01:07:18] low probability on something that the current policy puts high probability on
[01:07:21] current policy puts high probability on this ratio here is going to go towards
this ratio here is going to go towards this R is going to go to about
this R is going to go to about zero and if this puts very high let's
[01:07:30] zero and if this puts very high let's say this is one and this puts very low
[01:07:32] say this is one and this puts very low probability on that this could be
[01:07:34] probability on that this could be extremely large this could be like 10
extremely large this could be like 10 to the 6 this could be very very large and
to the 6 this could be very very large and both of those are being used to weight
[01:07:42] both of those are being used to weight the advantage function right so your
[01:07:43] the advantage function right so your advantage function could be getting
[01:07:44] advantage function could be getting shrunk towards zero or it could be
[01:07:46] shrunk towards zero or it could be getting blown up by a factor of say 10
getting blown up by a factor of say 10 to the 6 if you have a big difference in
to the 6 if you have a big difference in the actions you would take under one
[01:07:52] the actions you would take under one policy versus another and in general we
[01:07:54] policy versus another and in general we don't like where like you know we're
[01:07:56] don't like where like you know we're thinking of policy gradients where we
[01:07:58] thinking of policy gradients where we might have terms that are exploding or
[01:08:01] might have terms that are exploding or Vanishing and that's part of the point
Vanishing and that's part of the point of the KL divergence constraint is to
of the KL divergence constraint is to say you want your policies to stay close
[01:08:07] say you want your policies to stay close so the clipping is sort of inspired by
[01:08:09] so the clipping is sort of inspired by this general idea but says well maybe
[01:08:11] this general idea but says well maybe something similar we can do is we're
[01:08:12] something similar we can do is we're just going to clip we're just going to
[01:08:13] just going to clip we're just going to say you can't have weighted Advantage
[01:08:16] say you can't have weighted Advantage terms that are going towards um you know
[01:08:19] terms that are going towards um you know infinity or minus infinity or
[01:08:21] infinity or minus infinity or zero and so if this ratio is too extreme
[01:08:25] zero and so if this ratio is too extreme I'm just going to clip it I'm going to
[01:08:26] I'm just going to clip it I'm going to not allow it to be less than um 1 minus
[01:08:29] not allow it to be less than um 1 minus Epsilon or greater than 1 plus
[01:08:32] Epsilon or greater than 1 plus Epsilon and and Epsilon is just a
[01:08:34] Epsilon and and Epsilon is just a hyperparameter and essentially it's sort
[01:08:36] hyperparameter and essentially it's sort of meaning that you know your policy
[01:08:39] of meaning that you know your policy might change further than that but
[01:08:40] might change further than that but that's not going to benefit your loss
[01:08:43] that's not going to benefit your loss function so this again is going to sort
[01:08:45] function so this again is going to sort of constrain your policy class to stay
of constrain your policy class to stay within this kind of
within this kind of region for which it's making similar
region for which it's making similar decisions so we're still really
decisions so we're still really focusing on what actions are we actually
[01:08:55] focusing on what actions are we actually taking are we taking similar actions in
[01:08:57] taking are we taking similar actions in these states as we would be normally
[01:08:59] these states as we would be normally regardless of how much my Theta is
[01:09:02] regardless of how much my Theta is changing okay and then you just do your
[01:09:04] changing okay and then you just do your policy update by taking an argmax over
[01:09:06] policy update by taking an argmax over this so this is your clipped objective
[01:09:10] this so this is your clipped objective function all right so let's see how this
[01:09:12] function all right so let's see how this works let's think about what it's doing
[01:09:16] works let's think about what it's doing so we'll do um a quick check your
[01:09:19] so we'll do um a quick check your understanding so this shows you what L
[01:09:22] understanding so this shows you what L clip does this is L clip as well
[01:09:25] clip does this is L clip as well and what I'd like you to think about is
[01:09:28] and what I'd like you to think about is what does this look like depending on
[01:09:30] what does this look like depending on the advantage
[01:09:32] the advantage function so L let me just write down
function so let me just write down L clip okay so this is
L clip okay so this is R so on the x-axis is R and then on the
R so on the x-axis is R and then on the y-axis is L clip and what this is asking
[01:09:45] y axis is L clip and what this is asking you to think about this is a from their
[01:09:47] you to think about this is a from their paper um is to think about what does
[01:09:50] paper um is to think about what does clipping do in terms of your sort of
[01:09:53] clipping do in terms of your sort of loss
[01:09:56] loss and so I'd like you to think about in
[01:09:57] and so I'd like you to think about in this case which of these two if either
this case which of these two if either um match when the advantage function
um match when the advantage function is positive or the advantage function is
[01:10:06] is positive or the advantage function is negative so a here is the advantage let
[01:10:09] negative so a here is the advantage let me just make that clear a is equal to
[01:10:12] me just make that clear a is equal to the advantage so just think of this for
[01:10:14] the advantage so just think of this for a single term
[01:10:18] a single term consider for one term okay so just just
[01:10:23] consider for one term okay so just just this part okay so just for one single um
[01:10:29] this part okay so just for one single um RTA what is happening here and what and
[01:10:31] RTA what is happening here and what and just to be clear here what we're doing
[01:10:32] just to be clear here what we're doing in this cases we're taking the minimum
[01:10:34] in this cases we're taking the minimum between the normal thing we do which is
[01:10:36] between the normal thing we do which is this reweighted Advantage function times
[01:10:39] this reweighted Advantage function times a clip of the r time the advantage
[01:10:41] a clip of the r time the advantage function
[01:11:04] and feel free to like you know flip back
[01:11:05] and feel free to like you know flip back and forth or play with
[01:11:07] and forth or play with numbers just to sort of get some
[01:11:09] numbers just to sort of get some intuition of what is this doing to our
[01:11:12] intuition of what is this doing to our um you know our loss
[01:11:18] function or I shouldn't say loss um our
[01:11:21] function or I shouldn't say loss um our objective function in this case we're
objective function in this case we're trying to take the argmax of it so
trying to take the argmax of it so thinking of this is sort of an
[01:11:26] thinking of this is sort of an approximation of how much is our policy
[01:11:28] approximation of how much is our policy going to improve when we change our
[01:11:30] going to improve when we change our Theta so we're going to want to take a
[01:11:31] Theta so we're going to want to take a Max of this over with respect to a new
[01:11:33] Max of this over with respect to a new policy Theta and we want to think about
[01:11:36] policy Theta and we want to think about this is sort of bounding what that new
[01:11:39] this is sort of bounding what that new performance benefit could be and how
[01:11:42] performance benefit could be and how does that vary with respect to the
[01:11:54] advantage for
[01:12:28] all right nobody thinks it um depends on
all right nobody thinks it um depends on the value of Epsilon which is correct so this
the value of Epsilon which is correct so this is um does not depend on the value of Epsilon
is um does not depend on the value of Epsilon I want you to turn to your neighbor and see
[01:12:38] want you turn to your neighbor and see if you got the same thing
[01:14:30] oh cool it looks like talking converged
[01:14:32] oh cool it looks like talking converged most people which is great um so does
[01:14:35] most people which is great um so does someone want so the first one is
[01:14:38] someone want so the first one is correct so this is a greater than zero
[01:14:42] correct so this is a greater than zero this is a less than zero does someone
[01:14:44] this is a less than zero does someone want to explain
[01:14:49] why those of you that voted got it right
[01:14:57] well it is quite simple because the
well it is quite simple because it's simply like a coefficient on the uh
it's simply like a coefficient on the uh value so it has a positive slope and it
[01:15:07] value so it has a positive slope and it is positive and negative slope and
[01:15:10] is positive and negative slope and negative yeah so if we just focus on
[01:15:12] negative yeah so if we just focus on this for a being equal to greater than
[01:15:15] this for a being equal to greater than zero um what will happen is as that
[01:15:18] zero um what will happen is as that ratio R gets higher and higher and
[01:15:20] ratio R gets higher and higher and higher um you'll just linearly go up
[01:15:23] higher um you'll just linearly go up because it's just um you know something
[01:15:25] because it's just um you know something between 0 and one that's getting larger
[01:15:26] between 0 and one that's getting larger and larger um R can never be negative so
[01:15:30] and larger um R can never be negative so it's just useful to see in this case so
[01:15:32] it's just useful to see in this case so as it getting larger and larger just
[01:15:33] as it getting larger and larger just going to increase your L clip value but
[01:15:36] going to increase your L clip value but at some point you're going to run up
[01:15:38] at some point you're going to run up against this part and so at this part
[01:15:43] against this part and so at this part you're going to clip it and you can't
[01:15:44] you're going to clip it and you can't get higher
[01:15:46] get higher anymore remember in general we're always
[01:15:48] anymore remember in general we're always trying to maximize L clip in this case
[01:15:50] trying to maximize L clip in this case when the advantage is negative you're
[01:15:52] when the advantage is negative you're trying to sort of reduce the amount of
[01:15:55] trying to sort of reduce the amount of um probability Mass you put on that
[01:15:57] um probability Mass you put on that action because you don't want to take
[01:15:58] action because you don't want to take that anymore like oh I got a negative
[01:16:00] that anymore like oh I got a negative Advantage so I need to stop doing that
[01:16:02] Advantage so I need to stop doing that so essentially you want to be sweeping
[01:16:04] so essentially you want to be sweeping changing your policy in the opposite
[01:16:06] changing your policy in the opposite direction you want to you'd really like
[01:16:07] direction you want to you'd really like to be able to push R all away to zero
[01:16:09] to be able to push R all away to zero and say I never want to do that action
[01:16:10] and say I never want to do that action again like it gave me a negative
[01:16:11] again like it gave me a negative Advantage but you can't do that because
[01:16:13] Advantage but you can't do that because that might change you know radically
[01:16:15] that might change you know radically change your policy and so once you get
[01:16:18] change your policy and so once you get to one minus Epsilon you cap it and you
[01:16:20] to one minus Epsilon you cap it and you can't further shrink it to
[01:16:23] can't further shrink it to zero great
[01:16:26] zero great okay
[01:16:28] so and another way to see this is from
[01:16:31] so and another way to see this is from the paper is like you can think about
[01:16:32] the paper is like you can think about sort of these different types of
[01:16:33] sort of these different types of constraints and different clipping and
[01:16:35] constraints and different clipping and essentially again it's sort of like
[01:16:36] essentially again it's sort of like making the objective pessimistic as you
[01:16:38] making the objective pessimistic as you get really far from
[01:16:40] get really far from Theta now just in the last couple
[01:16:42] Theta now just in the last couple minutes I want to make sure to show some
[01:16:43] minutes I want to make sure to show some plots so this is just the same algorithm
[01:16:47] plots so this is just the same algorithm um but where we're doing clipping I will
[01:16:49] um but where we're doing clipping I will just note here that next week so next
[01:16:53] just note here that next week so next time
[01:16:55] time we'll discuss next time we're going to
[01:16:57] we'll discuss next time we're going to discuss um the choice of a further just
[01:17:00] discuss um the choice of a further just like what we saw earlier today you can
like what we saw earlier today you can do n-step estimators Etc you can do
do n-step estimators Etc you can do what's called generalized Advantage
[01:17:05] what's called generalized Advantage estimation um you don't need to know
[01:17:07] estimation um you don't need to know that for this we we'll cover that for
[01:17:09] that for this we we'll cover that for today but we'll cover it more next week
[01:17:10] today but we'll cover it more next week so just to note there's some additional
[01:17:12] so just to note there's some additional choices here but let's just look at sort
[01:17:14] choices here but let's just look at sort of what the performance is so at this
of what the performance is so at this point TRPO and some other algorithms
point TRPO and some other algorithms were out there um they have the PPO
were out there um they have the PPO clipping in purple this is a number of
clipping in purple this is a number of different MuJoCo domains similar to
different MuJoCo domains similar to the MuJoCo domains you're going to be
the MuJoCo domains you're going to be working with what you can see here is
[01:17:31] working with what you can see here is that in general so trpo um has this
[01:17:34] that in general so trpo um has this trust region idea and it's similar in
[01:17:36] trust region idea and it's similar in some ways to what we're what they're
[01:17:38] some ways to what we're what they're doing in po but trust region policy
[01:17:40] doing in po but trust region policy optimization is quite a bit more
[01:17:42] optimization is quite a bit more complicated and what you can see here is
[01:17:44] complicated and what you can see here is that like this light purple which is po
[01:17:47] that like this light purple which is po um this is the number of steps uh is
[01:17:50] um this is the number of steps uh is generally doing just much better than
generally doing just much better than these other ones so it's not that
these other ones so it's not that they're going to get to a better
they're going to get to a better optimum they're just saying we're going
optimum they're just saying we're going to be able to get there much faster with
[01:17:59] to be able to get there much faster with much less data and in some of these
[01:18:01] much less data and in some of these cases this is really you know an
[01:18:03] cases this is really you know an enormous performance Improvement for the
[01:18:05] enormous performance Improvement for the same amount of data so so this is one of
[01:18:09] same amount of data so so this is one of the reasons why it became extremely
[01:18:11] the reasons why it became extremely popular is it is a pretty simple
[01:18:14] popular is it is a pretty simple algorithm to implement and it has
[01:18:16] algorithm to implement and it has extremely good performance in many
[01:18:18] extremely good performance in many cases now I do just want to so um you
[01:18:21] cases now I do just want to so um you can read you can go to the original
can read you can go to the original paper proximal policy optimization algorithms
paper proximal policy optimization algorithms or go to the blog post from uh 2017 I do
[01:18:28] or go to the blog post from uh 2017 I do think one thing that's good to know and
think one thing that's good to know and really throughout much of recent RL
[01:18:33] really throughout much of ourl recent history is to also understand what are
[01:18:34] history is to also understand what are the implementation details so this is
[01:18:36] the implementation details so this is optional you don't have to read it but
[01:18:38] optional you don't have to read it but there was a nice paper that followed up
[01:18:39] there was a nice paper that followed up from um this work because again po has
[01:18:42] from um this work because again po has been hugely influential uh from some
[01:18:45] been hugely influential uh from some colleagues at MIT that said well really
[01:18:46] colleagues at MIT that said well really which of these things are most important
[01:18:49] which of these things are most important because in general in these algorithms
[01:18:50] because in general in these algorithms there will be these things but then
[01:18:51] there will be these things but then there's also some hyperparameters or you
[01:18:53] there's also some hyperparameters or you know architecture choices Etc and so
[01:18:56] know architecture choices Etc and so knowing how these choices are made often
[01:18:58] knowing how these choices are made often do make a big difference in in reality
[01:19:01] do make a big difference in in reality and so that's always good to know it's
[01:19:02] and so that's always good to know it's whether or not like is it an algorithmic
[01:19:04] whether or not like is it an algorithmic improvement or are there additional
[01:19:05] improvement or are there additional things that we're not treating as part
[01:19:06] things that we're not treating as part of the algorithm but actually are really
[01:19:08] of the algorithm but actually are really important for practical performance so
[01:19:10] important for practical performance so you should know everything now that um
[01:19:12] you should know everything now that um you need to for making good progress on
[01:19:14] you need to for making good progress on homework 2 and we'll continue to discuss
[01:19:16] homework 2 and we'll continue to discuss this next week thanks
Lecture 007
Stanford CS234 Reinforcement Learning I Policy Search 3 I 2024 I Lecture 7
Source: https://www.youtube.com/watch?v=4ngb0IZTg8I
---
Transcript
[00:00:05] hi everybody welcome back we're going to
[00:00:07] hi everybody welcome back we're going to be um talking more about policy gradient
[00:00:09] be um talking more about policy gradient methods today and then starting to talk
[00:00:10] methods today and then starting to talk about imitation learning but we'll do a
[00:00:12] about imitation learning but we'll do a quick refresh your understanding to
[00:00:28] start for
[00:01:03] I think everyone agrees that it will not
[00:01:05] I think everyone agrees that it will not necessarily converge to a global Optima
[00:01:07] necessarily converge to a global Optima so that's great um this uh some
[00:01:09] so that's great um this uh some different opinions about some of the
[00:01:10] different opinions about some of the other ones so maybe turn to a neighbor
[00:01:12] other ones so maybe turn to a neighbor and talk about whether a baseline term
[00:01:13] and talk about whether a baseline term can help to reduce the variance and
[00:01:15] can help to reduce the variance and whether or not after one step of policy
[00:01:17] whether or not after one step of policy gradient the resulting policy can get
[00:01:19] gradient the resulting policy can get worse
[00:01:49] the
[00:02:11] you're
[00:02:49] my
[00:02:52] seems not
[00:03:20] okay great all right so I think a lot of
[00:03:22] okay great all right so I think a lot of people um remembered from last time that
[00:03:24] people um remembered from last time that in general a baseline term does help
[00:03:26] in general a baseline term does help with the variance and that's one of the
[00:03:28] with the variance and that's one of the reasons we are adding it
[00:03:31] reasons we are adding it um you can initialize it with a sub uh
[00:03:34] um you can initialize it with a sub uh you can't initialize it with a
[00:03:35] you can't initialize it with a deterministic policy does somebody want
[00:03:37] deterministic policy does somebody want to say why it's the problem with
[00:03:39] to say why it's the problem with deterministic
[00:03:42] policies yes and
[00:03:44] policies yes and my uh potentially there's like an action
[00:03:46] my uh potentially there's like an action that the policy will never take so it's
[00:03:49] that the policy will never take so it's it's not able to reach a low minimum or
it's not able to reach a low minimum or optimum right yeah exactly as you said
optimum right yeah exactly as you said so if you are only taking actions
[00:03:56] so if you are only taking actions deterministically you're never going to
[00:03:57] deterministically you're never going to know about what other actions are in
[00:03:58] know about what other actions are in that state so if you're
that state so if your current policy is suboptimal you won't
current policy is suboptimal you won't get to the optimum and then the
get to the optimum and then the last one is something we're going to
[00:04:04] last one is something we're going to talk more about today uh so in general
[00:04:07] talk more about today uh so in general it can get worse this is
[00:04:10] it can get worse this is true um in general we're not guaranteed
[00:04:12] true um in general we're not guaranteed to have monotonic Improvement we would
[00:04:13] to have monotonic Improvement we would like to have monotonic Improvement but
[00:04:16] like to have monotonic Improvement but um in general policy gradient doesn't
um in general policy gradient doesn't guarantee that but last time uh with PPO
guarantee that but last time uh with PPO we saw things that are trying to get
[00:04:21] we saw things that are trying to get more towards that kind of monotonic
[00:04:24] more towards that kind of monotonic improvement okay
[00:04:26] improvement okay great so what we're going to be doing
great so what we're going to be doing today is we're going to talk a little
today is we're going to talk a little bit more about PPO and some of the sort
bit more about PPO and some of the sort of theoretical underpinnings of it as
[00:04:35] of theoretical underpinnings of it as well as another feature about it that we
[00:04:36] well as another feature about it that we didn't talk about last time and then
[00:04:38] didn't talk about last time and then we're going to start talking about
[00:04:39] we're going to start talking about imitation
[00:04:42] learning okay so first and and all of
learning okay so first and and all of you guys are going to be implementing PPO
you guys are going to be implementing PPO as part of your homework as well as
[00:04:49] as part of your homework as well as reinforce so you'll get a chance to
[00:04:50] reinforce so you'll get a chance to practice with this um we're first going
[00:04:52] practice with this um we're first going to talk about generalized Advantage
[00:04:55] to talk about generalized Advantage estimation so first let's just refresh
estimation so first let's just refresh our memory of um some of the
our memory of um some of the challenges with policy gradients that
challenges with policy gradients that motivated PPO and a whole bunch of other
motivated PPO and a whole bunch of other research on sort of better policy
[00:05:05] research on sort of better policy gradient methods Beyond
[00:05:07] gradient methods Beyond reinforce um so in general we're we're
reinforce um so in general we're we're using data to optimize over the policy space
using data to optimize over the policy space and we're just going to do stochastic
[00:05:13] and we're just going to do stochastic gradient descent to try to get to a good
[00:05:16] gradient descent to try to get to a good value um a good a policy with a good
[00:05:19] value um a good a policy with a good value the challenges is that in general
[00:05:21] value the challenges is that in general when we did reinforce the sample
[00:05:23] when we did reinforce the sample efficiency was poor we had to run get
[00:05:26] efficiency was poor we had to run get data from one policy take a single
[00:05:28] data from one policy take a single gradient step and then get more data
[00:05:30] gradient step and then get more data from the new policy and as we were just
[00:05:33] from the new policy and as we were just discussing I think was mentioning this
[00:05:34] discussing I think was mentioning this or maybe um that the distance in the
[00:05:37] or maybe um that the distance in the parameter space is generally not equal
[00:05:38] parameter space is generally not equal to the distance in the action space um
[00:05:41] to the distance in the action space um so sort of the policy space so when you
[00:05:43] so sort of the policy space so when you make a small change in the parameters it
[00:05:45] make a small change in the parameters it might really change the type of actions
[00:05:47] might really change the type of actions you
[00:05:51] take so in proximal policy optimization
[00:05:54] take so in proximal policy optimization we saw two different ways to try to make
[00:05:56] we saw two different ways to try to make it so that we could essentially take
[00:05:59] it so that we could essentially take bigger steps steps in between each run
[00:06:01] bigger steps steps in between each run of when we um execute a policy uh but do
[00:06:04] of when we um execute a policy uh but do so in a way that would try to encourage
[00:06:07] so in a way that would try to encourage monotonic
[00:06:08] monotonic Improvement and so we saw this bound
[00:06:11] Improvement and so we saw this bound we're going to come back to that very
[00:06:12] we're going to come back to that very shortly which looked at sort of how we
[00:06:14] shortly which looked at sort of how we could try to approximate um the
[00:06:15] could try to approximate um the performance of a new policy under only
[00:06:18] performance of a new policy under only using the data that we have right now so
[00:06:20] using the data that we have right now so that's an instance of sort of off policy
[00:06:23] that's an instance of sort of off policy estimation and the bound showed that
estimation and the bound showed that this relates to the KL divergence
this relates to the KL divergence between the actual action taken under
[00:06:30] between the actual action taken under the new policy versus the old
the new policy versus the old policy and in PPO you could either
policy and in PPO you could either do this adaptive KL penalty which
do this adaptive KL penalty which sort of says don't go too far from your
[00:06:38] sort of says don't go too far from your previous policy in terms of the actual
[00:06:40] previous policy in terms of the actual actions it takes um or a clipped
[00:06:43] actions it takes um or a clipped objective which is going to do something
[00:06:48] similar right so one thing that you
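A rough sketch of the adaptive KL penalty variant just recapped (my own illustration with assumed names; the 1.5x threshold and doubling/halving rule follow the heuristic in the PPO paper):

```python
import numpy as np

def kl_categorical(p, q):
    """KL(p || q) for discrete action distributions (assumes q > 0 everywhere)."""
    return float(np.sum(p * np.log(p / q)))

def adapt_kl_coeff(beta, kl, kl_target):
    """Adaptive KL penalty coefficient update: grow the penalty when the new
    policy drifted too far from the old one, shrink it when the step was
    overly conservative, leave it alone otherwise."""
    if kl > 1.5 * kl_target:
        return beta * 2.0
    if kl < kl_target / 1.5:
        return beta / 2.0
    return beta

# The penalized objective (sketch) is the surrogate minus beta * KL(old, new).
old_pi = np.array([0.5, 0.5])
new_pi = np.array([0.9, 0.1])
kl = kl_categorical(old_pi, new_pi)
print(adapt_kl_coeff(1.0, kl, kl_target=0.01))  # drifted far, so beta doubles
```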
[00:06:51] similar right so one thing that you probably noticed in particularly if you
[00:06:52] probably noticed in particularly if you started implementing this already is
[00:06:53] started implementing this already is that we talked like last time a lot
[00:06:55] that we talked like last time a lot about using the advantage function so we
[00:06:57] about using the advantage function so we talked about how we're going to be doing
[00:06:59] talked about how we're going to be doing and in general for sort of um policy
[00:07:01] and in general for sort of um policy gradient we're often going to want an
[00:07:02] gradient we're often going to want an advantage function and you might wonder
[00:07:05] advantage function and you might wonder what we're going to plug in for that
[00:07:06] what we're going to plug in for that there are a lot of different choices for
[00:07:08] there are a lot of different choices for what the advantage function could be so
[00:07:10] what the advantage function could be so what we're going to talk about today is
[00:07:11] what we're going to talk about today is a particular choice um that was used in
[00:07:13] a particular choice um that was used in Po and that can be pretty
[00:07:15] Po and that can be pretty powerful so let's go back to last
powerful so let's go back to last lecture before we introduced PPO and talk
lecture before we introduced PPO and talk about the n-step estimators so in
[00:07:22] about the endstep estimators so in general in class sort of since the first
[00:07:24] general in class sort of since the first probably since the second lecture or so
[00:07:26] probably since the second lecture or so we've been talking uh second or third I
[00:07:28] we've been talking uh second or third I guess probably third lecture three we've
[00:07:30] guess probably third lecture three we've talking about this trade-off between
[00:07:32] talking about this trade-off between methods that bootstrap and use the
methods that bootstrap and use the Markov property such as temporal
Markov property such as temporal difference learning and methods that
difference learning and methods that don't leverage anything about the Markov
don't leverage anything about the Markov property like Monte Carlo so in
[00:07:43] property like Monte Carlo so in particular we've talked about cases
[00:07:45] particular we've talked about cases where this is sort of a temporal
[00:07:47] where this is sort of a temporal difference estimate where you just have
[00:07:48] difference estimate where you just have the immediate reward plus gamma times
[00:07:50] the immediate reward plus gamma times the you immediately bootstrap you plug
[00:07:52] the you immediately bootstrap you plug in your estimate of the value of the
in your estimate of the value of the next state versus ones like A-infinity
next state versus ones like A-infinity here where basically we have a Monte
[00:07:59] here where basically we have a Monte Carlo estimate obviously you can't
[00:08:01] Carlo estimate obviously you can't actually go out to Infinity you'd have
[00:08:02] actually go out to Infinity you'd have to have like um episodic cases but we
[00:08:04] to have like um episodic cases but we always we we're generally have been
[00:08:06] always we we're generally have been focusing on episodic cases so far in the
[00:08:08] focusing on episodic cases so far in the class minus the value so the the blue
[00:08:11] class minus the value so the the blue there is just us subtracting off our
[00:08:13] there is just us subtracting off our current value so we have this Advantage
[00:08:16] current value so we have this Advantage estimate so we talked before about sort
[00:08:19] estimate so we talked before about sort of the trade-offs between these
[00:08:20] of the trade-offs between these different estimates and how some of them
[00:08:21] different estimates and how some of them would have higher bias and some of them
[00:08:23] would have higher bias and some of them would have higher
[00:08:24] would have higher variance and what I'm going to talk
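The family of k-step advantage estimators being compared here can be sketched as follows (a simplified illustration with assumed list inputs, not code from the course):

```python
def k_step_advantage(rewards, values, t, k, gamma=0.99):
    """k-step advantage estimate at time t:

    A^(k)_t = r_t + gamma r_{t+1} + ... + gamma^(k-1) r_{t+k-1}
              + gamma^k V(s_{t+k}) - V(s_t)

    k = 1 is the TD-style estimate (low variance, high bias); large k
    approaches the Monte Carlo estimate (low bias, high variance).
    `values` holds the current value estimates V(s_0), ..., V(s_T).
    """
    ret = sum(gamma ** i * rewards[t + i] for i in range(k))
    return ret + gamma ** k * values[t + k] - values[t]

# Toy trajectory: the two extremes of the bias/variance trade-off.
rewards = [1.0, 1.0, 1.0]
values = [2.0, 1.5, 1.0, 0.0]
print(k_step_advantage(rewards, values, t=0, k=1))
print(k_step_advantage(rewards, values, t=0, k=3))
```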
[00:08:26] variance and what I'm going to talk about now is sort of a way to use the to
[00:08:29] about now is sort of a way to use the to try to get to a new form of Advantage
[00:08:31] try to get to a new form of Advantage estimation um and it also involves a
[00:08:33] estimation um and it also involves a technique that comes up a lot in
[00:08:35] technique that comes up a lot in reinforcement learning so it's a useful
[00:08:36] reinforcement learning so it's a useful thing to be aware of so what we're going
[00:08:38] thing to be aware of so what we're going to do is we're going to define something
[00:08:39] to do is we're going to define something called Delta VT let me just highlight
[00:08:43] called Delta VT let me just highlight this and what this is is it's just
[00:08:47] this and what this is is it's just essentially here a TD backup so this is
[00:08:50] essentially here a TD backup so this is just what we've often seen where we have
[00:08:52] just what we've often seen where we have our immediate reward plus gamma * V of
[00:08:55] our immediate reward plus gamma * V of the next state so notice here that we've
[00:08:57] the next state so notice here that we've got a Time step t so that's going to be
[00:09:01] got a Time step t so that's going to be important and then this is just the
[00:09:03] important and then this is just the value our current estimate of the value
[00:09:05] value our current estimate of the value of the
[00:09:06] of the state so this should look like normal
[00:09:08] state so this should look like normal this is sort of the same Advantage as up
[00:09:14] here but note I could plug in different
[00:09:16] here but note I could plug in different T's here okay so we've defined this new
[00:09:18] T's here okay so we've defined this new thing called Delta so when we use this
[00:09:21] thing called Delta so when we use this new Delta then we would say that
[00:09:24] new Delta then we would say that our advantage with the advantage
[00:09:27] our advantage with the advantage function we've seen before we use a TDS
[00:09:29] function we've seen before we use a TDS is just exactly equal to Delta
[00:09:32] is just exactly equal to Delta VT why is it V well V is specifying here
[00:09:36] VT why is it V well V is specifying here what value function we're going to plug
[00:09:38] what value function we're going to plug in and this is just exactly equal to
[00:09:40] in and this is just exactly equal to this so same as what we saw
[00:09:42] this so same as what we saw before now the next thing that we can
[00:09:44] before now the next thing that we can say is well actually what is the
[00:09:46] say is well actually what is the advantage for if we use this two-step
[00:09:50] advantage for if we use this two-step estimate that is this okay so we've got
[00:09:53] estimate that is this okay so we've got this
[00:09:54] this expression and that is exactly equal to
expression and that is exactly equal to Delta VT plus gamma * Delta V t +
Delta VT plus gamma * Delta V t + 1 so I'm just going to write that out
[00:10:03] one so I'm just going to write that out so we can see it for for one second for
[00:10:05] so we can see it for for one second for why that's true okay so we had RT +
[00:10:09] why that's true okay so we had RT + gamma V of St +
[00:10:12] gamma V of St + 1 minus V of St that's this
[00:10:16] 1 minus V of St that's this term plus gamma R of t + 1 because
[00:10:21] term plus gamma R of t + 1 because notice
notice here this is t + 1 okay plus gamma V
here this is t + 1 okay plus gamma V of s t + 2 minus V of s t +
[00:10:37] 1 so when we do this in this way what
[00:10:41] 1 so when we do this in this way what we'll end up canceling here is the V of
[00:10:43] we'll end up canceling here is the V of St + 1 let me just so not we have a
[00:10:48] St + 1 let me just so not we have a gamma here in the front so we have this
[00:10:50] gamma here in the front so we have this term cancels with this term oop
[00:10:53] term cancels with this term oop sorry this term this term
[00:10:59] let me just be a little careful
[00:11:02] let me just be a little careful here we're going to have this term
[00:11:05] here we're going to have this term cancels with this term good we're going
[00:11:08] cancels with this term good we're going to get the two rewards RT and RT + one
[00:11:12] to get the two rewards RT and RT + one second one so that's here and here and
second one so that's here and here and then we have gamma squared * V of St +
then we have gamma squared * V of St + 2 okay so I just wrote out exactly what
[00:11:21] 2 okay so I just wrote out exactly what the definition was of Delta and Delta t
[00:11:23] the definition was of Delta and Delta t + 1 and essentially the important thing
[00:11:25] + 1 and essentially the important thing to see here is that one of the terms
to see here is that one of the terms cancels okay and so that's why we
cancels okay and so that's why we ended up getting exactly the same
[00:11:31] ended up getting exactly the same expression for a 2 as we had
[00:11:36] expression for a 2 as we had before and you can repeat
[00:11:39] before and you can repeat this and what will happen is kind of all
[00:11:41] this and what will happen is kind of all of those intermediate
[00:11:43] of those intermediate terms these things where you were
[00:11:46] terms these things where you were bootstrapping will cancel along the way
[00:11:50] bootstrapping will cancel along the way so this is why it's called a telescoping
[00:11:51] so this is why it's called a telescoping sum because here we're adding something
[00:11:56] sum because here we're adding something that at the next round we're going to
[00:11:57] that at the next round we're going to subtract from this the next one and so
[00:12:00] subtract from this the next one and so those are going to cancel and so you
[00:12:01] those are going to cancel and so you just get end up getting all the
[00:12:03] just get end up getting all the discounted sum of rewards plus the last
[00:12:07] discounted sum of rewards plus the last term who's seen telescoping sums
term who's seen telescoping sums before okay so maybe about half the people here
before okay so maybe about half the people here okay so for some of you this is really
okay so for some of you this is really familiar for some this might be new it's
[00:12:16] familiar for some this might be new it's a useful technique to see to know about
[00:12:18] a useful technique to see to know about because it comes up in a lot of the
[00:12:20] because it comes up in a lot of the reinforcement learning
[00:12:21] reinforcement learning proofs
[00:12:23] proofs yeah just in comparing this to the
[00:12:26] yeah just in comparing this to the equation for a hat sub T2 above yeah are
equation for a hat sub T2 above yeah are we basically saying that gamma squared V of St
we basically saying that gamma squared V of St + 2 is equal to gamma V of St + 1 no
+ 2 is equal to gamma V of St + 1 no we're just we're just actually literally
[00:12:40] we're just we're just actually literally canceling it good question so when we
[00:12:42] canceling it good question so when we write out this expression there's a
[00:12:44] write out this expression there's a gamma in front and because there's a t +
[00:12:47] gamma in front and because there's a t + 1 here this will become s of t + 2 this
[00:12:50] 1 here this will become s of t + 2 this will become s of t + 1 and there's a
will become s of t + 1 and there's a gamma so this will be a gamma times
gamma so this will be a gamma times minus V of St + 1 and that will cancel
minus V of St + 1 and that will cancel with the V of St plus one in the
[00:13:01] with the V of St plus one in the previous
[00:13:03] one yeah
[00:13:06] one yeah yeah like on top and on the bottom are
[00:13:10] yeah like on top and on the bottom are they equivalent then they're identical
[00:13:12] they equivalent then they're identical identical okay yeah yeah that was a good
[00:13:14] identical okay yeah yeah that was a good question so yeah I've written this in
[00:13:16] question so yeah I've written this in this notation now with the like Delta
[00:13:17] this notation now with the like Delta notation but this is exactly equal to
[00:13:20] notation but this is exactly equal to this which is the same so these are I
[00:13:22] this which is the same so these are I mean these are identical o
[00:13:25] mean these are identical o uh yeah so thanks so there's a typo
[00:13:28] uh yeah so thanks so there's a typo there that might be the question
[00:13:30] there that might be the question yeah just highlight that so this should
[00:13:32] yeah just highlight that so this should be
[00:13:40] T yeah in general we're always
[00:13:42] T yeah in general we're always bootstrapping with the final time step
bootstrapping with the final time step thanks for catching
[00:13:47] thanks for cing that yeah so these all end up being
[00:13:49] that yeah so these all end up being exactly equivalent we've just Rewritten
exactly equivalent we've just Rewritten it in terms of this Delta notation
it in terms of this Delta notation okay so these are just different
okay so these are just different n-step estimators we haven't done
anything new yet I mean we've
[00:13:59] anything new yet in terms of I we Rewritten things but we haven't
[00:14:00] Rewritten things but we haven't introduced any new type of estimator
[00:14:02] introduced any new type of estimator these are just different Advantage
[00:14:03] these are just different Advantage functions and as you might imagine you
[00:14:05] functions and as you might imagine you know the first one is going to be uh low
[00:14:08] know the first one is going to be uh low variance High bias the last one is going
[00:14:10] variance High bias the last one is going to be low bias High
[00:14:13] to be low bias High variance so generalized Advantage
[00:14:16] variance so generalized Advantage estimation involves taking a weighted
estimation involves taking a weighted combination of k-step
combination of k-step estimators so we had
[00:14:24] estimators so we had here this was just lots of different
[00:14:26] here this was just lots of different estimators and you might say well how do
[00:14:28] estimators and you might say well how do I pick among them I'm not sure I'm going
[00:14:30] I pick among them I'm not sure I'm going to pick among them I'm just going to
[00:14:32] to pick among them I'm just going to take a weighted combination of all of
[00:14:34] take a weighted combination of all of them and in particular you could just
[00:14:36] them and in particular you could just take um an average weighted
[00:14:38] take um an average weighted combination so let's just step through
[00:14:41] combination so let's just step through this a little bit just to see how we do
[00:14:42] this a little bit just to see how we do this so what we what this is saying here
[00:14:45] this so what we what this is saying here is I'm going to take the one step
[00:14:48] is I'm going to take the one step estimate of um Advantage estimator plus
[00:14:51] estimate of um Advantage estimator plus Lambda I've introduced a new parameter
Lambda I've introduced a new parameter here Lambda times the two-step one Lambda squared
here Lambda times the two-step one Lambda squared times the three-step one I'm just saying like okay
Lambda squared times the three-step one I'm just saying like okay well why don't I use all of my
[00:14:58] well why don't I use all of my estimators and I'm going to weigh my
[00:15:00] estimators and I'm going to weigh my different
[00:15:02] different estimators so now what I'm going to do
[00:15:04] estimators so now what I'm going to do is I'm going to I've next just written
[00:15:05] is I'm going to I've next just written this in this the Delta notation that we
[00:15:07] this in this the Delta notation that we saw on the previous
[00:15:09] saw on the previous slide okay and now what I'm going to see
[00:15:11] slide okay and now what I'm going to see is that some of these terms appear a lot
[00:15:14] is that some of these terms appear a lot of times so there's this this term
[00:15:16] of times so there's this this term appears in all of the terms in all of
[00:15:19] appears in all of the terms in all of the
[00:15:20] the advantages the second one only appears
[00:15:23] advantages the second one only appears in the second to the last one
[00:15:25] in the second to the last one Etc so I'm going to collect terms
[00:15:29] Etc so I'm going to collect terms so I'm going to write this as
[00:15:35] follows and this was introduced in a
follows and this was introduced in a paper previous to PPO and then PPO builds
paper previous to PPO and then PPO builds on
it okay I just noticed what I did
it okay I just noticed what I did there I've noticed that I had this term
there I've noticed that I had this term
[00:16:00] so I'm just taking all of those terms
[00:16:01] so I'm just taking all of those terms and I'm noticing how many lambdas I had
[00:16:03] and I'm noticing how many lambdas I had in front of them
[00:16:05] okay and then I'm going to have delta t
[00:16:09] plus 1
[00:16:12] times lambda times 1
[00:16:18] plus I'll write it
[00:16:25] differently so this term is going to
[00:16:27] start with
[00:16:29] lambda plus lambda squared because it's in the
[00:16:32] second through all the rest of the
[00:16:35] terms
[00:16:44] plus all right so I'm just rearranging
[00:16:46] plus all right so I'm just rearranging the sums and now when I look at this I
[00:16:48] the sums and now when I look at this I realize that I've got a geometric
[00:16:50] series so this is just going to be equal
[00:16:53] to 1 minus lambda times delta t over 1 minus lambda
[00:17:00] plus let me just make sure I got that
[00:17:06] right I'll put it on the next page just to
[00:17:08] make sure I didn't make a mistake there should have
[00:17:10] been a gamma here let me just
[00:17:12] put
[00:17:21] gamma right cleanly in the next slide so
[00:17:23] this will be clearer okay so we
[00:17:27] also had a gamma here from before okay
[00:17:30] so gamma squared and then what you do is
[00:17:33] you realize this is a geometric series
[00:17:35] that goes to 1 over 1 minus
[00:17:37] lambda and then this is gamma times lambda over
[00:17:41] 1 minus lambda and this is gamma squared times lambda squared over 1 minus lambda
[00:17:44] I'm just using the fact that this is a
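The collected-terms algebra on the board can be reconstructed as follows, in the notation of the generalized advantage estimation paper; this block is a reconstruction, not a verbatim slide:

```latex
\hat{A}^{\mathrm{GAE}(\gamma,\lambda)}_t
  = (1-\lambda)\left(\hat{A}^{(1)}_t + \lambda\,\hat{A}^{(2)}_t + \lambda^2\,\hat{A}^{(3)}_t + \cdots\right)
  = (1-\lambda)\left(\delta_t\,\tfrac{1}{1-\lambda}
      + \gamma\,\delta_{t+1}\,\tfrac{\lambda}{1-\lambda}
      + \gamma^2\,\delta_{t+2}\,\tfrac{\lambda^2}{1-\lambda} + \cdots\right)
  = \sum_{l=0}^{\infty} (\gamma\lambda)^l\,\delta_{t+l},
\qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)
```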
[00:17:45] I'm just using the fact that this is a geometric series it's fine if you
[00:17:47] geometric series it's fine if you haven't seen this if you've done real
[00:17:49] haven't seen this if you've done real analysis you've seen this
[00:17:51] analysis you've seen this before and that means that the term
[00:17:53] before and that means that the term below just looks like the
[00:17:56] below just looks like the following and this was introduced by a
[00:17:58] following and this was introduced by a previous paper and the idea there was to
[00:18:01] previous paper and the idea there was to say well why don't we just take kind of
[00:18:02] say well why don't we just take kind of a weighted average over all these
[00:18:03] a weighted average over all these different terms that have different
[00:18:05] different terms that have different biases and variances and then we can
[00:18:07] biases and variances and then we can reexpress it compactly we don't actually
[00:18:10] reexpress it compactly we don't actually have to compute all of the advantages
[00:18:11] have to compute all of the advantages separately and track them we just are
[00:18:13] separately and track them we just are going to keep track of these deltas and
[00:18:15] going to keep track of these deltas and these Deltas are pretty easy to keep
[00:18:17] these Deltas are pretty easy to keep track of because those are just like
[00:18:19] track of because those are just like these one-step differences between so
[00:18:21] these one-step differences between so just remind ourself what what the Deltas
[00:18:24] just remind ourself what what the Deltas look like the Deltas are pretty easy to
[00:18:25] look like the Deltas are pretty easy to keep track of because they're just the
[00:18:27] difference between your previous estimate
[00:18:29] and your new reward plus gamma V of s t plus
[00:18:32] one so you can kind of just keep track
[00:18:34] of those over time and then you're just
[00:18:36] weighing
[00:18:38] them okay and our derivation just
[00:18:41] them okay and our derivation just followed and so then you just sum these
[00:18:42] followed and so then you just sum these up and you essentially have different
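The delta bookkeeping described here can be sketched in code. This is not from the lecture; it is an illustrative implementation under the standard GAE conventions, with the function name and array layout being my own assumptions. It uses the backward recursion A_t = delta_t + gamma * lambda * A_{t+1}, which is equivalent to the weighted sum over n-step advantage estimators:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Sketch of generalized advantage estimation from one-step deltas.

    rewards: r_0 .. r_{T-1} for one trajectory.
    values:  V(s_0) .. V(s_T); the final entry bootstraps the tail.
    Uses the backward recursion A_t = delta_t + gamma * lam * A_{t+1}.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        # one-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

Setting lam=0 recovers exactly the one-step TD errors, while lam=1 gives the Monte Carlo return minus the value baseline.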
[00:18:44] up and you essentially have different weights okay all right so let's think
[00:18:48] weights okay all right so let's think about what this means in terms of uh
[00:18:50] about what this means in terms of uh bias and variance as we often like to in
[00:18:52] bias and variance as we often like to in terms of the estimators we're using so
[00:18:55] terms of the estimators we're using so this is trying to be an estimate of the
[00:18:56] this is trying to be an estimate of the advantage and we'll do a check your
[00:18:57] advantage and we'll do a check your understanding now about about how
[00:19:00] understanding now about how different choices affect this so this is a
[00:19:04] discount factor so this is the discount
[00:19:06] factor and your choice of lambda which
[00:19:09] is how much you're weighting earlier ones
[00:19:11] versus later ones so GAE is generally a
[00:19:14] function of these two
[00:19:16] function of these two hyperparameters and let's think for a
[00:19:18] hyperparameters and let's think for a second about what this does for bias and
[00:19:20] second about what this does for bias and variance and how it relates to
[00:19:27] TD
[00:20:20] [inaudible student question]
[00:20:24] no you can okay good because otherwise the
[00:20:27] TAs are helping make these just check with
[00:20:40] them and feel free to go to the previous
[00:20:43] them and feel free to go to the previous slide look at the
[00:20:53] definitions and the reason these all are
[00:20:55] definitions and the reason these all are really important is because if you get
[00:20:56] really important is because if you get better estimates of the advantage you're
[00:20:57] better estimates of the advantage you're going to get better estimates of the
[00:20:58] going to get better estimates of the gradient if you get better estimates of
[00:21:00] gradient if you get better estimates of the gradient can hopefully use less data
[00:21:02] the gradient can hopefully use less data to get to that really good
[00:21:04] policy so that's why people spent quite
[00:21:06] a lot of effort thinking about how
[00:21:09] with sort of deep neural networks
[00:21:10] with sort of using deep neural networks both either for the advantage or for the
[00:21:12] both either for the advantage or for the policy how can we really quickly get
[00:21:14] good estimates is GAE at lambda equal to zero even defined
[00:21:19] there's a zero to
[00:21:21] the zero there there is yeah because in
[00:21:26] the first term there will be zero to the
[00:21:29] zero it shouldn't be to that
[00:21:32] exponent oh you mean
[00:21:34] here
[00:21:36] here um you in that case you would just plug
[00:21:40] um you in that case you would just plug zero in up here okay and then that would
[00:21:43] zero in up here okay and then that would disappear and you would just get
[00:21:56] this all right why don't you turn to your neighbor
[00:21:58] and see if you got the same
[00:22:02] answer use the definition
[00:22:19] [inaudible student discussion]
[00:24:14] okay so thanks for the good question
[00:24:16] okay so thanks for the good question I'll make sure to clarify like the
[00:24:18] I'll make sure to clarify like the notation if Lambda is equal to
[00:24:22] notation if Lambda is equal to oopsie I'll make sure to clarify um that
[00:24:25] oopsie I'll make sure to clarify um that let's just say if Lambda is equal to
[00:24:29] let's just say if Lambda is equal to equal to zero
[00:24:31] equal to zero look at first
[00:24:35] line
[00:24:37] line okay so I'm I'll make sure to clarify
[00:24:41] okay so I'm I'll make sure to clarify that in next year's slides so I in that
[00:24:44] that in next year's slides so I in that case everything drops off so if Lambda
[00:24:46] case everything drops off so if Lambda is equal to zero all the other
[00:24:48] is equal to zero all the other estimators go away basically you have no
[00:24:51] estimators go away basically you have no um no no weight on all of the advantage
[00:24:54] um no no weight on all of the advantage estimators that are two or more and so
[00:24:56] estimators that are two or more and so it just becomes the first term and the
[00:24:58] it just becomes the first term and the first term is the TD estimator
[00:25:01] first term is the TD estimator if lambda equals one shouldn't the entire thing
[00:25:04] become
[00:25:05] zero if lambda is one if
[00:25:09] lambda is one yeah so if lambda is
[00:25:11] one well then you're also kind of
[00:25:13] summing an infinite number of terms
[00:25:15] too um but yes well so this is this
[00:25:19] too um but yes NE well so this is this is this is true so B is true and the
[00:25:23] is this is true so B is true and the second one we'll see on the next slide I
[00:25:25] second one we'll see on the next slide I mean this is a it's a little weird to
[00:25:27] write down in this full infinite
[00:25:29] horizon because you can't ever do
[00:25:31] Monte Carlo returns with infinite horizon
[00:25:33] so but it's good you guys have good
[00:25:35] so but it's a good you guys have good questions I'll make sure to like clarify
[00:25:38] what happens in like sort of the H
[00:25:40] equals
[00:25:41] infinity one and the lambda equals zero cases
[00:25:44] just so that like the infinities are
[00:25:47] clear um but this is certainly false
[00:25:49] because this is not TD(0) because we'd
[00:25:51] have a whole bunch of terms here and
[00:25:52] there'd be this weird weighting in that
[00:25:54] case okay um and then because then you
[00:25:58] case Okay um and then the cuz then you sort of have to weigh how much is this
[00:26:00] sort of have to weigh how much is this term versus the Infinity of the other
[00:26:01] term versus the Infinity of the other like the Zero versus the Infinity of the
[00:26:03] like the Zero versus the Infinity of the other term so I'll make sure to clarify
[00:26:05] other term so I'll make sure to clarify that um D is also true because once this
[00:26:08] that um D is also true because once this is a TD estimate then we generally know
[00:26:10] TD estimates have higher
[00:26:12] TD estimates have higher bi have higher bias and lower
[00:26:14] bias and lower variance okay
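To summarize the two limiting cases being discussed, under the usual GAE definition (this summary is added for clarity, not transcribed from the board):

```latex
\lambda = 0:\quad \hat{A}_t = \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)
  \quad\text{(the TD estimator: higher bias, lower variance)}
\\[4pt]
\lambda = 1:\quad \hat{A}_t = \sum_{l=0}^{\infty} \gamma^l \delta_{t+l}
  = \sum_{l=0}^{\infty} \gamma^l r_{t+l} - V(s_t)
  \quad\text{(Monte Carlo return minus baseline: lower bias, higher variance)}
```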
[00:26:16] variance okay so are
[00:26:18] so those are true now note in general you would think
[00:26:21] true now note in general you would think therefore you want to put um Lambda
[00:26:23] therefore you want to put um Lambda somewhere in the
[00:26:24] somewhere in the middle because it will be balancing
[00:26:27] middle because it will be balancing between bias and
[00:26:28] between bias and variant but what they do in PO is a
[00:26:31] variance but what they do in PPO is a
[00:26:32] little bit different but it's related to this so this is what the generalized
[00:26:34] this so this is what the generalized Advantage estimation is um we kind of do
[00:26:37] this like exponential weighting over lots
[00:26:39] this like exponential waiting over lots of different Advantage estimators but
[00:26:40] of different Advantage estimators but without actually having to have separate
[00:26:43] without actually having to have separate copies in memory of all the advantage
[00:26:44] copies in memory of all the advantage estimators so that's why this is nice um
[00:26:49] estimators so that's why this is nice um so what we're going to do now is see
[00:26:52] what they actually did in PPO
[00:26:54] what we actually they actually did in po which is instead of doing all of these
[00:26:57] we're just going to do a finite
[00:26:59] we're just going going to do a finite number
[00:27:01] number so what they're going to do is a
[00:27:03] so what they're going to do is a truncated version where they use this
[00:27:06] truncated version where they use this but they only go up to a certain point
[00:27:10] and so they're not going to go on
[00:27:12] forever there's multiple benefits to
[00:27:14] Forever there's multiple benefits to this including the fact that they're
[00:27:15] this including the fact that they're going to be in episodic domains and what
[00:27:17] going to be in episodic domains and what this means is that let's say Your
[00:27:19] this means is that let's say Your Horizon is very long but not Infinity so
[00:27:22] Horizon is very long but not Infinity so your horizon might be something like
[00:27:23] 2,000 steps for your Mountain Car or
[00:27:25] something like that um you might pick T
[00:27:28] equal to 200 and what that would mean
[00:27:31] equal to be 200 and what that would mean is so remember the benefit one of the
[00:27:32] is so remember the benefit one of the benefits of temporal difference learning
[00:27:33] benefits of temporal difference learning compared to Monte Carlo is that you can
[00:27:35] compared to Monte Carlo is that you can update your estimate after every step um
[00:27:39] update your estimate after every step um the problem with the advantage estimator
[00:27:40] the problem with the advantage estimator that is defined here is you still have
[00:27:42] that is defined here is you still have to wait till the very end to update your
[00:27:45] to wait till the very end to update your EST like to update your estimator
[00:27:46] EST like to update your estimator because you need your advantage with
[00:27:49] because you need your advantage with near infinity and then you're going to
[00:27:50] near infinity and then you're going to weigh all of them so you don't actually
[00:27:52] weigh all of them so you don't actually want to do that in practice so um one
[00:27:54] thing that PPO proposes is to say well
[00:27:57] thing that PO proposes is to say well why don't we just do a truncated version
[00:27:58] why don't we just do a truncated version and that means every T steps like Big T
[00:28:01] and that means every T steps like Big T steps so let's say t is 200 every 200
[00:28:03] steps so let's say t is 200 every 200 time steps you can compute this you
[00:28:07] time steps you can compute this you compute your your new sort of weighted
[00:28:09] compute your your new sort of weighted um average EST like your advantage
[00:28:11] um average EST like your advantage estimator and then
[00:28:14] estimator and then update so you can think of sort of the
[00:28:17] update so you can think of sort of the the Big T here is determining how long
[00:28:20] the Big T here is determining how long you have to go before you can make an
[00:28:22] update so that's what they do in PPO um
[00:28:25] update so that's what they do in po um they use this truncated generalized
[00:28:27] they use this truncated generalized advantage
[00:28:29] advantage estimation um in order to get better
[00:28:33] estimation um in order to get better estimators okay anybody have any
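A direct, non-recursive sketch of the truncated estimator; this is again not taken from the lecture or from any PPO codebase, and the helper name and conventions are assumptions. It computes the finite sum of (gamma*lambda)^l weighted deltas over one length-T segment, bootstrapping the truncated tail with V(s_T):

```python
import numpy as np

def truncated_gae(rewards, values, gamma=0.99, lam=0.95):
    """Truncated generalized advantage estimation over one segment.

    rewards: r_0 .. r_{T-1} for a length-T window (e.g. T = 200).
    values:  V(s_0) .. V(s_T); values[T] bootstraps the truncated tail,
    so you can update every big-T steps instead of waiting for episode end.
    """
    T = len(rewards)
    # one-step TD errors: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    deltas = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(T)]
    adv = np.zeros(T)
    for t in range(T):
        # finite, truncated version of the infinite (gamma * lam)^l sum
        adv[t] = sum((gamma * lam) ** l * deltas[t + l] for l in range(T - t))
    return adv
```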
[00:28:35] estimators okay anybody have any questions about that before we move on
[00:28:36] questions about that before we move on to going back to this question of
[00:28:38] to going back to this question of monotonic
[00:28:44] improvement okay so now let's go on to
[00:28:46] another important feature of PPO which is
[00:28:49] another important feature of Po which is it's really sort of in some ways going
[00:28:51] it's really sort of in some ways going backwards but I wanted to make sure to
[00:28:52] backwards but I wanted to make sure to go through the algorithm for last time
[00:28:54] go through the algorithm for last time so that you guys could start working on
[00:28:56] so that you guys could start working on implementation but if I think uh and as
[00:28:59] implementation but if I think uh and as in many papers the theory is a little
[00:29:01] in many papers the theory is a little bit decoupled from what's actually done
[00:29:03] bit decoupled from what's actually done um but it's sort of serves as motivation
[00:29:06] um but it's sort of serves as motivation so I think it's useful to go back to the
[00:29:08] so I think it's useful to go back to the bound that was proposed there that
[00:29:09] bound that was proposed there that helped Inspire their algorithm and think
[00:29:11] helped Inspire their algorithm and think about what it actually um implies about
[00:29:14] about what it actually um implies about what happens when we do
[00:29:15] what happens when we do updates okay so remember that um and as
[00:29:19] updates okay so remember that um and as you're proving right now for homework 2
[00:29:21] you're proving right now for homework 2 uh remember that what we do in um what
[00:29:24] they were thinking of doing in PPO was to
[00:29:27] they were thinking of doing in po was to say we want to be able to use our old
[00:29:29] say we want to be able to use our old data from a policy pi to estimate the
[00:29:32] data from a policy pi to estimate the performance of policy Pi Prime but the
[00:29:35] performance of policy Pi Prime but the problem is is that in general that's
[00:29:36] problem is is that in general that's going to induce a different state
[00:29:37] distribution and so we made this
[00:29:41] approximation and said let's just ignore
[00:29:42] approximation and said let's just ignore the difference in the state
[00:29:44] the difference in the state distributions and that's great because
[00:29:46] distributions and that's great because now we can use our old data to estimate
[00:29:49] the value of our new policy using only our old
[00:29:51] data um because we always know what the
[00:29:54] data um because we always know what the actual policy parameter is but we don't
[00:29:55] actual policy parameter is but we don't actually have to gather new data from it
[00:29:57] and we called this sort of L pi of pi
[00:30:00] prime because pi prime is here but
[00:30:02] Prime because Pi Prime is here but everything else is being used by pi and
[00:30:05] everything else is being used by pi and what was proven was that if your two
[00:30:09] policies have a close KL divergence
[00:30:11] policies have a close scale Divergence in terms of the actual actions they take
[00:30:14] in terms of the actual actions they take then um you get this bound on
[00:30:17] then um you get this bound on performance okay so it said um this
[00:30:19] performance okay so it said um this approximation is not too bad and in
[00:30:23] approximation is not too bad and in particular we get this thing of sort of
[00:30:25] this monotonic improvement theorem saying
[00:30:27] that the value here so I'll just write
[00:30:29] that the value here so I'll just write down
[00:30:30] down that J of
[00:30:33] that J of pi is equal to V of pi some people use J some
[00:30:36] pial to B of Pi some people use J some people use V we mostly use V in the
[00:30:38] people use V we mostly use V in the class okay so that the value of your new
[00:30:41] class okay so that the value of your new policy Pi Prime minus the value of the
[00:30:43] policy Pi Prime minus the value of the old policy is greater than or equal to
[00:30:46] old policy is greater than or equal to this term that we had on the previous
[00:30:48] this term that we had on the previous slide so this L term this whole thing
[00:30:59] minus this sort of um error that we get
[00:31:02] minus this sort of um error that we get from the fact that we are approximating
[00:31:04] from the fact that we are approximating the state distribution by something
[00:31:06] the state distribution by something that's not true okay so we have this
[00:31:08] that's not true okay so we have this nice
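The bound being referenced, written out in the form given in the TRPO paper that PPO's analysis builds on (the constant C and epsilon are as defined there; this is a reconstruction of the slide, not a transcription):

```latex
V^{\pi'} - V^{\pi} \;\ge\; L_{\pi}(\pi')
  \;-\; C \,\max_s D_{\mathrm{KL}}\!\left(\pi(\cdot \mid s)\,\big\|\,\pi'(\cdot \mid s)\right),
\qquad C = \frac{4\,\epsilon\,\gamma}{(1-\gamma)^2},
\quad \epsilon = \max_{s,a}\bigl|A^{\pi}(s,a)\bigr|
```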
[00:31:10] nice bound and now what we're going to go
[00:31:12] bound and now what we're going to go through now is to show why if we
[00:31:15] through now is to show why if we maximize with respect to the right hand
[00:31:18] maximize with respect to the right hand side that we are guaranteed to improve
[00:31:20] side that we are guaranteed to improve over
[00:31:22] over Pi that shouldn't necessarily um well
[00:31:25] guess I'll ask so who has seen the
[00:31:26] sort of majorize-maximize algorithm
[00:31:29] before I wouldn't expect you to
[00:31:31] before I wouldn't expect you to but I'm sorry so this kind of goes back
[00:31:35] but I'm sorry so this kind of goes back a I think we've seen ideas related to
[00:31:38] a I think we've seen ideas related to this in um uh policy improvement from
[00:31:42] this in um uh policy improvement from the very beginning uh but this is
[00:31:44] the very beginning uh but this is different because we've got sort of
[00:31:45] different because we've got sort of these bounds up so what this is saying
[00:31:46] these bounds up so what this is saying is this is a lower bound right like this
[00:31:48] is this is a lower bound right like this says that the difference between these
[00:31:50] says that the difference between these two policies is at least as big as this
[00:31:53] two policies is at least as big as this term minus this
[00:31:55] term minus this term but it shouldn't and what we're
[00:31:58] term but it shouldn't and what we're actually going to propose to do is to
[00:31:59] actually going to propose to do is to say all right well if we try to pick a
[00:32:01] say all right well if we try to pick a pi Prime that um maximizes this lower
[00:32:04] bound does that actually mean that we're
[00:32:06] bound is that actually mean that we're going to be guaranteed to improve over
[00:32:09] going to be guaranteed to improve over pi and it shouldn't necessarily be
[00:32:11] immediately obvious that that would
[00:32:13] be true okay but it's going to turn out
[00:32:15] be true okay but it's going to turn out that that's the case so let's just go
[00:32:17] that that's the case so let's just go through the proof for that which is
[00:32:18] through the proof for that which is pretty cool all right so we're going to
[00:32:21] pretty cool all right so we're going to prove that if you do this if what you
[00:32:23] prove that if you do this if what you try to do is pick a policy Pi k + 1
[00:32:27] try to do is pick a policy Pi k + 1 which is the argmax of this lower bound
[00:32:30] which is the argmax of this lower bound that you will in fact get a new policy
[00:32:32] that's either a local optimum or is
[00:32:36] that's either the local Optima or is actually better than your previous
[00:32:37] actually better than your previous policy so that's that's the idea of what
[00:32:39] policy so that's that's the idea of what we're trying to do okay so note a few
[00:32:43] we're trying to do okay so note a few things okay so
[00:32:45] things okay so Pi so we're going to assume that we have
[00:32:48] Pi so we're going to assume that we have some Pi K that's our previous policy and
[00:32:51] some Pi K that's our previous policy and that it was
[00:32:55] feasible okay so like it's it's like a
[00:32:58] it's a well-defined policy it sums to one
[00:33:00] you know it satisfies all of those
[00:33:02] you know it satisfies all of those constraints okay so now
[00:33:12] let's in terms of okay so now recall
[00:33:18] let's in terms of okay so now recall that let's just do something a little
[00:33:21] that let's just do something a little silly but it's going to be useful okay
[00:33:23] silly but it's going to be useful okay so we're going to look at what L Pi of
[00:33:26] so we're going to look at what L Pi of Pi K of Pi K is okay that's this term
[00:33:30] Pi K of Pi K is okay that's this term let's just see what that is if we just
[00:33:31] let's just see what that is if we just plug in if we try to evaluate what that
[00:33:34] plug in if we try to evaluate what that term what that sort of expression is
[00:33:36] term what that sort of expression is when we plug in the same policy as what
[00:33:38] when we plug in the same policy as what we actually use to gather our data all
[00:33:40] we actually use to gather our data all right so remember that would just be
[00:33:42] equal to 1 over 1 minus gamma expected value
[00:33:46] over s according to d pi K just writing
[00:33:49] down the definition of what L
[00:33:54] is okay and this is going to be pi K of
[00:33:57] a given s divided by pi K of a given s times A pi
[00:34:07] K all right well this is just one right
[00:34:10] k all right well this is just one right so this this
[00:34:13] so this this cancels but the important thing to
[00:34:16] cancels but the important thing to remember here is that the advantage
[00:34:18] remember here is that the advantage function of a policy with respect to
[00:34:21] function of a policy with respect to itself is
[00:34:24] zero so if I take actions according to
[00:34:27] zero so if I take actions according to to the current policy and compare what
[00:34:31] to the current policy and compare what the value is to taking actions according
[00:34:33] the value is to taking actions according to that current policy and then acting
[00:34:36] to that current policy and then acting according to the current policy minus
[00:34:37] according to the current policy minus just first taking Pol actions according
[00:34:39] just first taking actions according
[00:34:42] to the current policy that difference
[00:34:45] is zero okay so I can just
[00:34:48] write that out too in case right so just remember like what we have here is we're
[00:34:50] remember like what we have here is we're going to have q Pi K of S A minus V Pi K
[00:34:59] going to have q Pi K of S A minus V Pi K of s
[00:35:01] of s okay but notice what we have here is
[00:35:04] okay but notice what we have here is that what are we taking you know what's
[00:35:05] that what are we taking you know what's the distribution we're taking these
[00:35:07] the distribution we're taking these actions it's exactly Pi
[00:35:10] actions it's exactly Pi K so Q Pi K if you first follow like I
[00:35:14] K so Q Pi K if you first follow like I can just write that out just in case
[00:35:16] can just write that out just in case it's helpful um is so this is like sum
[00:35:19] over a of pi K of a given s times Q pi K of s a which is just
[00:35:27] equal to V pi
[00:35:29] K it's like if you start taking this
[00:35:32] K it's like if you start taking this action and you follow the policy and
[00:35:34] action and you follow the policy and then you follow the policy from all
[00:35:35] then you follow the policy from all future time steps versus if you just
[00:35:37] follow the policy from now till forever
[00:35:39] that's exactly the same so that means
[00:35:41] that's exactly the same so that means that this is
[00:35:44] zero that's good that means that like
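The identity being used here, written out (a reconstruction of the board algebra):

```latex
\mathbb{E}_{a \sim \pi_k(\cdot \mid s)}\!\left[A^{\pi_k}(s,a)\right]
  = \sum_a \pi_k(a \mid s)\,Q^{\pi_k}(s,a) \;-\; V^{\pi_k}(s)
  = V^{\pi_k}(s) - V^{\pi_k}(s) = 0
```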
[00:35:46] zero that's good that means that like because if we think back to what this
[00:35:48] because if we think back to what this looks like that says that the difference
[00:35:50] looks like that says that the difference between the value of the policy and the
[00:35:52] between the value of the policy and the policy itself is zero so this bound is
[00:35:55] tight if you evaluate it with respect
[00:35:58] to itself there's no difference between
[00:36:00] the value of the policy and the policy
[00:36:01] itself oh I'll say the
[00:36:03] next thing so then because
[00:36:06] next thing so then because also D KL of Pi K Pi K is equal to
[00:36:14] zero there is no KL divergence
[00:36:17] between a um a distribution and itself
[00:36:20] between a um a distribution and itself is
[00:36:21] is zero okay all right so now let me just
[00:36:25] Okay, all right — so now let me just label these two: let's call this term one and this term two. So what we have here is that term one is zero for pi_k, and term two is zero for pi_k. All right — so that means one minus two has to be at least as great as zero. Does anyone want to say why that is? That's not immediately obvious from these steps yet; you have to make one more step, and it has to do with the argmax. Can someone say why — why does one minus two always have to be greater than or equal to zero, given the argmax? [Student: ...the argmax achieves zero exactly...] Yeah — so pi_k is an existence proof: there exists at least one policy for which the right-hand side is zero. We're taking an argmax over the whole policy space, so the argmax has to have value at least zero — hopefully better. And so that is exactly why: the argmax is at least as good as pi_k, because we're trying to maximize that.
[00:37:59] maximize that okay so what that means then is that so remember all of this
[00:38:02] then is that so remember all of this term here on the right hand
[00:38:04] term here on the right hand side was what we had here okay so this
[00:38:07] side was what we had here okay so this whole so we
[00:38:09] whole so we had J of Pi k + 1 minus J Pi K is
[00:38:16] had J of Pi k + 1 minus J Pi K is greater than equal
[00:38:19] to term oneus term two which we just
[00:38:23] to term oneus term two which we just showed is greater than equal to zero so
[00:38:25] showed is greater than equal to zero so what we just proved is that by
[00:38:27] what we just proved is that by maximizing with respect to our lower
[00:38:29] maximizing with respect to our lower bound we got a new policy that was at
[00:38:31] bound we got a new policy that was at least as good as the old
[00:38:33] least as good as the old policy which is really cool so that
[00:38:35] policy which is really cool so that means that using a lower bound on the
[00:38:39] means that using a lower bound on the gap between the um the performance of
[00:38:42] gap between the um the performance of the policies is sufficient to allow us
[00:38:44] the policies is sufficient to allow us to make monotonic Improvement is that
[00:38:47] to make monotonic Improvement is that super elegant so now we could have
[00:38:48] super elegant so now we could have something if we actually did this most
[00:38:50] something if we actually did this most policies do not do this and we'll talk
[00:38:51] policies do not do this and we'll talk about that in a second but um if you
[00:38:54] about that in a second but um if you actually did this you would get
[00:38:55] actually did this you would get monotonic Improvement and there's
[00:38:57] monotonic Improvement and there's certainly like like a number of domains
[00:38:58] certainly like like a number of domains would be really cool to get monotonic
[00:39:00] would be really cool to get monotonic Improvement so I think I've mentioned
[00:39:01] Improvement so I think I've mentioned education before but you could imagine
[00:39:03] education before but you could imagine Healthcare as well like a lot of cases
[00:39:05] Healthcare as well like a lot of cases if you're doing stuff in the Intensive
[00:39:06] if you're doing stuff in the Intensive Care Unit Etc you might people might be
[00:39:09] Care Unit Etc you might people might be kind of worried about doing random
[00:39:10] kind of worried about doing random exploration or Epsilon greedy but if you
[00:39:13] exploration or Epsilon greedy but if you could say we're only going to improve
[00:39:15] could say we're only going to improve when we know that we the new policy is
[00:39:17] when we know that we the new policy is at least as good as the old policy
[00:39:19] at least as good as the old policy that's likely to be a scenario that's
[00:39:21] that's likely to be a scenario that's much more
[00:39:23] much more palatable
[00:39:24] Okay, all right — let me draw this out a little bit more here. One of the elegant things about this is that we can restrict ourselves to parameterized policies. We can think about any sort of policy class — it could be a Gaussian, it could be a deep neural network — and as long as we initialize our policies in that class, and then keep doing the argmax over your policy class, you'll get this monotonic improvement. So it's really nice — it's really elegant that you can do it in this case.
[00:40:01] sort that you can do it in this case but unfortunately like many
[00:40:04] case but unfortunately like many beautiful Theory things it has some
[00:40:05] beautiful Theory things it has some limitations um so if you look at the
[00:40:08] limitations um so if you look at the actual so C is a constant and we haven't
[00:40:10] actual so C is a constant and we haven't went through where the constant is in
[00:40:11] went through where the constant is in class but you're welcome to look it up
[00:40:12] class but you're welcome to look it up in the
[00:40:13] in the paper when gamma's near one and what
[00:40:16] paper when gamma's near one and what gamma near one means is that we care
[00:40:18] gamma near one means is that we care almost as much about long Horizon
[00:40:19] almost as much about long Horizon rewards as we do about immedia
[00:40:22] rewards as we do about immedia rewards when is close to one gamma is
[00:40:24] rewards when is close to one gamma is pretty
[00:40:25] pretty large and so what what that means is
[00:40:28] large and so what what that means is that in general that second term um can
[00:40:30] that in general that second term um can make you be very conservative so why is
[00:40:33] make you be very conservative so why is that well that means you've got if C is
[00:40:35] that well that means you've got if C is really large that means that if your Pol
[00:40:39] really large that means that if your Pol your new policy takes actions that are
[00:40:41] your new policy takes actions that are quite different than your old policy
[00:40:42] quite different than your old policy you're going to have a really big
[00:40:43] you're going to have a really big penalty so what that basically does is
[00:40:45] penalty so what that basically does is it shrinks your step size it says um uh
[00:40:49] it shrinks your step size it says um uh this is going to be a term that is weigh
[00:40:51] this is going to be a term that is weigh a lot and unless you only make very
[00:40:53] a lot and unless you only make very small changes you could get a big
[00:40:55] small changes you could get a big penalty essentially because you're
[00:40:57] penalty essentially because you're saying I'm really not sure it might be
[00:40:59] saying I'm really not sure it might be that when I change my policy I end up
[00:41:01] that when I change my policy I end up with very different state distributions
[00:41:03] with very different state distributions and I don't know whether the rewards
[00:41:04] and I don't know whether the rewards would be there so what that means is
[00:41:06] would be there so what that means is that in practice if you actually try to
[00:41:08] that in practice if you actually try to use this equation directly like just
[00:41:10] use this equation directly like just straight from the theory the step sizes
[00:41:12] straight from the theory the step sizes are too small now when people say
[00:41:13] are too small now when people say they're too small that doesn't mean that
[00:41:15] they're too small that doesn't mean that like there's anything wrong with them it
[00:41:17] like there's anything wrong with them it just means it's going to take way too
[00:41:18] just means it's going to take way too long it just means that people are
[00:41:20] long it just means that people are impatient um but either impatient or you
[00:41:24] impatient um but either impatient or you know we're being very sample and
[00:41:25] know we're being very sample and efficient so it means that this is
[00:41:28] efficient so it means that this is reasonable it will hold you will get
[00:41:29] reasonable it will hold you will get monotonic Improvement it's just going to
[00:41:31] monotonic Improvement it's just going to take a really really long time and it's
[00:41:33] take a really really long time and it's not going to be feasible for a lot of
[00:41:34] not going to be feasible for a lot of things or it's not going to be practical
[00:41:37] things or it's not going to be practical and so that is what sort of helped
[00:41:38] and so that is what sort of helped motivate why you might want to tune the
[00:41:40] motivate why you might want to tune the KL penalty which we saw last time where
[00:41:42] KL penalty which we saw last time where you sort of like increase or decrease
[00:41:44] you sort of like increase or decrease how much you care about this
[00:41:45] how much you care about this penalty um or use a trust regen or use
[00:41:48] penalty um or use a trust regen or use the
[00:41:49] the clipping and so that's why we see sort
[00:41:51] clipping and so that's why we see sort of a difference between what's formally
[00:41:53] of a difference between what's formally guaranteed by if you were to just
[00:41:55] guaranteed by if you were to just directly use this lower bound versus
[00:41:56] directly use this lower bound versus what's actually done done in
[00:41:59] what's actually done done in practice but I think in terms of kind of
[00:42:01] practice but I think in terms of kind of the the the take homes from this part on
[00:42:04] the the the take homes from this part on policy gradient and and mppo is that
[00:42:06] policy gradient and and mppo is that it's really useful to know that you
[00:42:08] it's really useful to know that you don't just have to take one gradient
[00:42:10] don't just have to take one gradient step you can be much more data efficient
[00:42:12] step you can be much more data efficient you can play this trick of pretending
[00:42:14] you can play this trick of pretending there's no change in the state action
[00:42:15] there's no change in the state action dist or state distribution in order to
[00:42:17] dist or state distribution in order to take several gradient steps and that you
[00:42:20] take several gradient steps and that you can do that while still trying to maybe
[00:42:23] can do that while still trying to maybe approximately get monotonic Improvement
[00:42:26] approximately get monotonic Improvement poo does not guarante monotonic
[00:42:27] poo does not guarante monotonic Improvement but it can be pretty close
[00:42:30] Improvement but it can be pretty close um by thinking explicitly about um these
[00:42:32] um by thinking explicitly about um these lower bounds and how much uh your
[00:42:35] lower bounds and how much uh your performance might change and how much
[00:42:36] performance might change and how much essentially your state distribution
[00:42:38] essentially your state distribution might change so that when you're not
[00:42:39] might change so that when you're not confident in these
[00:42:41] confident in these approximations it also uses um
[00:42:43] approximations it also uses um generalized uh Advantage estimation
[00:42:45] generalized uh Advantage estimation which can be helpful and as I mentioned
[00:42:47] which can be helpful and as I mentioned before it's extremely popular you can
[00:42:49] before it's extremely popular you can use it in many many places um in part
[00:42:52] use it in many many places um in part because also you don't need your reward
[00:42:53] because also you don't need your reward function to be differentiable so people
[00:42:56] function to be differentiable so people have used it in lots of domain
[00:43:01] Right — and the other thing that I think is just useful to remember when we think about policy gradients is that you can also use them with actor-critic methods. So you can have, you know, deep neural networks to approximate your value function, and then use that for your advantage estimation, and combine them. And that's what most people do: they have some sort of critic — a.k.a. your value function estimate — and a policy. And these are only two: you know, REINFORCE and PPO are of course not the only policy gradient algorithms, but they are still used empirically a lot, and they're the backbone of many of the other ones — so if you read other papers, they'll be really useful baselines that you often see, or that people are building on.
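The critic-plus-policy combination described here can be sketched in a tiny tabular example — the setup, learning rates, and one-step advantage are illustrative assumptions, not the course's implementation:

```python
import numpy as np

# Minimal tabular actor-critic: the critic's value estimate feeds the actor's
# policy-gradient update through a one-step advantage.
n_states, n_actions = 3, 2
V = np.zeros(n_states)                    # critic: value estimates per state
logits = np.zeros((n_states, n_actions))  # actor: softmax policy parameters

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def actor_critic_step(s, a, r, s_next, gamma=0.99, lr_v=0.1, lr_pi=0.1):
    advantage = r + gamma * V[s_next] - V[s]   # one-step advantage estimate
    V[s] += lr_v * advantage                   # critic: TD update
    probs = softmax(logits[s])
    grad_log = -probs                          # d log pi(a|s) / d logits[s]
    grad_log[a] += 1.0
    logits[s] += lr_pi * advantage * grad_log  # actor: policy-gradient step
    return advantage

actor_critic_step(s=0, a=1, r=1.0, s_next=1)
print(softmax(logits[0]))  # action 1 becomes more likely after a positive advantage
```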
[00:43:45] All right — we're now going to go into imitation learning, but does anybody have any questions before we start?
[00:43:53] [Student] Yeah — on slide 22... just a general question, I guess. Sorry, the one before this. So does that mean that when the policy is more myopic — like gamma is near zero — then your step size will be larger, like you'll be able to improve to a greater extent? [Instructor] Yeah, that's a great question. So you're asking: is the converse good? If gamma is near zero, is this practical? I don't actually know off the top of my head what C looks like for gamma near zero. I don't think anybody uses this in practice — I think they always use the clipping or the trust region — so my guess is that it's still not practical. Oftentimes the C constants will be a function of Vmax — your maximum value — often scaled by 1 over (1 minus gamma), so it can really be quite enormous in many cases. So it might be that here they were particularly interested in cases where your horizon is pretty large. And I think one thing here, too, is that if we're in the episodic case, there's not really a good reason to think the discount factor shouldn't be near one, because you probably actually do just care about all the rewards — so they're probably mostly interested in domains where they didn't think a small gamma was reasonable. But yeah, it's a good question.
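To see how fast a constant like that blows up as gamma approaches one, here is a quick numerical illustration (the reward scale is made up; the point is only the growth rate):

```python
# A constant that scales like Vmax / (1 - gamma), where Vmax is itself at most
# r_max / (1 - gamma), grows roughly like 1 / (1 - gamma)^2 as gamma -> 1.
r_max = 1.0
for gamma in (0.5, 0.9, 0.99, 0.999):
    v_max = r_max / (1 - gamma)   # bound on any discounted return
    c = v_max / (1 - gamma)       # a C-like constant: Vmax scaled by 1/(1-gamma)
    print(f"gamma={gamma}: Vmax={v_max:.0f}, C~{c:.0f}")
# C goes from 4 at gamma=0.5 to 1,000,000 at gamma=0.999.
```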
[00:45:25] All right — let's talk about imitation learning. Okay, so as we've said before, in general in computer science we like to try to reduce things if we can — we like to reduce them to other problems that we know how to solve — and so imitation learning is going to be our attempt to try to do that, at least in certain ways, for all of reinforcement learning. Okay — and some of these slides come from some of my colleagues at Berkeley and at [inaudible].
[00:45:52] uh C so in general we're going to now be
[00:45:54] C so in general we're going to now be thinking about the case where we're not
[00:45:55] thinking about the case where we're not going to be gathering online so we saw
[00:45:58] going to be gathering online so we saw in po that we tried to reuse our data a
[00:46:00] in po that we tried to reuse our data a little bit more to take bigger steps but
[00:46:03] little bit more to take bigger steps but one thing you might wonder is well why
[00:46:04] one thing you might wonder is well why do I need any more data at all couldn't
[00:46:06] do I need any more data at all couldn't I just gather some data and then just
[00:46:07] I just gather some data and then just use that and maybe I don't need to
[00:46:09] use that and maybe I don't need to gather any new online data and we'll see
[00:46:12] gather any new online data and we'll see more ideas about that shortly um H but
[00:46:16] more ideas about that shortly um H but one case where you might think that
[00:46:17] one case where you might think that would be reasonable is what if you have
[00:46:19] would be reasonable is what if you have great demonstrations so you know you
[00:46:22] great demonstrations so you know you have instances of doctors making really
[00:46:24] have instances of doctors making really good decisions in the Intensive Care
[00:46:25] good decisions in the Intensive Care Unit or you have people people flying
[00:46:27] Unit or you have people people flying planes or you have people driving cars
[00:46:30] planes or you have people driving cars why couldn't we just use those examples
[00:46:32] why couldn't we just use those examples to directly learn decision policies okay
[00:46:36] to directly learn decision policies okay um and so the hope would be there is
[00:46:37] um and so the hope would be there is like if we just have those recordings
[00:46:39] like if we just have those recordings you know anytime someone's driving like
[00:46:40] you know anytime someone's driving like a Tesla or anyone's someone's driving an
[00:46:43] a Tesla or anyone's someone's driving an airplane could we just get those um sort
[00:46:46] airplane could we just get those um sort of State action um Pairs and tles and
[00:46:49] of State action um Pairs and tles and use that information to try to learn a
[00:46:51] use that information to try to learn a policy
[00:46:52] policy directly now one thing you could do
[00:46:55] directly now one thing you could do instead is to say like you'd have a
[00:46:57] instead is to say like you'd have a human in the loop but that's going to be
[00:46:59] human in the loop but that's going to be pretty expensive and so the hope would
[00:47:00] pretty expensive and so the hope would be that um instead we could just use the
[00:47:03] be that um instead we could just use the demonstrations people are already doing
[00:47:04] demonstrations people are already doing and that might be much more reasonable
[00:47:06] and that might be much more reasonable too in terms of people's
[00:47:11] time
[00:47:13] time so one thing in this case would be all
[00:47:15] so one thing in this case would be all right now maybe we're going to try to
[00:47:17] right now maybe we're going to try to just look directly at demonstrations and
[00:47:18] just look directly at demonstrations and that means we're not going to get
[00:47:19] that means we're not going to get anybody need to have anybody to label
[00:47:21] anybody need to have anybody to label things this is an example from sort of
[00:47:23] things this is an example from sort of um trying to understand what the reward
[00:47:26] um trying to understand what the reward function might be for
[00:47:27] function might be for driving so I guess I just say in
[00:47:29] driving so I guess I just say in addition to the fact that we often have
[00:47:31] addition to the fact that we often have data about people doing these sorts of
[00:47:33] data about people doing these sorts of complex tasks that we'd like to imitate
[00:47:35] complex tasks that we'd like to imitate it also might be in those tasks that
[00:47:36] it also might be in those tasks that it's really hard for someone to write
[00:47:38] it's really hard for someone to write down a reward function like maybe in
[00:47:40] down a reward function like maybe in this sorts of setting like you want to
[00:47:42] this sorts of setting like you want to avoid the water unless it's really
[00:47:44] avoid the water unless it's really really steep or really grally in which
[00:47:45] really steep or really grally in which case maybe your your truck or train can
[00:47:47] case maybe your your truck or train can go into the water or maybe like in
[00:47:49] go into the water or maybe like in general you want to avoid trees but
[00:47:51] general you want to avoid trees but again you know if it's really slippy and
[00:47:52] again you know if it's really slippy and muddy it's actually better and so it
[00:47:54] muddy it's actually better and so it might just be that it's really hard for
[00:47:56] might just be that it's really hard for people to write write down a reward
[00:47:57] people to write write down a reward function in this case but they could
[00:47:59] function in this case but they could drive it and sort of indicate that
[00:48:00] drive it and sort of indicate that implicit reward function and so again
[00:48:03] implicit reward function and so again that might be easier to
[00:48:05] that might be easier to gather this comes up in a lot of
[00:48:07] gather this comes up in a lot of different cases and people have thought
[00:48:09] different cases and people have thought about it certainly a lot for kind of um
[00:48:12] about it certainly a lot for kind of um manipulating heavy machinery or
[00:48:13] manipulating heavy machinery or manipulating cars or things like that um
[00:48:16] manipulating cars or things like that um but you know for things like driving and
[00:48:18] but you know for things like driving and and uh and uh parking and stuff those
[00:48:21] and uh and uh parking and stuff those are a lot of cases where people provide
[00:48:22] are a lot of cases where people provide those sorts of demonstrations where it
[00:48:24] those sorts of demonstrations where it might be hard to specify that reward
[00:48:25] might be hard to specify that reward function
[00:48:29] So the idea of learning from demonstrations is that you're going to get a number of expert demonstrations. Experts will demonstrate things — whether flying a helicopter, or manipulating something with a robotic arm through teleoperation (Dorsa Sadigh's group does a lot of this) — and that will give you a sequence of states and actions, not rewards. So you're just going to have trajectories of (state, action, next state) tuples. Okay — so we're not going to have any rewards anymore; everything's just going to be implicit in this case. And now we're going to assume that it's easier for people to do this, so they're just going to, hopefully, be able to provide these demonstrations — or maybe they already have. So what's the setup for the rest of today? The setup is that we still have a state space and an action space; we're going to assume there's some transition model; and there's a reward function, but we don't know it — or, there might be a reward function, but we don't know it. So there's nothing explicit here, there are no explicit rewards — and we have this set of demonstrations.
[00:49:39] have these set of demonstrations in Behavior cloning what
[00:49:41] demonstrations in Behavior cloning what we're going to do is just reduce this to
[00:49:42] we're going to do is just reduce this to supervised learning and try to learn a
[00:49:44] supervised learning and try to learn a mapping from States
[00:49:46] mapping from States actions we're just going to try to clone
[00:49:48] actions we're just going to try to clone the
[00:49:49] the behavior and then we're going to also
[00:49:51] behavior and then we're going to also see some about can we actually recover
[00:49:54] see some about can we actually recover the reward function that people might be
[00:49:56] the reward function that people might be using to generate their behavior and
[00:49:58] using to generate their behavior and then if we have that can we actually try
[00:50:00] then if we have that can we actually try to get a new good decision policy right
[00:50:03] to get a new good decision policy right but the first one is just kind of to try
[00:50:04] but the first one is just kind of to try to directly learn a
[00:50:06] to directly learn a policy so this is called Behavior
[00:50:09] policy so this is called Behavior cloning and essentially once you decide
[00:50:12] cloning and essentially once you decide to do this this is just offthe shelf
[00:50:14] to do this this is just offthe shelf supervised learning so now you treat it
[00:50:17] supervised learning so now you treat it is as you have a sequence of states and
[00:50:20] is as you have a sequence of states and actions from your expert
[00:50:25] actions from your expert trajectories and you can use whatever
[00:50:27] trajectories and you can use whatever tools and supervised learning you
[00:50:29] tools and supervised learning you want so just anything can be done there
[00:50:32] want so just anything can be done there I just to reduce it's strictly not made
[00:50:34] I just to reduce it's strictly not made it into a supervised learning
[00:50:37] it into a supervised learning problem and there were some really early
[00:50:39] problem and there were some really early successes so like Alvin from a very long
[00:50:42] successes so like Alvin from a very long time ago um and then 1982 learning to
[00:50:44] time ago um and then 1982 learning to fly in a flight simulator really early
[00:50:46] fly in a flight simulator really early on in sort of the history of
[00:50:47] on in sort of the history of reinforcement learning or like kind of
[00:50:48] reinforcement learning or like kind of the modern history of reinforcement
[00:50:49] the modern history of reinforcement learning people thought like could we
[00:50:51] learning people thought like could we just reduce this problem and we'll see
[00:50:53] just reduce this problem and we'll see in a second what's one of the challenges
[00:50:55] in a second what's one of the challenges that comes up when we do this
[00:50:57] that comes up when we do this but it certainly can be really helpful
[00:50:58] but it certainly can be really helpful so it's kind of fun to look at Alvin
[00:51:00] so it's kind of fun to look at Alvin this was you know yeah late ' 80s so um
[00:51:03] this was you know yeah late ' 80s so um but I think it's must have been kind of
[00:51:04] but I think it's must have been kind of amazing they were already thinking about
[00:51:06] amazing they were already thinking about cards then they're already thinking
[00:51:08] cards then they're already thinking about not so deep neural networks but
[00:51:09] about not so deep neural networks but they were thinking about neural networks
[00:51:11] they were thinking about neural networks I think this came out of CMU if I
[00:51:13] I think this came out of CMU if I remember right and they had this like
[00:51:14] remember right and they had this like tiny you know 30X 32 video input and
[00:51:18] tiny you know 30X 32 video input and they use this Rangefinder um and so they
[00:51:20] they use this Rangefinder um and so they were trying to you know use not so deep
[00:51:23] were trying to you know use not so deep neural networks to do Behavior cloning
[00:51:25] neural networks to do Behavior cloning for driving in the late ' 80s which is
[00:51:27] for driving in the late ' 80s which is pretty
[00:51:28] pretty awesome okay so so it can be can be done
[00:51:31] awesome okay so so it can be can be done pretty well
[00:51:33] In reality this is something people still try a lot. It's a really good baseline to try if you have good data, and I'll talk about some of the challenges with doing behavior cloning. But one thing now is that if you have a lot of data, a lot, a lot of demonstrations, imagine you have all the data from all the pilots: what they're actually doing, all of the different input actions, and you have that from, I don't know, all of United or something like that. If you have an enormous amount of data and a pretty sophisticated supervised learning technique, it can work really well, particularly if you use behavior cloning with an RNN or something that keeps track of the history.
[00:52:10] So while what I wrote here involved just states, like the last state, a Markov assumption over the state and the action, you don't have to do that. You could also say: I have state, action, state, and go from there to a1, or state, action, state, action, state. In general you could use something like a recurrent neural network, or anything that keeps track of long-term histories. It does not have to be a Markov representation, and that often can work very well.
[00:52:49] Again, it depends a lot on your domain. There's a nice paper from a few years ago at CoRL, which is one of the robotics conferences, where they looked at some of the important factors when you're doing offline learning for robot applications. So it doesn't always work well, but it can work really well, particularly if you use history.
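The value of history here can be made concrete with a small sketch (an illustration of mine, not from the lecture): if the observation is just a position, a single-observation "Markov" policy cannot distinguish a fast car from a stopped one, while a policy over the last two observations recovers a finite-difference velocity. The thresholds, labels, and trajectory below are made up for the example.

```python
def markov_policy(obs):
    # Sees only the current position: it cannot tell whether the car is moving.
    return "brake" if obs > 5.0 else "cruise"

def history_policy(obs_history):
    # Sees the last two positions: a finite-difference velocity is available.
    pos_prev, pos_now = obs_history[-2], obs_history[-1]
    velocity = pos_now - pos_prev
    return "brake" if velocity > 1.0 else "cruise"

trajectory = [0.0, 2.0, 4.0]  # moving fast toward an obstacle, but position <= 5

print(markov_policy(trajectory[-1]))  # cruise: position alone looks safe
print(history_policy(trajectory))     # brake: velocity 2.0 reveals the danger
```

Stacking a few past observations (or feeding them to an RNN) is exactly the kind of history augmentation the lecture mentions as a substitute for a fully Markov state.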
[00:53:15] [Student question, paraphrased: if you're flying or driving, doesn't only the current moment matter? When does history actually help?]
[00:53:23] I would actually debate that; it's a great question. I think it partly depends on how you're thinking of the state space. Let's say I'm driving: if my state is just my immediate position, that's probably not enough. I probably need at least my last few positions to get at things like velocity and acceleration. Now, you might already be thinking: in my state I already have those. If that's the case, if your state already incorporates something about the first- or second-order derivatives, that's probably okay in some cases. But in other cases, if it's just your immediate sensors, then you want the longer history to capture that, and the same for planes and such.
[00:53:59] Yeah, it's a good question. So this is always just a really good thing to try. It's a really natural baseline, it's generally really easy to do, and people often report it in offline RL; it's extensively used. But it does not always work. Let's see why it might not work. I think one of the themes you're seeing now, like with the policy gradient work, is this challenge of what states you reach: when you use different policies, you're going to end up at different states in general. That's sort of the definition: if your policies never reach any different states and never take different actions, they're the same policy; they generate the same trajectories.
[00:54:36] So DAgger was a paper, from 2011 if I remember right, that came out to try to address some of the challenges with behavior cloning.
[00:54:50] Okay. And I think what they were noticing is this challenge: if you do your cloning, sometimes things go badly, and essentially that's because the decisions you make over time can have cascading effects. Let's see what that might look like. When we reduce our problem to imitation via behavior cloning, we're doing supervised learning, and in supervised learning in general we assume that our data points, our (x, y) pairs, are IID: independent and identically distributed. That ignores temporal structure, because the points are assumed to be totally independent.
[00:55:36] But in our case they're absolutely related. In fact, if you assume a Markov structure, then you have s0, a0, s1, and so on, and whatever you did here helps determine exactly what the next state is. So your different time points are definitely not independent.
[00:56:16] One consequence: if your errors are independent in time, that's generally not too bad, and that's what most of our supervised learning guarantees cover. You assume your data is all IID, and then you can think about what sort of error you get in your estimates. In general, if you make an error at time t with probability less than or equal to epsilon, and you have T decisions, then your expected number of total errors, if all of your decisions are independent, is just epsilon * T, because they're all IID. And that's not too terrible.
[00:56:37] But that's not what we normally have. (Oh, I see what happened there. Okay, let's think about something else; I'll add a different picture later.) Let's think of a race track. In this case you have a race track and your car is driving, except your supervised learning policy isn't perfect, and so it makes a small error: maybe you actually should have gone this way, but you went toward the black part.
[00:57:09] Okay, and now you again make a little bit of an error, and now you're off the track. And now this is really tricky, because you may have almost no data in that part of the space, because your human drivers never decided to drive off the track. So now you're in a region where you have very little data and very little coverage, and you're even more likely to make mistakes. What you can see in this case is that if you make small mistakes early on, those can compound and get you into parts of the state space where you have even less coverage and generally even less accuracy. So in general you can actually do much worse.
[00:57:44] And this is because you have a data distribution mismatch: if the policy that you compute gives you a different distribution between train and test, then you don't necessarily have the same guarantees. And we're going to get a different distribution here, because the policy we were using to gather the data is not exactly the same as the policy we get now. We saw before that when the policy changes, we're going to get to different states and actions; what's causing our policy to change here is the fact that we can't perfectly imitate the expert.
[00:58:17] So let's see what that looks like. In our training set we had pi*, which we assume to be our expert, and we're generating states from pi*. In our test set we have learned a policy by trying to match the state-action pairs we saw in our training set, and we're going to get a different distribution of states. So in general this is going to be different, and we're going to get worse errors in this case.
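This train/test mismatch can be simulated in a few lines (an illustrative toy of mine, not the lecture's example): an expert drives down the center line, a clone with a small per-step error probability drifts sideways, and once it is off the track, where there is no expert data, the drift just keeps compounding. All positions, thresholds, and probabilities here are made up.

```python
import random

def rollout(error_prob, T=50, seed=0):
    """Expert drives the center line (position 0). The clone errs with
    probability error_prob per step; each error drifts it one cell sideways,
    and once off-track (pos >= 3) there is no expert data, so it keeps drifting."""
    rng = random.Random(seed)
    pos, visited = 0, []
    for _ in range(T):
        if pos >= 3:                     # off-track: no coverage, keep drifting
            pos += 1
        elif rng.random() < error_prob:  # a small imitation error
            pos += 1
        visited.append(pos)
    return visited

expert_states = rollout(0.0)  # perfect imitation: never leaves position 0
cloned_states = rollout(0.2)  # small errors push it into unvisited states

print(max(expert_states))  # 0
print(max(cloned_states))  # > 0: the clone reaches states the expert never saw
```

The training distribution (the expert's visited states) is concentrated at the center, while the test distribution (the clone's visited states) leaks into regions with no coverage, which is exactly the mismatch being described.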
[00:58:47] (Sorry about that; let's see what happened in this case. I'll just draw it.) So what can happen is: let's say this is the error you make now, and then you can make another error, and it keeps compounding. If you make an error at time step t with probability epsilon, essentially what can happen is that you may then make errors on the remaining time steps: it can cause you to get into parts of the state-action space where you make lots of errors, and then you incur lots of regret or cost through to the end.
[00:59:26] So in general, and I am not going to step through all of the proof today, the error can actually compound quadratically with the number of time steps instead of linearly, which means your performance is much worse than supervised learning would predict. Supervised learning said: great, I've got an epsilon-optimal or epsilon-accurate policy. What this says is that because all of those decisions are being made across an entire trajectory, you can actually end up with epsilon * T^2 errors instead of epsilon * T.
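The epsilon * T versus epsilon * T^2 gap can be checked numerically with a deliberately pessimistic toy model (a sketch of mine, not the lecture's proof): independent errors give epsilon * T expected mistakes, while a model in which the first error derails every remaining step gives on the order of epsilon * T^2 / 2.

```python
def expected_errors_independent(eps, T):
    # Each of T steps errs independently with probability eps.
    return eps * T

def expected_errors_compounding(eps, T):
    # Pessimistic model: the first error, at step t (probability eps, given no
    # earlier error), also costs a mistake on every remaining step t..T.
    total, p_no_error_yet = 0.0, 1.0
    for t in range(1, T + 1):
        total += p_no_error_yet * eps * (T - t + 1)
        p_no_error_yet *= 1.0 - eps
    return total

eps, T = 0.01, 100
print(expected_errors_independent(eps, T))  # about 1 error expected
print(expected_errors_compounding(eps, T))  # tens of errors: order eps*T^2/2
```

With a 1% per-step error rate over 100 steps, the IID picture predicts about one mistake, while the compounding picture predicts dozens, which is the qualitative gap behind the quadratic bound.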
[00:59:58] So this is what motivated DAgger. DAgger said: okay, what's the problem that's happening here that we'd like to address? Whenever we make mistakes, we go into a different part of the state space, and once we're there we may have very few guarantees that we're going to do anything reasonable. So essentially what we want to do is figure out how we might correct or adjust in those states we reached that weren't in our original training set.
[01:00:30] The idea is an iterative approach. You get a dataset by taking a current policy and executing it in the environment. It's like you drive your race car around a track, and hopefully it's similar to what the expert would have done, but probably not perfect. Then you go to your expert and you say: okay, this is what I did when I went around that track; what should I have done? They're like a coach, and what the coach, the expert, does is say: in each of those states, this is what you should have done.
[01:00:56] So if you went like this, and then did this and all these other crazy things after that, it would say: okay, no, first of all, here you should have gone here, and then once you reached here you should have gone down to try to get back onto the road. Essentially you're having a human label, at every time point, at every state in that trajectory, what they would have done.
[01:01:23] And when you do that, it gives you a new set of data to learn from. It's like your expert pilot gives you feedback on every place you made a mistake when you just did your last flight run, and then you integrate that: oh okay, next time I'm in that situation, I've got to do this. So it gives you a whole bunch more data.
[01:01:44] And then we aggregate that data; that's why it's called DAgger, for data aggregation. We aggregate the dataset of the old data we had with the new data we just got from our expert, we then do behavior cloning again on our new dataset, which now includes more of the states in the environment, and then we repeat.
[01:02:04] I think part of the motivation for this, and this is why I said behavior cloning can work really well when you have enough data, is that the problem happening here is that we don't have full coverage over the whole domain of what the expert would do at any place inside of, say, the race car track. What this allows us to do is better figure out, over the whole space, what the expert would do, so we can make better decisions and correct in case we end up in those states.
[01:02:30] So in DAgger we do this over and over again, and there are some nice theoretical guarantees about what you'll converge to when you do this. They showed this for things like driving in a simulated domain, like a Mario Kart-style video game, and showed that they could quickly learn a very good policy that didn't suffer from these kinds of compounding errors.
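The loop just described (roll out the current policy, ask the expert to label every visited state, aggregate, re-clone, repeat) can be written as a runnable sketch. Everything here is a made-up toy, not the real DAgger implementation: the expert just steers toward a center line, a 1-nearest-neighbor lookup stands in for the behavior-cloning learner, and the dynamics are trivial.

```python
def expert(state):
    # Toy "coach": steer toward the center line (push down if above 0, else up).
    return -1 if state > 0 else 1

def fit(dataset):
    # Behavior-cloning stand-in: 1-nearest-neighbor over (state, action) pairs.
    def policy(state):
        return min(dataset, key=lambda sa: abs(sa[0] - state))[1]
    return policy

def step(state, action):
    return state + action  # trivial dynamics for the sketch

def dagger(n_iters=3, horizon=5):
    dataset = [(0, expert(0))]  # seed with one expert demonstration
    policy = fit(dataset)
    for _ in range(n_iters):
        # 1. Roll out the CURRENT policy and record the states it visits.
        state, visited = 3, []
        for _ in range(horizon):
            visited.append(state)
            state = step(state, policy(state))
        # 2. Ask the expert what it would have done in each visited state.
        new_labels = [(s, expert(s)) for s in visited]
        # 3. Aggregate old and new data, then 4. re-clone, and repeat.
        dataset = dataset + new_labels
        policy = fit(dataset)
    return policy

policy = dagger()
print(policy(5))  # -1: after a few rounds the clone steers back toward center
```

Note that the expert labels are collected on the states the learner actually reaches, including off-distribution ones, which is exactly what plain behavior cloning lacks.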
[01:02:51] Can you think of what a limitation might be of doing this compared with behavior cloning?
[01:02:59] Yeah, exactly, it's super expensive. Basically you have to have that coach, your teacher or your expert, with you for the whole of learning. The nice thing about behavior cloning is you get data once, and the data might already be available, and then you can just learn from it. Here you have to have constant supervision. Now, in some cases that might be reasonable, but in most settings that's going to be really infeasible. So this is very human-in-the-loop: a human has to supervise. And I think for those reasons, while in robotics and some other areas people have certainly built a lot on DAgger, I don't think it's as popular as behavior cloning, because it really does require a lot more work from the human.
[01:03:47] All right, so a second thing you might want to do is learn a reward. You might say: I'd like to actually figure out what the reward is. You might want this for several reasons. One is that you want to understand something about human decision making. You might say: I want to understand how surgeons are making trade-offs when they're dealing with really complicated situations, like how they trade off time or risk or things like that, and maybe it's really hard, or their time is really valuable, so you can't ask them lots of questions, but you really would like to understand that preference structure. That's one goal. Another is that you might want to use it to learn a policy: if I can extract the reward from the data, then I can learn a policy from that. And you'll see that in homework three, because we're going to be doing RLHF as part of that; we're going to try to learn from preferences. So there are lots of reasons you might want to be able to learn a reward function.
[01:04:42] In this case we're going to be in a similar setting. We still have a state space, an action space, and a transition model, still no reward function, and we still have some expert demonstrations. What we want to do is infer the reward function the expert was implicitly using to make their decisions.
[01:05:00] And what we're going to assume for now is that the teacher's policy is optimal; you can call it the teacher or the expert, so the teacher's, or expert's, policy is optimal.
[01:05:11] policy is optimal so let's think about what we can
[01:05:13] optimal so let's think about what we can infer from
[01:05:16] infer from that so if you see someone's
[01:05:18] that so if you see someone's demonstrations and you know that they're
[01:05:21] demonstrations and you know that they're optimal so teacher I'll use teacher
[01:05:26] optimal so teacher I'll use teacher equals to expert for this thing if you
[01:05:29] equals to expert for this thing if you if you see this you know it's
[01:05:30] if you see this you know it's optimal is there a single unique r that
[01:05:33] optimal is there a single unique r that makes teacher policy optimal are there
[01:05:35] makes teacher policy optimal are there many does it depend on the Markov
[01:05:37] many does it depend on the Markov decision process you're not
[01:05:41] sure and remember we know that the
[01:05:43] sure and remember we know that the actual policy is optimal
[01:06:22] and if you think there are many I'd like
[01:06:24] and if you think there are many I'd like you to give me a simple one
[01:06:27] you to give me a simple one which would make things optimal I mean
[01:06:30] which would make things optimal I mean not the thing but I'll ask in a
[01:06:42] second all right why don't we just do a
[01:06:45] second all right why don't we just do a quick check um and talk to a neighbor
[01:06:47] quick check um and talk to a neighbor and see what you find
[01:08:58] okay good so almost everybody said the
[01:09:00] okay good so almost everybody said the answer is B which is true there are many
[01:09:03] answer is B which is true there is many does anybody want to tell me um kind of
[01:09:05] does anybody want to tell me um kind of a silly one that any policy is optimal
[01:09:09] a silly one that any policy is optimal under yeah two
[01:09:13] under yeah two are like just scale by constant fact
[01:09:16] are like just scale by constant fact yeah that's great and I was hearing that
[01:09:17] yeah that's great and I was hearing that too over there so if you scale if you
[01:09:19] too over there so if you scale if you take a reward function and you multiply it
[01:09:21] take a reward function and you multiply it by a positive constant like then that
[01:09:23] by a positive constant like then that can't change the policy um zero works
[01:09:25] can't change the policy um zero works too so you can just use zero and any
[01:09:29] too so you can just use zero and any policy is optimal if you never get
[01:09:31] policy is optimal if you never get reward so I bring this up um not to trip
[01:09:35] reward so I bring this up um not to trip ASE it but just to highlight that this
[01:09:37] ASE it but just to highlight that this is a huge identifiable Pro
[01:09:38] is a huge identifiable Pro identifiability problem there is not a
[01:09:41] identifiability problem there is not a single R even if you know that the
[01:09:43] single R even if you know that the demonstrations are expert um there's not
[01:09:45] demonstrations are expert um there's not a single reward function that's
[01:09:46] a single reward function that's compatible with them and so that's a
[01:09:48] compatible with them and so that's a problem and that's something to keep in
[01:09:49] problem and that's something to keep in mind when we start getting into RLHF
[01:09:52] mind when we start getting into RLHF and DPO shortly um that this is either
[01:09:56] and DPO shortly um that this is either you need to be making other sorts of
[01:09:58] you need to be making other sorts of assumptions to constrain your reward
[01:10:00] assumptions to constrain your reward function or or in general we're going to
[01:10:01] function or in general we're going to have to make additional choices or
[01:10:03] have to make additional choices or constraints because otherwise this is
[01:10:05] constraints because otherwise this is not an identifiable
[01:10:07] not an identifiable problem
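To make the non-identifiability concrete, here is a small sketch; the three-state chain MDP below is invented for illustration, not from the lecture. Value iteration returns the same greedy policy for a reward R and for any positive rescaling of R, which is exactly the scaling answer the students gave.

```python
import numpy as np

# Toy illustration (MDP invented here, not from the slides) of the
# identifiability problem: value iteration recovers the same greedy policy
# for a reward R and for 10 * R; with R = 0, every policy is trivially
# optimal because all Q-values tie.

def greedy_policy(P, R, gamma=0.9, iters=500):
    """P: (A, S, S) transition probabilities, R: (S,) state rewards."""
    A, S, _ = P.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = R[None, :] + gamma * (P @ V)   # Q(a, s) = R(s) + gamma * E[V(s')]
        V = Q.max(axis=0)
    return Q.argmax(axis=0)                # best action in each state

# 3-state chain: action 0 steps left, action 1 steps right; reward in state 2.
P = np.zeros((2, 3, 3))
P[0, 0, 0] = P[0, 1, 0] = P[0, 2, 1] = 1.0
P[1, 0, 1] = P[1, 1, 2] = P[1, 2, 2] = 1.0
R = np.array([0.0, 0.0, 1.0])

print(greedy_policy(P, R), greedy_policy(P, 10.0 * R))   # same policy twice
```

Both calls return the always-step-right policy, so nothing in the demonstrations alone could distinguish R from 10R.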
[01:10:11] great so one thing some people do to try
[01:10:13] great so one thing some people do to try to think about how we might do this is
[01:10:16] to think about how we might do this is to think about um so I think I know what
[01:10:20] to think about um so I think I know what happened I was editing two sets of
[01:10:22] happened I was editing two sets of slides and I think the other one is now
[01:10:24] slides and I think the other one is now well updated but this one must not
[01:10:25] well updated but this one must not unfortunately um in any case we did um
[01:10:28] unfortunately um in any case we did um we talked briefly about uh value
[01:10:30] we talked briefly about uh value function approximation through deep Q
[01:10:32] function approximation through deep Q learning deep Q learning naturally
[01:10:34] learning deep Q learning naturally implies that um we would use a deep
[01:10:35] implies that um we would use a deep neural network but you could use a
[01:10:36] neural network but you could use a linear value function just like you know
[01:10:39] linear value function just like you know a very shallow network uh the idea here
[01:10:42] a very shallow network uh the idea here and this is all the this work predated
[01:10:44] and this is all the this work predated deep Q learning is to think about
[01:10:46] deep Q learning is to think about generally where your reward is linear
[01:10:47] generally where your reward is linear over the features so your reward of s so
[01:10:51] over the features so your reward of s so here we're just doing reward respect to
[01:10:52] here we're just doing reward with respect to states so R(s) = w^T x(s) where
[01:10:55] w is just going to be a weight vector and
[01:10:57] x(s) here is just a
[01:11:00] x(s) here is just a feature representation so this is just
[01:11:05] feature representation so this is just features are
[01:11:07] features are X so that for example could be like if
[01:11:09] X so that for example could be like if I'm a robot if this is my current
[01:11:11] I'm a robot if this is my current location what's the distance to that
[01:11:13] location what's the distance to that wall what's the distance to that wall
[01:11:14] wall what's the distance to that wall that wall and this wall that would be a
[01:11:16] that wall and this wall that would be a set of features and then I could have a
[01:11:18] set of features and then I could have a weighted combination of those to give me
[01:11:19] weighted combination of those to give me the reward of me standing here and the
[01:11:23] the reward of me standing here and the goal is to identify the weight vector
[01:11:25] goal is to identify the weight vector you given a set of
[01:11:27] you given a set of demonstrations okay so in that case you
[01:11:30] demonstrations okay so in that case you can also Express the resulting value
[01:11:32] can also Express the resulting value function for a policy as a combination
[01:11:35] function for a policy as a combination of these weighted
[01:11:36] of these weighted features okay and so just write out sort
[01:11:42] features okay and so just write out sort of particularly because we didn't do it
[01:11:43] of particularly because we didn't do it in class um very much going to write out
[01:11:46] in class um very much going to write out sort of what that looks like so it's the
[01:11:48] expectation over the states we reach under this policy:
[01:11:55] V^pi(s0) = E[ sum_{t=0}^infinity gamma^t w^T x(s_t) | s0 ]
[01:11:58] where w is our unknown weight vector and x(s_t) is our feature representation for that time
[01:12:00] step given we start in s0 but note here that w um is always the same so we can
[01:12:06] just take this out okay we have
[01:12:10] V^pi(s0) = w^T E[ sum_{t=0}^infinity gamma^t x(s_t) | s0 ] = w^T mu(pi)
[01:12:13] and they should start to look
[01:12:15] and they should start to look somewhat familiar because it's going to
[01:12:17] somewhat familiar because it's going to look like these weird discounted
[01:12:19] look like these weird discounted features that we've seen sort of
[01:12:22] before so we can also call this w^T mu(pi) where mu(pi) is the
[01:12:35] discounted state feature distribution
[01:12:40] under pi okay we've seen this before we go sort
[01:12:43] pi okay we've seen this before we go sort of back and forth between thinking of
[01:12:44] of back and forth between thinking of there being time steps and thinking of
[01:12:46] there being time steps and thinking of us is sort of saying well over all time
[01:12:48] us is sort of saying well over all time how much time do we spend in each of the
[01:12:49] how much time do we spend in each of the different
[01:12:50] different states
[01:12:53] states so in particular here I've defined mu to
[01:12:56] so in particular here I've defined mu to just be the discounted weighted
[01:12:57] just be the discounted weighted frequency of State features starting in
[01:12:59] frequency of State features starting in a particular
[01:13:01] a particular state so why have I done this well I've
[01:13:04] state so why have I done this well I've done this to say we can relate what the
[01:13:05] done this to say we can relate what the value is to just a linear combination
[01:13:08] value is to just a linear combination under this linear reward function a
[01:13:10] under this linear reward function a linear combination of my weight feature
[01:13:13] linear combination of my weight feature which I don't know times my feature
[01:13:16] which I don't know times my feature distribution okay and that's good
[01:13:18] distribution okay and that's good because I have access to Features I have
[01:13:20] because I have access to Features I have access to trajectories that were
[01:13:23] access to trajectories that were demonstrated by my experts and I can use
[01:13:25] demonstrated by my experts and I can use that to extract the features of those
[01:13:27] that to extract the features of those states and compute something like
[01:13:29] states and compute an estimate of mu okay all right so but we don't know
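As a concrete aside, that estimate of mu can be computed directly from demonstration data: discount and sum the feature vectors along each trajectory, then average over demonstrations. The feature map, trajectories, and weight vector below are invented for illustration, not from the lecture.

```python
import numpy as np

# Sketch (feature map and trajectories invented, not from the lecture) of
# estimating mu(pi), the discounted state-feature frequency, from expert
# demonstrations. With a linear reward R(s) = w^T x(s), the demonstrated
# policy's value is then just w^T mu.

def feature_expectations(trajectories, x, gamma):
    """Average of sum_t gamma^t x(s_t) over demonstrated state sequences."""
    mus = [sum((gamma ** t) * x(s) for t, s in enumerate(traj))
           for traj in trajectories]
    return np.mean(mus, axis=0)

x = lambda s: np.array([float(s), 1.0])   # toy feature map x(s) = [s, 1]
demos = [[0, 1, 2, 2], [0, 1, 1, 2]]      # two expert state trajectories
mu_hat = feature_expectations(demos, x, gamma=0.5)

w = np.array([1.0, 0.0])                  # hypothetical reward weights
print(mu_hat, w @ mu_hat)                 # estimated features and value
```

Everything here uses only observed states and the known feature map; the unknown w enters only at the final dot product, which is the point being made in the lecture.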
[01:13:33] new okay all right so but we don't know what w is yet so let's think of um what
[01:13:36] what w is yet so let's think of um what we could
[01:13:37] we could do so the goal here is that um we want
[01:13:41] do so the goal here is that um we want to identify the weight Vector W given a
[01:13:43] to identify the weight Vector W given a set of demonstrations we've just seen
[01:13:45] set of demonstrations we've just seen that we can rewrite the value of a
[01:13:47] that we can rewrite the value of a policy Pi if these rewards are linear as
[01:13:51] policy Pi if these rewards are linear as w^T mu(pi) where mu(pi) is the um discounted
[01:13:56] w^T mu(pi) where mu(pi) is the um discounted state
[01:13:57] State frequency all right so what we know is
[01:14:00] frequency all right so what we know is that V for the optimal policy is greater
[01:14:04] that V for the optimal policy is greater than or equal to V Pi for any other
[01:14:08] than or equal to V Pi for any other policy and that means that W Pi of mu Pi
[01:14:16] policy and that means that W Pi of mu Pi star has to be greater than equal to W
[01:14:20] star has to be greater than equal to W Pi of mu
[01:14:22] Pi of mu Pi for all pi
[01:14:26] Pi for all pi where this is
[01:14:30] where this is observed so to um
[01:14:34] observed so to um exper so what it means is that if I pick
[01:14:38] exper so what it means is that if I pick any other policy and I generate what are
[01:14:41] any other policy and I generate what are the state features you'd get under
[01:14:43] the state features you'd get under running that policy in the
[01:14:44] running that policy in the world that distribution of features has
[01:14:47] world that distribution of features has to have lower reward than the features
[01:14:51] to have lower reward than the features I've actually observed in my data
[01:14:53] I've actually observed in my data because I've assumed my expert is
[01:14:54] because I've assumed my expert is optimal so my experts demonstrated
[01:14:56] optimal so my experts demonstrated things it's optimal and when they
[01:14:59] things it's optimal and when they demonstrated things like let's say
[01:15:00] demonstrated things like let's say they're um you know controlling a robot
[01:15:02] they're um you know controlling a robot and the robot spends all this time over
[01:15:04] and the robot spends all this time over in this part of the room and if they
[01:15:06] in this part of the room and if they spend time over this part of the room
[01:15:07] spend time over this part of the room then all my features are going to come
[01:15:09] then all my features are going to come from over here and that means that any
[01:15:12] from over here and that means that any other policy that I
[01:15:15] other policy that I use its features have to have a lower
[01:15:18] use its features have to have a lower weight if they don't match what the
[01:15:20] weight if they don't match what the features are of the expert okay
[01:15:25] regardless of what w is right because
[01:15:27] regardless of what w is right because this has to hold okay so this is for the
[01:15:30] this has to hold okay so this is for the W that we pick this has to be
[01:15:33] W that we pick this has to be true so we can rewrite that as saying
[01:15:39] true so we can rewrite that as saying that the value V(pi*) has to be greater
[01:15:41] than or equal to V(pi) which means that we can just
[01:15:50] write it down in terms of this and the
[01:15:51] write it down in terms of this and the resulting frequencies so therefore the
[01:15:53] resulting frequencies so therefore the experts demonstrations are from the
[01:15:55] experts demonstrations are from the policy to identify W it's sufficient to
[01:15:56] policy to identify W it's sufficient to find a w star such that this
[01:15:59] find a w star such that this holds so we know this has to be true
[01:16:01] holds so we know this has to be true under the true expert that under the
[01:16:04] under the true expert that under the true W it has to be that features we get
[01:16:07] true W it has to be that features we get under the expert policy have to be have
[01:16:10] under the expert policy have to be have a higher reward than the features we get
[01:16:12] a higher reward than the features we get under any other policy so this gives us
[01:16:15] under any other policy so this gives us a
[01:16:16] a constraint that says when we are
[01:16:19] constraint that says when we are searching for what w is because remember
[01:16:21] searching for what w is because remember W determines our reward function this
[01:16:23] W determines our reward function this has to hold
[01:16:25] has to hold okay this
[01:16:28] okay this constraint um and then what we can do
[01:16:33] constraint um and then what we can do is so it's sufficient to say well what
[01:16:36] is so it's sufficient to say well what would be one thing we could do to be
[01:16:38] would be one thing we could do to be optimal if we want to get to a policy
[01:16:39] optimal if we want to get to a policy well we just need to match the match the
[01:16:42] well we just need to match the match the features of the expert so we need a
[01:16:44] features of the expert so we need a policy that induces the same
[01:16:46] policy that induces the same distribution of States as the
[01:16:48] distribution of States as the expert so in general if you have a
[01:16:52] expert so in general if you have a policy such the features you generate
[01:16:54] policy such the features you generate under that policy are really close to
[01:16:56] under that policy are really close to the features you get under Pi
[01:16:59] the features you get under Pi star then for all W with W Infinity less
[01:17:03] star then for all W with W Infinity less than equal to one this is using holders
[01:17:04] than equal to one this is using holders inequality you're guaranteed that the
[01:17:06] inequality you're guaranteed that the reward of this policy is very close to
[01:17:09] reward of this policy is very close to the reward of the optimal policy or your
[01:17:12] the reward of the optimal policy or your expert
[01:17:14] expert policy and all of this is just to say
[01:17:17] policy and all of this is just to say you can reduce the problem of sort of uh
[01:17:20] you can reduce the problem of sort of uh reward learning and policy learning in
[01:17:22] reward learning and policy learning in this case to feature matching that's
[01:17:24] this case to feature matching that's kind of the high level idea is to say in
[01:17:26] kind of the high level idea is to say in the case where you don't observe the
[01:17:27] the case where you don't observe the reward directly but you have access to
[01:17:30] reward directly but you have access to Optimal demonstrations by an expert all
[01:17:32] Optimal demonstrations by an expert all you need to do is to find a policy and
[01:17:35] you need to do is to find a policy and or reward function that um that allows
[01:17:38] or reward function that um that allows you to match those features because
[01:17:41] you to match those features because those are the features that we know have
[01:17:42] those are the features that we know have higher
[01:17:45] reward all right now as we've already
[01:17:48] reward all right now as we've already talked about there is still an infinite
[01:17:50] talked about there is still an infinite number of reward functions with the same
[01:17:52] number of reward functions with the same optimal policy so even when we think
[01:17:54] optimal policy so even when we think about this mapping features it doesn't
[01:17:55] about this mapping features it doesn't solve the issue we just
[01:17:58] solve the issue we just identified um and there are many
[01:18:00] identified um and there are many stochastic policies that can match the
[01:18:01] stochastic policies that can match the feature
[01:18:03] feature counts so I haven't told you anything
[01:18:05] counts so I haven't told you anything yet to solve that big problem I've just
[01:18:06] yet to solve that big problem I've just told you sort of another way to think
[01:18:08] told you sort of another way to think about it and so there's this question of
[01:18:10] about it and so there's this question of like how do we pick among all these
[01:18:11] like how do we pick among all these different
[01:18:13] different options so there's a number of different
[01:18:15] options so there's a number of different ways to do this some of the largest and
[01:18:18] ways to do this some of the largest and most influential ideas are are these two
[01:18:20] most influential ideas are are these two maximum entropy inverse reinforcement
[01:18:22] maximum entropy inverse reinforcement learning and Gale
[01:18:26] learning and Gale and what we'll do next time is to talk
[01:18:28] and what we'll do next time is to talk about maximum entropy inverse
[01:18:29] about maximum entropy inverse reinforcement learning which has been
[01:18:31] reinforcement learning which has been very very influential this work is from 2008
[01:18:34] very very influential this work is from 2008 so we'll pick up on that on
[01:18:36] so we'll pick up on that on Wednesday thanks
Lecture 008
Stanford CS234 Reinforcement Learning I Offline RL 1 I 2024 I Lecture 8
Source: https://www.youtube.com/watch?v=IEbuJtjqtMU
---
Transcript
[00:00:06] all right well while we work on getting things
[00:00:07] all right well while we work on getting things started I'm just going to write a couple
[00:00:08] started I'm just going to write a couple things up about General Logistics can
[00:00:11] things up about General Logistics can continue to work on this for a second
[00:00:37] for
[00:01:11] all right okay great well why don't we
[00:01:13] all right okay great well why don't we dive into this
[00:01:16] dive into this um let's see so I think everybody agreed
[00:01:22] um let's see so I think everybody agreed from the first one which is great so
[00:01:23] from the first one which is great so this is true so this is
[00:01:27] this is true so this is true um there was was a bit of uh dis
[00:01:33] true um there was was a bit of uh dis bit of disagreement about this B and C
[00:01:35] bit of disagreement about this B and C so why don't you talk to your neighbor
[00:01:36] so why don't you talk to your neighbor for a second and see if that changes
[00:01:38] for a second and see if that changes your mind or resolves the
[00:01:40] your mind or resolves the confusion
[00:02:33] all right so um the first one is oh this
[00:02:38] all right so um the first one is oh this is false and this is false um so DAgger
[00:02:42] is false and this is false um so DAgger if you think back unfortunately required
[00:02:44] if you think back unfortunately required the human to keep around forever uh they
[00:02:46] the human to keep around forever uh they would constantly be getting asked hey
[00:02:48] would constantly be getting asked hey for the policy that the agent followed
[00:02:50] for the policy that the agent followed was this an optimal action or not um and
[00:02:53] was this an optimal action or not um and behavior cloning does not require
[00:02:55] behavior cloning does not require knowing the Dynamics model it allows us
[00:02:56] knowing the Dynamics model it allows us to reduce reinforcement learning to
[00:02:59] to reduce reinforcement learning to supervised learning and the idea is that
[00:03:00] supervised learning and the idea is that we take the expert demonstrations and we
[00:03:02] we take the expert demonstrations and we just try to learn state to action
[00:03:04] just try to learn state to action mappings and so we can just treat it as
[00:03:06] mappings and so we can just treat it as a standard supervised learning
[00:03:08] a standard supervised learning problem
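The supervised-learning reduction just recapped can be sketched in a few lines; the expert data and the nearest-neighbor "policy" below are invented stand-ins, not from the lecture. Behavior cloning simply fits any supervised state-to-action mapping on the expert's (state, action) pairs.

```python
import numpy as np

# Minimal behavior-cloning sketch (expert data invented, not from the
# lecture): reduce imitation learning to supervised learning by fitting a
# state -> action mapping on expert (state, action) pairs. A 1-nearest-
# neighbor lookup stands in for the learned policy here.

expert_states = np.array([0.0, 1.0, 2.0, 3.0])   # states the expert visited
expert_actions = np.array([0, 0, 1, 1])          # actions the expert took

def bc_policy(state):
    """Copy the action the expert chose in the most similar observed state."""
    return expert_actions[np.argmin(np.abs(expert_states - state))]

print(bc_policy(0.4), bc_policy(2.6))   # mimics the nearest expert decisions
```

Any classifier could replace the nearest-neighbor lookup; the point is that no dynamics model and no reward function are needed, which is exactly why the quiz statements above are false.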
[00:03:11] problem great all right so I think um you know
[00:03:14] great all right so I think um you know as we go further in the course we get to
[00:03:16] as we go further in the course we get to go to more and more exciting topics and
[00:03:19] go to more and more exciting topics and today we're really going to start to see
[00:03:21] today we're really going to start to see how you know skipping all the NLP side
[00:03:23] how you know skipping all the NLP side but how do we actually get to
[00:03:24] but how do we actually get to reinforcement learning that can do some
[00:03:26] reinforcement learning that can do some of the amazing things that we see large
[00:03:27] of the amazing things that we see large language models doing so for example um
[00:03:30] language models doing so for example um you know when I was preparing this
[00:03:31] you know when I was preparing this lecture I was like please write me a
[00:03:33] lecture I was like please write me a program to demonstrate how RLHF works be
[00:03:35] program to demonstrate how RLHF works be brief in your explanations um and then
[00:03:37] brief in your explanations um and then show me the code and within like you
[00:03:39] show me the code and within like you know about 5 Seconds it generated me
[00:03:42] know about 5 Seconds it generated me code that used um Q learning other
[00:03:44] code that used um Q learning other things to generate an actual example of
[00:03:47] things to generate an actual example of how RLHF which stands for reinforcement
[00:03:49] how RLHF which stands for reinforcement learning from human feedback which is
[00:03:52] learning from human feedback which is how they trained ChatGPT amongst a
[00:03:55] how they trained ChatGPT amongst a whole bunch of other things to do um so
[00:03:57] whole bunch of other things to do um so it could generate me a small example of
[00:03:59] it could generate me a small example of how to do that code that you can run so
[00:04:01] how to do that code that you can run so that's pretty extraordinary this was not
[00:04:04] that's pretty extraordinary this was not possible um you know two years ago well
[00:04:07] possible um you know two years ago well I started offering this class in 2017 so
[00:04:09] I started offering this class in 2017 so when I first started offering this class
[00:04:11] when I first started offering this class this was definitely not possible this
[00:04:12] this was definitely not possible this only really became possible with
[00:04:14] only really became possible with ChatGPT so it's pretty phenomenal that we
[00:04:17] ChatGPT so it's pretty phenomenal that we now have AI that can do this um and the
[00:04:19] now have ai that can do this um and the question is how do we get there and what
[00:04:21] question is how do we get there and what sort of RL techniques are being used to
[00:04:24] sort of RL techniques are being used to help accomplish this so that's what
[00:04:25] help accomplish this so that's what we're going to start digging into now so
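As a small taste of where this is heading, here is a hedged sketch of the reward-modeling step at the heart of RLHF; the candidate items, preference pairs, and learning-rate choice below are all invented toy data, not the actual ChatGPT pipeline. Under a Bradley-Terry model, the probability that a human prefers one response over another is a sigmoid of the reward difference, and we fit per-item reward scores by gradient ascent on the preference log-likelihood.

```python
import numpy as np

# Toy sketch (data invented, not the production RLHF pipeline) of learning
# a reward from human preferences with a Bradley-Terry model:
# P(a preferred over b) = sigmoid(r(a) - r(b)).

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 3 candidate responses; humans preferred item 2 over 0, and 0 over 1.
prefs = [(2, 0), (0, 1), (2, 1)]   # (winner, loser) pairs
r = np.zeros(3)                    # learnable reward score per item

for _ in range(2000):
    grad = np.zeros(3)
    for w_idx, l_idx in prefs:
        p = sigmoid(r[w_idx] - r[l_idx])   # model's P(winner beats loser)
        grad[w_idx] += (1.0 - p)           # push winner's reward up
        grad[l_idx] -= (1.0 - p)           # push loser's reward down
    r += 0.1 * grad                        # gradient ascent on log-likelihood

print(r.argsort())   # ascending ranking recovered from preferences alone
```

Note that the recovered scores are only identified up to a shared shift, echoing the identifiability discussion from last lecture; only reward differences are pinned down by preference data.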
[00:04:29] we're going to start digging into now so today what we're going to do is we um
[00:04:31] today what we're going to do is we um are going to continue on from imitation
[00:04:33] are going to continue on from imitation learning and talk a bit about
[00:04:36] learning and talk a bit about reinforcement learning from Human
[00:04:37] reinforcement learning from Human feedback and then next time we're going
[00:04:40] feedback and then next time we're going to have um a guest lecture from one of
[00:04:42] to have um a guest lecture from one of the authors of the direct preference
[00:04:43] the authors of the direct preference optimization work which received uh best
[00:04:46] optimization work which received uh best paper runnerup at neural information
[00:04:48] paper runnerup at neural information processing systems which is kind of the
[00:04:50] processing systems which is kind of the premier machine learning conference um
[00:04:52] premier machine learning conference um so he's going to come talk he's one of
[00:04:54] so he's going to come talk he's one of The Graduate students here at Stanford
[00:04:55] The Graduate students here at Stanford and this is sort of have become really I
[00:04:58] and DPO has sort of become really I
[00:05:00] guess maybe like it's starting to
[00:05:03] replace or exceed performance of RLHF on a lot of benchmarks that's super
[00:05:05] a lot of benchmarks that's super exciting it'll be great to have him and
[00:05:06] exciting it'll be great to have him and in fact because everybody here always is
[00:05:08] in fact because everybody here always is innovating which is awesome he was like
[00:05:10] innovating which is awesome he was like oh we actually have a new paper coming
[00:05:11] oh we actually have a new paper coming out an archive like next week that shows
[00:05:13] out on arXiv like next week that shows how we can extend this in all these
[00:05:14] how we can extend this in all these different ways um so I asked him if he
[00:05:16] different ways um so I asked him if he had time to cover that a little bit so
[00:05:19] had time to cover that a little bit so there's there's a lot of work to be done
[00:05:21] there's there's a lot of work to be done in this space to think about how do we
[00:05:22] in this space to think about how do we better use RL in combination with these
[00:05:25] better use RL in combination with these incredible function approximators of
[00:05:26] incredible function approximators of large language models to create you know
[00:05:29] large language models to create you know a sort of the amazing performance that
[00:05:31] a sort of the amazing performance that we could see of a system that could do
[00:05:32] we could see of a system that could do something like
[00:05:34] something like this so that's where we're going um what
[00:05:37] this so that's where we're going um what we're going to focus on today is to
[00:05:38] we're going to focus on today is to continue talking about imitation
[00:05:39] continue talking about imitation learning and I think imitation learning
[00:05:42] learning and I think imitation learning is a nice way to build into this because
[00:05:43] is a nice way to build into this because imitation learning is one form of using
[00:05:46] imitation learning is one form of using human feedback to try to train uh try to
[00:05:49] human feedback to try to train uh try to train reinforcement learning agents and
[00:05:52] train reinforcement learning agents and then when we get into RLHF well that'll
[00:05:54] then when we get into RLHF well that'll be sort of a different way to leverage
[00:05:56] be sort of a different way to leverage human
[00:05:57] human expertise so to start we're going to go
[00:05:59] expertise so to start we're going to go back to imitation learning and to talk a
[00:06:02] back to imitation learning and to talk a lot today about max entropy inverse
[00:06:04] lot today about max entropy inverse reinforcement learning so let's just
[00:06:06] reinforcement learning so let's just remember where we were last time what we
[00:06:09] remember where we were last time what we were talking about when we talked about
[00:06:10] were talking about when we talked about imitation learning was the idea of
[00:06:12] imitation learning was the idea of taking demonstrations from people and
[00:06:15] taking demonstrations from people and these either could be explicit
[00:06:16] these either could be explicit demonstrations like I show the robot how
[00:06:19] demonstrations like I show the robot how to pick up a cup and it records all my
[00:06:21] to pick up a cup and it records all my movements and then you can use that for
[00:06:23] movements and then you can use that for later training um or it could just be
[00:06:27] later training um or it could just be natural trajectories so like you take
[00:06:28] natural trajectories so like you take electronic medical record systems and
[00:06:30] electronic medical record systems and you just look at the decisions that are
[00:06:32] you just look at the decisions that are made from doctors and we use that um to
[00:06:35] made from doctors and we use that um to try to either equal doctor performance
[00:06:37] try to either equal doctor performance or exceed doctor performance so often we
[00:06:40] or exceed doctor performance so often we just have observation data which may
[00:06:41] just have observation data which may either it's been done in normal sort of
[00:06:43] either it's been done in normal sort of business as usual or that is explicitly
[00:06:45] business as usual or that is explicitly being given um as a demonstration
[00:06:47] being given um as a demonstration trajectory and this is just going to be
[00:06:49] trajectory and this is just going to be the sequence of states and actions we're
[00:06:52] the sequence of states and actions we're not going to have rewards um in general
[00:06:55] not going to have rewards um in general and so the idea was well it might be
[00:06:58] and so the idea was well it might be easier in some cases either because it's
[00:07:00] easier in some cases either because it's just sort of natural data traces that
[00:07:02] just sort of natural data traces that are being generated as they're part of
[00:07:03] are being generated as they're part of their normal work like electronic
[00:07:05] their normal work like electronic medical record systems are um or because
[00:07:10] medical record systems are um or because uh it's hard for people to write down a
[00:07:12] uh it's hard for people to write down a reward function that kind of captures
[00:07:14] reward function that kind of captures all the complexity of what they're
[00:07:15] all the complexity of what they're trying to um do in their objective so
[00:07:18] trying to um do in their objective so that was one of the motivations for this
[00:07:20] that was one of the motivations for this and we saw a few different ways to try
[00:07:22] and we saw a few different ways to try to think about this setting last time
[00:07:25] to think about this setting last time including Behavior cloning where we just
[00:07:26] including Behavior cloning where we just map things back to supervised learning
[00:07:28] map things back to supervised learning and we try to learn in a policy directly
[00:07:30] and we try to learn in a policy directly to match the expert we saw dagger um
[00:07:35] to match the expert we saw dagger um I'll put that on here too so another
[00:07:37] I'll put that on here too so another thing that we saw kind of in between
[00:07:38] thing that we saw kind of in between these two was
[00:07:40] these two was dagger which tried to address a
[00:07:43] dagger which tried to address a challenge of behavior cloning which is
[00:07:45] challenge of behavior cloning which is that when you make mistakes in your
[00:07:48] that when you make mistakes in your supervised Learning System you may end
[00:07:50] supervised Learning System you may end up in parts of the state and action
[00:07:52] up in parts of the state and action distribution that you don't know you
[00:07:54] distribution that you don't know you don't have good coverage so we talked
[00:07:56] don't have good coverage so we talked about this kind of race car track
[00:07:58] about this kind of race car track example where like once you go off
[00:08:01] example where like once you go off you've got a distribution mismatch and
[00:08:03] you've got a distribution mismatch and we'll hear more about distribution
[00:08:04] we'll hear more about distribution mismatches um in RLS later in the course
[00:08:08] mismatches um in RLS later in the course and there we wouldn't necessarily know
[00:08:09] and there we wouldn't necessarily know what to do and so what dagger said is we
[00:08:12] what to do and so what dagger said is we have to keep an expert around and then
[00:08:14] have to keep an expert around and then they will always tell us what you should
[00:08:15] they will always tell us what you should have done so they kind of a coach go
[00:08:17] have done so they kind of a coach go back they replay you know how you did on
[00:08:19] back they replay you know how you did on that hockey game what you should have
[00:08:20] that hockey game what you should have done at each moment um and there there's
[00:08:23] done at each moment um and there there's a lot of really interesting questions of
[00:08:24] a lot of really interesting questions of kind of thinking about those
[00:08:26] kind of thinking about those counterfactuals and then we thought
[00:08:28] counterfactuals and then we thought about this broad question of well could
[00:08:29] about this broad question of well could we recover the
[00:08:31] we recover the reward from looking at these
[00:08:34] reward from looking at these demonstrations and this could be useful
[00:08:36] demonstrations and this could be useful within its own right to try to
[00:08:38] within its own right to try to understand the you know the objectives
[00:08:40] understand the you know the objectives that people are using when they're
[00:08:41] that people are using when they're making their decisions for different
[00:08:43] making their decisions for different areas um as well as potentially for
[00:08:46] areas um as well as potentially for learning a better policy or learning the
[00:08:48] learning a better policy or learning the policy and then can we also sort of you
[00:08:51] policy and then can we also sort of you know once we have that R generate a good
[00:08:53] know once we have that R generate a good policy or generate a good policy
[00:08:57] directly so one of the ideas that we
[00:08:59] So one of the ideas we talked about in this case is: what is sufficient to be able to accomplish mimicking? In particular we said, if we want a policy that matches the expert, that is equivalent to generating trajectories whose distribution is the same as what the expert would have produced. So we can think of a strong relationship between policies and trajectories, which also extends to states and actions: a policy induces a distribution over states and actions, and two policies that induce the same distribution over states and actions will have the same reward, because we're assuming that rewards are only a function of the states and actions.
[00:09:54] And so we talked about how people had leveraged this assumption to think about different ways to learn reward features. For example, suppose you have a set of features induced by your policy; call this mu. These could be things like how quickly a call service agent responds to calls or how many times they use positive sentiment; in the case of a robot, it might be how many times it hit a wall or how far it went. For any of these sorts of features, you could imagine that your reward function is just a linear combination of those features.
[00:10:30] (Student:) So these features are just things that people come up with for every problem? (Instructor:) Great question. So the question was: are these features things people are writing down per problem? Historically, yes. I think one of the big pushes with deep learning has been to go as close to the sensors as possible: can we use just images instead of features on images? But in the case of something like, say, online marketing, a lot of them would potentially be predefined, like what web pages you looked at and what search queries you ran. So you would still have to enumerate a set of features that you're defining your reward over, but ideally it's as close to the sensor level of the data you're collecting as possible, or at least that often has a big advantage.
[00:11:17] So what we saw here is that, essentially, because we assume the reward is linear in the features with an unknown weight vector (both the weights and the features are vectors), we could say: if you can make sure that your distribution over features is really close, and you bound the norm of the weight vector, then being really close in features is the same as being really close in reward. Which means that if your policy can induce the same features, it can get the same reward. This is a recap from last time, but it's useful to keep in mind as we go forward.
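That recap can be checked numerically. Here is a minimal sketch (all the feature vectors and weights below are made-up illustrations, not from the lecture): by the Cauchy-Schwarz inequality, |w . mu1 - w . mu2| <= ||w|| * ||mu1 - mu2||, so if two policies' feature expectations are close and the norm of the weight vector is bounded, their linear rewards are close too.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

# Hypothetical feature expectations for an expert policy and a learner policy,
# plus a hypothetical (normally unknown) weight vector with bounded norm.
mu_expert = [0.8, 0.1, 0.3]
mu_learner = [0.79, 0.12, 0.31]   # close to the expert in feature space
w = [1.0, -2.0, 0.5]

# Gap in linear reward w . mu between the two policies.
reward_gap = abs(dot(w, mu_expert) - dot(w, mu_learner))
diff = [a - b for a, b in zip(mu_expert, mu_learner)]

# Cauchy-Schwarz: the reward gap is at most ||w|| times the feature gap.
assert reward_gap <= norm(w) * norm(diff) + 1e-12
print(reward_gap)            # small, because the features nearly match
print(norm(w) * norm(diff))  # an upper bound on the reward gap
```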
[00:11:49] So one of the big challenges we talked about last time is that there is not a unique reward function compatible with the observed data, even if you assume the observed data is optimal. We talked about how even the zero reward is compatible with any policy you might see. So in general it's not going to be identifiable: we can't just say, if we observe these trajectories and we know the policy is optimal, then this is the reward; there are too many rewards that are compatible. And so what we're going to spend a lot of time on now is one choice for how to break that ambiguity. Okay, this is where we left off last time.
[00:12:25] we left off last time and what we're going to um focus on
[00:12:28] time and what we're going to um focus on now
[00:12:30] now this um is maximum entropy IRL Gail is
[00:12:34] this um is maximum entropy IRL Gail is also this is the second one is known as
[00:12:36] also this is the second one is known as Gail this is also a popular poach this
[00:12:38] Gail this is also a popular poach this is developed by um uh Stephano oran's
[00:12:41] is developed by um uh Stephano oran's group here at Stamford uh but we're
[00:12:43] group here at Stamford uh but we're going to start with Max entropy because
[00:12:44] going to start with Max entropy because also it's there's a lot of other
[00:12:46] also it's there's a lot of other follow-up things that could be useful
[00:12:47] follow-up things that could be useful from this
[00:12:48] from this idea okay so we're going to talk about
[00:12:52] Okay, so we're going to talk about max entropy inverse RL. This came out in 2008, and it starts from the principle of maximum entropy. Raise your hand if you've heard of this before in the context of probability distributions. Okay, a few people, more than I would have expected. Cool. All right, so remember the entropy of a distribution P: think of this as a probability distribution, and if you have a discrete state space, it's just a probability distribution over states.
[00:13:26] distribution okay so the entropy of a probability
[00:13:30] okay so the entropy of a probability distribution is minus the sum over all
[00:13:34] distribution is minus the sum over all of the states the probability of that
[00:13:36] of the states the probability of that state Times log of the probability of
[00:13:37] state Times log of the probability of the state it helps capture sort of you
[00:13:40] the state it helps capture sort of you know how distributed our distribution
[00:13:43] know how distributed our distribution is and what the principle of Max entropy
[00:13:46] is and what the principle of Max entropy says is that the probability
[00:13:49] says is that the probability distribution which best represents the
[00:13:52] distribution which best represents the current state of knowledge what do we
[00:13:54] current state of knowledge what do we mean by current state of knowledge is it
[00:13:56] mean by current state of knowledge is it is there you know if we start if we have
[00:13:57] is there you know if we start if we have some previous data
[00:13:59] some previous data the one that we should pick is the the
[00:14:01] the one that we should pick is the the so the probably distribution we should
[00:14:03] so the probably distribution we should write down is the one with the largest
[00:14:04] write down is the one with the largest entropy given the constraints of the
[00:14:06] entropy given the constraints of the precisely stated prior
[00:14:09] precisely stated prior data so you can imagine you have your
[00:14:11] data so you can imagine you have your expert data and what this says is that
[00:14:14] expert data and what this says is that and we haven't talked about what these
[00:14:16] and we haven't talked about what these probability distributions will be yet
[00:14:18] probability distributions will be yet but what this says is that we're going
[00:14:19] but what this says is that we're going to try to write down distributions over
[00:14:22] to try to write down distributions over um we're going to look at trajectories
[00:14:24] um we're going to look at trajectories in particular that are compatible with
[00:14:28] in particular that are compatible with our observ
[00:14:29] our observ trajectories but otherwise have the
[00:14:32] trajectories but otherwise have the highest
[00:14:34] highest entropy and so intuitively you could
[00:14:36] entropy and so intuitively you could think of sort of if you have some data
[00:14:37] think of sort of if you have some data you want to find probability
[00:14:38] you want to find probability distributions that are consistent with
[00:14:40] distributions that are consistent with that but have the highest entropy on um
[00:14:43] that but have the highest entropy on um given that they're consistent so we're
[00:14:44] given that they're consistent so we're going to end up with something where you
[00:14:46] going to end up with something where you have constraints yeah I don't understand
[00:14:48] have constraints yeah I don't understand the motivation imation I'm trying to not
[00:14:53] the motivation imation I'm trying to not deploy it's expensive I
[00:14:56] deploy it's expensive I the what what are we trying to do with
[00:14:59] the what what are we trying to do with invitation yeah what's the motivation
[00:15:01] invitation yeah what's the motivation for doing this because you already have
[00:15:02] for doing this because you already have access to an expert by
[00:15:05] access to an expert by learn uh well the idea is that you have
[00:15:08] learn uh well the idea is that you have access to trajectories over the from the
[00:15:11] access to trajectories over the from the expert so you don't have access to the
[00:15:13] expert so you don't have access to the expert at all point you don't have their
[00:15:14] expert at all point you don't have their policy you just have
[00:15:17] policy you just have observations so you can imagine
[00:15:18] observations so you can imagine something like um if I'm an expert
[00:15:20] something like um if I'm an expert doctor you could look at all of the ways
[00:15:22] doctor you could look at all of the ways that I do surgery and you could like
[00:15:24] that I do surgery and you could like look at all of my movements and stuff
[00:15:26] look at all of my movements and stuff and then what I want to do is have a
[00:15:27] and then what I want to do is have a robot that can imitate that and so I
[00:15:29] robot that can imitate that and so I need to distill it and make it sort of
[00:15:30] need to distill it and make it sort of you know into an explicit parameterized
[00:15:33] you know into an explicit parameterized policy set yeah okay cool all right so
[00:15:37] policy set yeah okay cool all right so this is a this is an interesting idea um
[00:15:39] this is a this is an interesting idea um this is of saying this is one way to
[00:15:41] this is of saying this is one way to break ties like there's a whole bunch of
[00:15:43] break ties like there's a whole bunch of different reward functions a whole bunch
[00:15:44] different reward functions a whole bunch of different ways you could maybe be
[00:15:45] of different ways you could maybe be compatible with the observed data um
[00:15:48] compatible with the observed data um let's pick ones which have the maximum
[00:15:50] let's pick ones which have the maximum entropy okay so this is just a choice
[00:15:53] entropy okay so this is just a choice you could
[00:15:56] make be consistent
[00:15:59] (Instructor:) Great question; hold on to that for a second. The question was: what does this mean, and how do we actually make it mathematically formal and algorithmic? We'll see that in the next few slides; we're going to write this down in a formal way. But this is the principle, and this is what Brian Ziebart and his colleagues thought about in terms of this method.
[00:16:19] this method okay and I'll just say a little bit about the motivation so um
[00:16:21] little bit about the motivation so um Brian was a grad student at the time at
[00:16:23] Brian was a grad student at the time at C melon and they were interested in
[00:16:24] C melon and they were interested in trying to understand Taxi Driver
[00:16:26] trying to understand Taxi Driver behavior and so what they wanted to do
[00:16:29] behavior and so what they wanted to do is you know when you're driving there's
[00:16:30] is you know when you're driving there's lotss of different constraints
[00:16:31] lotss of different constraints particularly if you're a taxi driver you
[00:16:32] particularly if you're a taxi driver you want to think about distance and
[00:16:34] want to think about distance and potential traffic and you know tolls and
[00:16:36] potential traffic and you know tolls and all these things and so what they wanted
[00:16:38] all these things and so what they wanted to do is just to take trajectories of
[00:16:40] to do is just to take trajectories of people driving through the streets of
[00:16:42] people driving through the streets of Pittsburgh um and then try to infer what
[00:16:44] Pittsburgh um and then try to infer what the reward function was that taxi
[00:16:46] the reward function was that taxi drivers were using as well be able to
[00:16:48] drivers were using as well be able to have a policy that did as well as like
[00:16:50] have a policy that did as well as like good taxi drivers um so this was sort of
[00:16:54] good taxi drivers um so this was sort of part of the motivation and they weren't
[00:16:55] part of the motivation and they weren't again they had to deal with this
[00:16:57] again they had to deal with this question of how do you you can't just
[00:16:58] question of how do you you can't just learn unique reward so like let's just
[00:17:00] learn unique reward so like let's just try to find something that's got maximum
[00:17:02] try to find something that's got maximum entropy and let's see what this means in
[00:17:04] entropy and let's see what this means in this case all right so in the linear
[00:17:07] this case all right so in the linear reward case what we're going to be
[00:17:10] reward case what we're going to be interested in or how we're going to
[00:17:11] interested in or how we're going to think about where Max entropy applies is
[00:17:13] think about where Max entropy applies is to say we're going to have distributions
[00:17:16] to say we're going to have distributions over
[00:17:17] over trajectories so we're have distributions
[00:17:19] trajectories so we're have distributions over trajectories and we want to find a
[00:17:22] over trajectories and we want to find a distribution over
[00:17:23] distribution over trajectories that matches our observed
[00:17:26] trajectories that matches our observed distribution over trajectories from the
[00:17:28] distribution over trajectories from the expert but otherwise has really high
[00:17:32] expert but otherwise has really high entropy so what we're going to what you
[00:17:34] entropy so what we're going to what you could be learning in this case is a
[00:17:36] could be learning in this case is a probability distribution over
[00:17:38] probability distribution over trajectories that has the maximum
[00:17:40] trajectories that has the maximum entropy subject to the fact that it is a
[00:17:43] entropy subject to the fact that it is a true probability distribution so that's
[00:17:46] true probability distribution so that's one constraint so this is just subject
[00:17:47] one constraint so this is just subject two you think of subject two or such
[00:17:50] two you think of subject two or such that but the other ones are constraints
[00:17:52] that but the other ones are constraints and the other is that in this case we're
[00:17:55] and the other is that in this case we're going to say we're going to want to
[00:17:56] going to say we're going to want to match the features
[00:17:59] match the features and we saw before that matching the
[00:18:01] and we saw before that matching the features was equivalent to being able to
[00:18:04] features was equivalent to being able to match the rewards in the case where you
[00:18:06] match the rewards in the case where you have a linear function so in the linear
[00:18:09] have a linear function so in the linear reward
[00:18:12] case what we want to do is we want to
[00:18:14] case what we want to do is we want to say I've got my distribution of
[00:18:16] say I've got my distribution of trajectories let's say mu is a function
[00:18:18] trajectories let's say mu is a function that just takes a trajectory and outputs
[00:18:21] that just takes a trajectory and outputs a set of features and we'll talk about
[00:18:22] a set of features and we'll talk about some other choices for that soon and we
[00:18:24] some other choices for that soon and we just want that to match what the
[00:18:26] just want that to match what the features were we observed from the
[00:18:28] features were we observed from the trajectories from D where D is a data
[00:18:31] trajectories from D where D is a data set from our
[00:18:33] set from our experts this is like
[00:18:35] experts this is like our from our
[00:18:39] experts okay so this is how we would
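Written out (using the notation just described: P a distribution over trajectories tau, mu(tau) the trajectory features, and D the expert data set), the program is roughly:

```latex
\max_{P}\; -\sum_{\tau} P(\tau)\,\log P(\tau)
\quad \text{s.t.} \quad
\sum_{\tau} P(\tau) = 1,\qquad P(\tau) \ge 0,\qquad
\sum_{\tau} P(\tau)\,\mu(\tau) = \frac{1}{|D|}\sum_{\tau_E \in D} \mu(\tau_E)
```

The first two constraints say P is a true probability distribution; the last says the model's expected features match the empirical average features of the expert trajectories.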
[00:18:41] experts okay so this is how we would write that down now I haven't told you
[00:18:43] write that down now I haven't told you yet how we're going to learn the reward
[00:18:44] yet how we're going to learn the reward function I haven't even told you how
[00:18:46] function I haven't even told you how we're going to learn this but it's this
[00:18:47] we're going to learn this but it's this is sort of this is where the maximum
[00:18:50] is sort of this is where the maximum entropy assumption is being applied it's
[00:18:52] entropy assumption is being applied it's saying what we mean by maximum entropy
[00:18:54] saying what we mean by maximum entropy is we want to think about getting a
[00:18:55] is we want to think about getting a distribution over trajectories that is
[00:18:57] distribution over trajectories that is compatible with our EXP data but
[00:18:59] compatible with our EXP data but otherwise has the maximum yeah remind me
[00:19:02] otherwise has the maximum yeah remind me one more time um when you say
[00:19:05] one more time um when you say distribution over trajectories is does
[00:19:06] distribution over trajectories is does that mean distribution over policies
[00:19:08] that mean distribution over policies that create that trajectory or is that
[00:19:10] that create that trajectory or is that something else great question that sort
[00:19:12] something else great question that sort of isomorphic so you can just think of
[00:19:13] of isomorphic so you can just think of it directly as me you know distributions
[00:19:15] it directly as me you know distributions over State reward State action State
[00:19:18] over State reward State action State action Etc or you can think of it as
[00:19:20] action Etc or you can think of it as it's implicitly going through a policy
[00:19:22] it's implicitly going through a policy that is generating those yeah and we'll
[00:19:25] that is generating those yeah and we'll become clear too about like sort of
[00:19:26] become clear too about like sort of where the policies come in great
[00:19:28] where the policies come in great question
[00:19:30] Okay, so this is what this would say, but we haven't gotten to rewards yet, and we need to think about how to go from this to learning reward models and learning policies. Okay, so in general we don't have rewards. But if we did have rewards, what we would like is a policy that induces trajectories that match the same reward as our expert; that is, we would like a policy that has as high a reward as our expert. If we knew what those rewards were, say we had a reward function R, then we would say: suppose we're going to learn a distribution over trajectories; we want it to be the same as the expert's, and I'll just highlight that here I'm using P-hat for the expert. Okay, so this looks almost the same as above, except I've said: let's imagine we don't necessarily have to have a linear reward function in general; we just want that, whatever our distribution over trajectories is, it matches the reward of the expert's, because we know the expert's is optimal, so if we achieve this, we're good. Okay, so we would like to be able to solve this problem. We still don't know what R is, so we can't do this yet, but we're just going to look at what the solution to this problem would be.
[00:20:54] So where we're going to go from this is that we're ultimately going to end up with an algorithm that does something like the following. We're going to assume we have a reward function, or compute one. Once we have a reward function, we're going to learn an optimal policy, and then we're going to use our state or trajectory features to update our reward function, and we're going to do this many times. So we're going to be thinking a lot about the relationships between reward functions and optimal policies, between optimal policies and distributions over states and actions, and between distributions over states and actions and how we can update our reward function. Okay, and we're going to step through all of those steps. And in the original paper, they assume the dynamics model is known. All right, so let's step through the first part, because I think it's really helpful to see.
[00:21:49] Often, when people talk about max entropy, they introduce this sort of exponential family, and it may or may not be clear where that comes from. Okay, so remember that we have this constrained objective; we have this thing here. All right, so what we would like to understand in this case is: given a constrained objective, if we knew the cost, what would be the form of the distribution over τ? Because remember, what we've got here is a max. This is an objective, an optimization problem that says the right distribution over trajectories is the one that maximizes that expression there. And what we're going to do now is see, if we knew all of those things, what the structural form would look like, and then we're going to use that to make some other steps.
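The constrained objective being referred to is not visible in the transcript; reconstructed in the notation used so far (with p̂ the expert's trajectory distribution, so the constraint matches the expert's expected reward), the max-entropy problem is:

```latex
\max_{p}\; -\sum_{\tau} p(\tau)\,\log p(\tau)
\quad \text{s.t.} \quad
\sum_{\tau} p(\tau)\,R_{\phi}(\tau) = \sum_{\tau} \hat{p}(\tau)\,R_{\phi}(\tau),
\qquad
\sum_{\tau} p(\tau) = 1 .
```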
[00:23:09] Okay, so now, just to get intuition about this functional form, what we're going to do is rewrite this using Lagrange multipliers. So we've got p here, and we're going to introduce λ, and I'm just going to write this as follows. I think this is illustrative, because it'll make it really clear where these structural forms that we're going to use come from. All right, so I'm just writing down our first Lagrange multiplier, and I suspect most of you have seen this, but if you haven't, feel free to come up to me afterwards. Okay, we're just rewriting the constrained optimization problem.
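Written out with one multiplier per constraint (a reconstruction of the board work, in the same notation):

```latex
\mathcal{L}(p,\lambda_1,\lambda_0)
= -\sum_{\tau} p(\tau)\,\log p(\tau)
+ \lambda_1 \Big( \sum_{\tau} p(\tau)\,R_{\phi}(\tau) - \sum_{\tau} \hat{p}(\tau)\,R_{\phi}(\tau) \Big)
+ \lambda_0 \Big( \sum_{\tau} p(\tau) - 1 \Big).
```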
[00:23:59] Okay, so we rewrote our constrained optimization problem as a single equation, and now we're going to take the derivative with respect to this, because remember, we want to optimize it. We're going to do it with respect to our trajectories, so we're just going to get the log of p(τ), plus p(τ) times the derivative of the log, which is just 1 over p(τ). And in the constraint term, the expert part doesn't have any p(τ) in it; only this part here does, so you'll get λ₁ R_φ(τ). [Student: does the summation go away?] Yeah, exactly. We're taking the derivative with respect to the probability of one particular trajectory τ, and that's why everything else goes away. And the important thing to notice here is that this term goes away because there's only a p-hat there; it came from the expert, so it just disappears. It's not a function of p(τ) at all. All right, so now we want to set this equal to zero, because we want to find the max, and then we're just going to do some algebra.
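Carrying that algebra out for a single p(τ) (again a reconstruction; the expert term drops out of the derivative exactly as described):

```latex
\frac{\partial \mathcal{L}}{\partial p(\tau)}
= -\log p(\tau) - 1 + \lambda_1 R_{\phi}(\tau) + \lambda_0 = 0
\;\;\Longrightarrow\;\;
p(\tau) = e^{\lambda_0 - 1}\, e^{\lambda_1 R_{\phi}(\tau)} \;\propto\; e^{R_{\phi}(\tau)},
```

absorbing the constant λ₁ into the scale of R_φ.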
[00:25:39] Okay, now we're just going to exponentiate. Why did we do this? Because I wanted to illustrate that what this means is that the probability distribution over trajectories which maximizes the entropy, subject to some constraints, is exactly proportional (this is the proportionality side) to the exponential of the reward for that trajectory. Which means that, in general, you put exponentially more weight on things that have higher reward, subject to the constraint that you have a probability distribution.
[00:26:30] Okay, all right, so what this shows is that if we take this principle of max entropy, then the functional form we end up with over our trajectories is proportional to an exponential. And that's an exponential family, for those of you who have seen exponential families before. So this is the structural form; this is the distribution that maximizes the entropy. And once we know that, we can leverage it to start trying to learn a reward function. Okay, so let's see how we do this, because remember, we just did this assuming that R_φ was known, for a particular R_φ. We're not taking a derivative with respect to φ here at all; it's just a derivative with respect to p(τ).
[00:27:20] All right, so what this means is that we can think of maximizing the entropy over the probability distribution, subject to the constraints, as equal to maximizing the likelihood of the observed data under this particular max-entropy distribution. Okay, so I'm going to write out what that would be. Remember, that's what we saw: if we maximize entropy, the functional form we get looks like this normalized exponential. So in particular, we'll just write that out again here. We say the probability of a particular trajectory τ_i, given some reward model φ, is equal to 1/Z(φ) times e^{R_φ(τ_i)}, where Z(φ) is our normalizing constant, because we have to have a well-formed probability distribution.
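As a minimal numerical sketch (a toy of my own, not from the lecture: it assumes a small enumerable set of trajectories whose rewards R_φ(τ_i) have already been computed), the normalized exponential p(τ_i | φ) = e^{R_φ(τ_i)} / Z(φ) looks like:

```python
import numpy as np

def maxent_trajectory_distribution(rewards):
    """p(tau_i | phi) = exp(R_phi(tau_i)) / Z(phi) over an enumerated trajectory set.

    rewards: array of R_phi(tau_i) values, one entry per trajectory.
    """
    # Subtract the max before exponentiating for numerical stability;
    # the shift cancels in the ratio, so the distribution is unchanged.
    w = np.exp(rewards - np.max(rewards))
    return w / w.sum()  # dividing by Z(phi), the sum of the weights
```

Trajectories with higher reward get exponentially more of the probability mass, which is exactly the behavior described above.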
[00:28:25] Okay, so this is structurally what it looks like. And notice that we can also write this in terms of states: this is also equal to e to the sum, over all the states inside your trajectory, of the reward of each state. Here I'm abusing notation a little bit, using R_φ(τ) or R_φ(s) to mean either the reward you get from a whole trajectory or the reward you get from a particular state; notice that we can use either of these.
[00:28:59] And this is our thing. So why is this helpful? We don't know what the reward function is; we don't actually have that, right? But what this means is that, since we know the functional form of the probability of τ under the max-entropy principle, we can now say: I'm not going to worry about this part; I'm going to assume this is the structural form, and now my unknown is just φ. Now I'm going to try to maximize the likelihood of my observed data by changing the parameterization of φ.
[00:29:44] So this observation (and when I say "this observation," I mean that the probability over τ that maximizes the constrained entropy looks like a normalized exponential) means we can now estimate, or learn, R_φ by maximizing the probability of our observed data. So we're going to treat this as a maximum likelihood problem.
[00:30:47] All right, and I'll just note here that this is a really elegant observation; it goes all the way back to Jaynes in 1957. When people were thinking about what it means to maximize the entropy of something subject to some constraints, they realized that you could convert it to this exponential family, and once you have that, your uncertainty is only with respect to this φ. In fact, this type of insight, at a very high level, is related to what you'll see next week in direct preference optimization, where sometimes we can reparameterize our objective function to get rid of what you might call nuisance parameters, things you don't care about directly, when there's one parameter you really want to learn.
[00:31:32] Okay, so let's see how we do the maximum likelihood. Now what we're going to try to do is actually learn that reward function, and we're going to leverage the fact that we know the structural form of this probability distribution over the trajectories. So we're going to maximize, over φ, the log of the product, over all of the trajectories in our expert data, of the probability of each of those trajectories. We're just saying we're going to try to maximize the probability that we observed the data we did under our reward function. And because of our structural form, we can rewrite this as follows: this is going to be a sum, since the log of a product is the same as the sum of the logs, and then I'm going to plug in what my functional form told me my probability distribution has to look like for my trajectories.
[00:32:37] All right, so this is just me plugging in that max-entropy form of the trajectories, and now I'm going to split it apart and rewrite it. The log and the exponential cancel, and then I have the log of the normalizing term. Now, notice that this part is independent of τ*, so we end up with two things: we have the max over φ of the sum, over τ* in our data set, of R_φ(τ*), minus the size of our data set times the log of the sum over τ of e^{R_φ(τ)}. The reason that happened is that this was all inside the sum, and this term was completely independent of τ*, so I could bring it out; the number of trajectories I have in D is just the cardinality of D.
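Written out (a reconstruction of the board algebra; D is the expert data set and τ* ranges over its trajectories):

```latex
\max_{\phi}\; \log \prod_{\tau^{*} \in D} p(\tau^{*} \mid \phi)
= \max_{\phi}\; \sum_{\tau^{*} \in D} \log \frac{e^{R_{\phi}(\tau^{*})}}{Z(\phi)}
= \max_{\phi}\; \sum_{\tau^{*} \in D} R_{\phi}(\tau^{*}) \;-\; |D| \,\log \sum_{\tau} e^{R_{\phi}(\tau)} .
```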
[00:33:58] All right, so now what we can do is take a derivative. I'm going to call this whole thing J(φ), because it's all parameterized by my particular reward function, and take the derivative of that, because in general we're going to do everything with gradient descent, as usual. Okay, so this is going to look like the sum, over all the trajectories in my expert data set, of the derivative with respect to my reward function, minus (let me make this nice and big) the second part. We have this log, so we have to take the derivative of that, and this term goes on the bottom; we're going to observe something about it in just a second. Then we still have our sum, over τ, of e^{R_φ(τ)} times the derivative, since I'm just taking the derivative of that whole term.
[00:35:09] All right, the important thing to notice here (I just took the derivative of both of these parts) is that this thing should look a little bit familiar. This is in fact exactly the expression we got for the probability of a particular trajectory. Let me just put that in: note that this is just equal to the probability of τ given φ (let me make sure I write "given φ"), because that's just this normalized exponential divided by that. So, writing it carefully: this goes to this, because this is equal to e^{R_φ(τ)} divided by the normalizing constant. So we can move this outside part in here, and then that expression in there is just equal to the probability of τ given φ.
[00:36:17] Okay, so going back to here, we just end up with the following: we get the derivative with respect to the reward function for every trajectory inside our expert data, minus the number of trajectories we have times the sum, over all trajectories, of the probability of that trajectory given φ, times the derivative of the reward with respect to φ at that trajectory. Okay, and that's our gradient step. So what this says is that if you want to take a step towards optimizing, you can compute the derivative with respect to your reward function.
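Putting those pieces together, the gradient step just described is (in the running notation):

```latex
\nabla_{\phi} J(\phi)
= \sum_{\tau^{*} \in D} \nabla_{\phi} R_{\phi}(\tau^{*})
\;-\; |D| \sum_{\tau} p(\tau \mid \phi)\, \nabla_{\phi} R_{\phi}(\tau),
```

where the second term comes from ∇_φ log Z(φ) = Σ_τ p(τ | φ) ∇_φ R_φ(τ).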
[00:37:10] All right, we have a few more steps to go. The next is that this is all in terms of trajectories, and we'd like to get it in terms of states. For that, we can just observe that, as before, the probability of a trajectory can be broken down into its components: it's equal to the product, from t = 1 to the length of the trajectory, of the probability of a_t given s_t (which is your policy) times the probability of s_{t+1} given s_t and a_t. This is just the probability of a trajectory, and we've seen it before. If we have that the probability of a trajectory is proportional, as we've seen, to e^{R_φ(τ)}, we know we can also write that as e to the sum, over the states inside the trajectory, of R_φ of each state.
[00:38:17] So then we can think of plugging that in for our derivative, and what we get is the same derivative as before, but in terms of states instead of trajectories. For all the states in your expert demonstrations, you take the derivative with respect to that state, minus |D| times the sum, over states, of the probability of that state appearing in a trajectory, times the derivative with respect to that state. Okay, and why is this interesting? Because basically what we're getting in this case looks like us trying to match the distribution over states that we see in the data set.
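In state terms, the same gradient reads (a sketch; here μ_φ(s) denotes the expected state visitation frequency under the distribution induced by the current reward, which is where the state densities discussed next come in):

```latex
\nabla_{\phi} J(\phi)
= \sum_{s \in D} \nabla_{\phi} R_{\phi}(s)
\;-\; |D| \sum_{s} \mu_{\phi}(s)\, \nabla_{\phi} R_{\phi}(s) .
```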
[00:39:04] Now, when we think of doing this, one other thing to note is where these state densities come from. Essentially, you can think of it as: I have some observed states and actions, and I'm going to think about what states and actions a different policy would induce. If you know the dynamics model and the problem is tabular (so, thinking back a few weeks, tabular with known dynamics), then you can actually compute the state-action distribution directly using dynamic programming. So if π is given, you can directly compute the states and actions. Let's see that: you can say μ₁(s) is equal to p(s₁ = s), and then for t = 1, ..., T; so this is time-indexed.
[00:40:09] Again, remember at a high level what we're trying to do here: we're trying to match the state-action frequencies between our observed expert policy and what we induce under our reward function. What this says is that you're going to estimate a reward function, compute an optimal policy given that reward function, and then count and see what your state-action distribution looks like under that resulting policy. If it matches your expert's, you're done; otherwise, you need to keep changing your reward function, your policy, and the resulting state-action distribution until they match.
[00:40:44] action distribution until they match okay all right so I'm just going to go
[00:40:46] Okay, all right. So I'm just going to go through briefly so you can see how this is computable. What this would say is that your distribution of states on the next time step depends on your distribution of states on the previous time step, the probability under your policy (actually, let me be a little careful there) of the action given the state, and the probability of the next state given the state and action: mu_{t+1}(s') = sum_s mu_t(s) sum_a pi(a|s) P(s'|s,a). And you can use this then to sum up over all time steps what your average density is for a particular state.
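In the tabular case this recursion is straightforward to implement. Here is a minimal sketch (my own illustrative code, not the course's), assuming a known dynamics array `P[s, a, s']`, a policy table `pi[s, a]`, and a start-state distribution `mu0`:

```python
import numpy as np

def state_visitation_frequencies(P, pi, mu0, T):
    """Tabular dynamic programming for mu[t, s] = P(s_t = s) under policy pi.

    P[s, a, s'] is the known dynamics model, pi[s, a] the policy's action
    probabilities, mu0 the start-state distribution (illustrative names,
    not from the lecture). Returns per-step densities and their average."""
    mu = np.zeros((T, len(mu0)))
    mu[0] = mu0
    for t in range(T - 1):
        # mu_{t+1}(s') = sum_s mu_t(s) * sum_a pi(a|s) * P(s'|s,a)
        mu[t + 1] = np.einsum('s,sa,sap->p', mu[t], pi, P)
    return mu, mu.mean(axis=0)
```

Each row of `mu` stays a valid probability distribution, and the averaged row is the per-state density summed over the horizon that the lecture refers to.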
[00:41:41] Okay, so what this means is that when you're trying to actually compute the derivative of your objective function with respect to the reward, you can plug these in and write down what you're going to get. You can see that it's fairly involved, but it is definitely possible: you sum over all your states, the probability of the state at each time step t, and your reward function. And this will simplify a bit if, so let me just write this out, your r is equal to the weights phi times some features f(s). Then when you take the derivative with respect to phi, you just get the features: dr/dphi, if the reward is linear, is just equal to your features. Okay, so I know this is a lot of algebra, but what this is saying is that the derivative, for how you want to change your reward, will just end up being a sum over all the features you have inside of your data, minus an additional term. So you compute all of this with respect to your observed features and the features you have in your data set.
[00:43:19] All right, so how does this all work when we put it fully together? What we have in this case is: you give as input some expert demonstrations, you initialize your phi, and then you do the following. You first compute an optimal policy given that reward, e.g. with something like value iteration; you compute the state visitation frequencies; you compute the gradient on the reward model; and then you update your reward model phi, and you repeat over and over again.
[00:43:56] All right, I'll just write out what that equation is here. Your derivative here would be the sum, over all the trajectories inside your data set, of the features for each of those trajectories, minus the sum over states of the probability of the state given your current parameterization, times the features for those states. So this is under a linear reward, which is what they derived it for, and this is what you would do over and over again.
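The loop just described can be sketched end to end. This is an illustrative reconstruction, not the lecture's code: it assumes a small tabular MDP with known dynamics `P[s, a, s']`, a state-feature matrix `F`, and the experts' average per-step feature counts `expert_feat`, and it uses a softmax over Q-values as a stand-in for the compute-the-optimal-policy step:

```python
import numpy as np

def policy_for_reward(P, r, gamma=0.95, iters=200, temp=0.1):
    """Value iteration for reward r, returning a softmax (near-greedy) policy."""
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        Q = r[:, None] + gamma * (P @ V)   # Q[s, a]
        V = Q.max(axis=1)
    e = np.exp((Q - Q.max(axis=1, keepdims=True)) / temp)
    return e / e.sum(axis=1, keepdims=True)

def avg_visitation(P, pi, mu0, T):
    """Average state density over T steps via the DP recursion above."""
    mu, total = mu0.copy(), mu0.copy()
    for _ in range(T - 1):
        mu = np.einsum('s,sa,sap->p', mu, pi, P)
        total += mu
    return total / T

def maxent_irl(P, F, expert_feat, mu0, T, lr=0.5, iters=100):
    """Repeat: policy for current reward -> visitation -> gradient -> update."""
    phi = np.zeros(F.shape[1])              # reward weights (the lecture's phi)
    for _ in range(iters):
        r = F @ phi                         # 1. current linear reward estimate
        pi = policy_for_reward(P, r)        # 2. (near-)optimal policy for r
        mu = avg_visitation(P, pi, mu0, T)  # 3. state visitation frequencies
        phi += lr * (expert_feat - mu @ F)  # 4. gradient step on reward weights
    return phi
```

On a toy two-state problem where the expert always moves to state 1, the learned weights assign higher reward to state 1, and the induced policy imitates the expert, which is exactly the matching condition described above.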
[00:44:36] All right, so let's pop up a level and check our understanding. Given all of this, which steps in the above algorithm rely on knowing the dynamics model? Is it computing the optimal policy? Computing the state visitation frequencies? Computing the gradient? Or does nothing require it? I told you that they did say they assumed access to the dynamics model, and I'll just write out that gradient again right here. So let's just take a second to review the algorithm and check in on any questions we might have about it. All right.
[00:45:33] And all the slides are on the web, so you're welcome to look. Before I go back, though, I guess it might be helpful: most of them I've just been writing out here, but you can also just think back to value iteration and how I was just starting to show that we could use dynamic programming to compute the state visitation frequencies. So you probably all remember value iteration, and this was the type of equation I was writing out: to compute the distribution over the states on the next time step, we're doing things like summing over the distribution for the previous time step, as well as the probability of an action given a state, as well as the dynamics model. That's the type of dynamic programming algorithm they were proposing there. And then you can also just think back to what we need in value iteration to be able to do this.
[00:46:39] All right, why don't you talk to a neighbor and see what you got. There's a lot of variance.
[00:48:19] All right, good. So this is a nice reminder of algorithms that we've seen from long ago. The answer in this case is one and two. To compute the optimal policy, generally with value iteration, you need access to both the reward model and the dynamics model, so this one is true. And then the dynamic programming algorithm we looked at also required access to the dynamics model, in order to back up and say: if you're in this state now, what's the distribution over states you're going to be in on the next time step? Once you have those, you don't need the dynamics for the gradient; once you've computed all the frequencies, you don't need it again. So assuming you've done those two things, you don't need it for the gradient, but it is being heavily used. And as you might imagine, that's also a pretty strong assumption: do we actually know what the dynamics model is for, say, a human physician making decisions, or a surgeon? It seems quite unlikely. You might know it for something like MuJoCo, but probably not in general.
[00:49:27] but uh but probably not in general okay so let me just summarize where
[00:49:30] okay so let me just summarize where these things are um this approach has
[00:49:31] these things are um this approach has been incredibly influential um as we
[00:49:34] been incredibly influential um as we said the initial one used linear rewards
[00:49:36] said the initial one used linear rewards and assume the Dynamics model is known
[00:49:39] and assume the Dynamics model is known uh but there was um a lot of follow-up
[00:49:41] uh but there was um a lot of follow-up work to this including some really nice
[00:49:43] work to this including some really nice work by Chelsea fin Finn um who is
[00:49:46] work by Chelsea fin Finn um who is faculty here now and have been for all
[00:49:47] faculty here now and have been for all but it's part of her PhD um she showed
[00:49:50] but it's part of her PhD um she showed that you could use General reward and
[00:49:52] that you could use General reward and cost functions you know like things like
[00:49:53] cost functions you know like things like deep neural networks and others so you
[00:49:55] deep neural networks and others so you could use much more complicated um
[00:49:57] could use much more complicated um functions and state spaces where you're
[00:49:59] functions and state spaces where you're not going to be able to use dynamic
[00:50:00] not going to be able to use dynamic programming to be able to enumerate sort
[00:50:02] programming to be able to enumerate sort of the um distribution Over States and
[00:50:05] of the um distribution Over States and then also she remove the the need to
[00:50:07] then also she remove the the need to know the Dynamics model so they had a
[00:50:10] know the Dynamics model so they had a really nice paper in 2016 showing how to
[00:50:11] really nice paper in 2016 showing how to do this with really sort of very general
[00:50:13] do this with really sort of very general Rich complex State spaces which has also
[00:50:16] Rich complex State spaces which has also been highly
[00:50:17] been highly influential but I think this idea of
[00:50:19] influential but I think this idea of saying like how at a high level that the
[00:50:20] saying like how at a high level that the challenge was what do we do about the
[00:50:22] challenge was what do we do about the fact that there are many reward models
[00:50:24] fact that there are many reward models that are compatible with people's
[00:50:25] that are compatible with people's behavior one thing you could do is you
[00:50:27] behavior one thing you could do is you say well the one we're going to learn is
[00:50:29] say well the one we're going to learn is the one that has maximum entropy and
[00:50:31] the one that has maximum entropy and this provides sort of a a recipe or an
[00:50:33] this provides sort of a a recipe or an approach to trying to tackle that
[00:50:35] approach to trying to tackle that problem and it turns out that can be
[00:50:37] problem and it turns out that can be very effective in many cases and in
[00:50:39] very effective in many cases and in Brian ze's approach um they ended up
[00:50:42] Brian ze's approach um they ended up using it for trying to model sort of
[00:50:43] using it for trying to model sort of taxi drive cars uh taxi car drivers Etc
[00:50:47] taxi drive cars uh taxi car drivers Etc it's been used in many cases
[00:50:49] it's been used in many cases since so let's pop up a level um we're
[00:50:51] since so let's pop up a level um we're finishing sort of our introduction to
[00:50:53] finishing sort of our introduction to imitation learning what we've seen is
[00:50:55] imitation learning what we've seen is that um imitation learning is this nice
[00:50:58] that um imitation learning is this nice approach where if you have access to
[00:50:59] approach where if you have access to existing demonstrations and it might be
[00:51:01] existing demonstrations and it might be hard to write down the reward function
[00:51:03] hard to write down the reward function you could try to learn from those what
[00:51:05] you could try to learn from those what optimal behavior is to try to match that
[00:51:08] optimal behavior is to try to match that the behavior you have access to um in
[00:51:12] the behavior you have access to um in some cases it can greatly reduce the
[00:51:13] some cases it can greatly reduce the amount of data needed to learn a good
[00:51:15] amount of data needed to learn a good policy we haven't talked a lot about
[00:51:17] policy we haven't talked a lot about that um precisely but there's some
[00:51:19] that um precisely but there's some really nice work on the theory of
[00:51:21] really nice work on the theory of imitation learning in RL and thinking
[00:51:23] imitation learning in RL and thinking about some ideas we'll see later in this
[00:51:25] about some ideas we'll see later in this course around sample complexity of like
[00:51:27] course around sample complexity of like is it provably harder to learn from
[00:51:29] is it provably harder to learn from optimal demonstrations versus in the RL
[00:51:31] optimal demonstrations versus in the RL setting um so there's a lot of really
[00:51:33] setting um so there's a lot of really nice aspects for imitation
[00:51:36] nice aspects for imitation learning the things that I think you
[00:51:38] learning the things that I think you should know in terms of going forward is
[00:51:40] should know in terms of going forward is you should certainly be very familiar
[00:51:41] you should certainly be very familiar with behavior cloning because it is a
[00:51:42] with behavior cloning because it is a technique that is used very frequently
[00:51:44] technique that is used very frequently so you can just reduce RL to supervise
[00:51:46] so you can just reduce RL to supervise learning when you have demonstrations um
[00:51:48] learning when you have demonstrations um but it's also good to understand what
[00:51:50] but it's also good to understand what this principle of Maximum entropy is
[00:51:51] this principle of Maximum entropy is doing how that relates to distribution
[00:51:54] doing how that relates to distribution over trajectories and how that is then
[00:51:56] over trajectories and how that is then sort of formed into a maximum likelihood
[00:51:58] sort of formed into a maximum likelihood optimization problem to learn the reward
[00:52:00] optimization problem to learn the reward model okay and I I think one thing to
[00:52:03] model okay and I I think one thing to notice in this case is like when they
[00:52:05] notice in this case is like when they did this they are not claiming that that
[00:52:07] did this they are not claiming that that is actually the reward model used by
[00:52:09] is actually the reward model used by people it is the reward model that is
[00:52:11] people it is the reward model that is compatible with people's demonstrations
[00:52:14] compatible with people's demonstrations that maximizes that you know and a
[00:52:16] that maximizes that you know and a distribution that maximizes
[00:52:18] distribution that maximizes entropy so it is not necessarily
[00:52:21] entropy so it is not necessarily claiming that it is exactly mention
[00:52:23] claiming that it is exactly mention mapping human
[00:52:25] Awesome. So now we're going to get into, you know, one example of human feedback, human input, and trying to use that to make good sequences of decisions under uncertainty. But there's actually a huge number of different ways to do this, and so in this class and the next class we're going to talk some about human feedback and reinforcement learning from human preferences. I think you can think about this on many different levels. You can think about it in terms of how humans could actively try to help reinforcement learning agents they are training to do something: maybe they want to train a robot to clean up the counter in their kitchen, and they have a particular way they want it done, so they might be actively trying to help the agent do that particular task. Or we might be trying to align, say, large language models with our values or intents, and so could we provide information that's going to shape their behavior across many tasks? Okay, so it is relevant to both of these different types of objectives, and I'm going to go through some different ways that people could be using human input in this sort of training.
[00:53:31] So one thing to note is that people have been thinking about this for quite a long time. I like this work by Andrea Thomaz and Cynthia Breazeal from MIT. They had this thing, it looks pretty primitive now, called Sophie's Kitchen, and the idea in this case is that you would be trying to teach an autonomous agent how to, you know, make a recipe or do some basic tasks in the kitchen. And of course, as you can see with this, we've come a long way in the last 20 years, which is wonderful. But the key insight here was: maybe we could learn much faster if you have a human in the loop. Instead of having an agent that's trying out things like epsilon-greedy and exploring the world by itself, which is not how humans do it most of the time (most of the time we have things like schools or guardians or friends giving us lots of feedback and help when we're trying to learn tasks), their insight was to try to do more effective and efficient robot learning by leveraging the fact that you could have a human in the loop providing feedback to the robot. And in this case, one thing that's important to note is that the robot is getting two different forms of input that it's trying to maximize: it's getting input both from the human and from the environment. So for example, and I don't remember if this exactly was in that particular domain, you could imagine something like an intrinsic reward, say a big cost if you drop something, but then maybe the human also says "that's good" when, you know, you stir, you make the right recipe. So there are two forms of signals being used to train. This is an example where it's more like DAgger: you have a human in the loop and they are actively trying to help the robot all the time.
[00:55:12] Another version of this is the TAMER framework from Brad Knox and Peter Stone over at UT Austin, and what they were again looking at is: maybe we can train agents to do things much better, much quicker, if we're willing to be in the loop. And these are all different approaches than the DAgger approach. So what are we looking at in this approach? This was again older, so this is looking at Tetris, a video game where we're trying to stack blocks and clear lines. What you could see in this case is that a lot of the previous work, like policy iteration (it doesn't matter exactly what these algorithms are, but the other competitive algorithms at the time), were at game three getting nothing; they just weren't clearing any lines. After a while they could start to learn much better, clearing many more lines later on. And what they found here is that by using human feedback, they were taking human feedback and learning an explicit reward model. So one thing you could imagine doing is something like model-free RL, where you're getting signals from the human and using them to update the agent's policy, but then you drop them; you're not doing any parametric modeling of the reward. In this case, they are trying to explicitly build a reward model from the human feedback. And you could see that they could get much better performance very quickly, but, you know, a bit like the problem with DAgger, people aren't going to stay around for thousands of games, and so you may not be able to exceed the performance, at least in this case, of an agent you allowed to train for much longer. But I think this is another example of a place where they're starting to do model-based approaches, where you are actually explicitly training a reward model from human feedback.
[00:56:52] And I think it's nice to think about there being at least one continuum of the type of support that humans could provide. It's really probably multi-dimensional, but at least one axis is this: if we humans are willing to provide data at all to train RL agents, one extreme might be that I'm only going to give demonstrations that I would do anyway as part of my normal behavior, or maybe that I'll do once. Another extreme is something like DAgger, or this constant teaching, where I'm willing to be a coach for my agent and I'm just going to sit there the whole time. And one of the things you might wonder is, well, what's in between? This is clearly a spectrum.
[00:57:33] And one thing that a lot of people have thought about quite a bit over the last 15 years is whether preferences, pairwise comparisons, are that sweet spot. The idea in this case is that you're not going to ask people to do constant teaching, but you are going to ask them to do a bit of work: in particular, you're going to ask them to compare different types of behaviors and say which they like better.
[00:57:57] Okay, and this is kind of in between on the level of human effort. So one of the first places this was discussed a lot was recommendation ranking systems. Yisong Yue, who's a professor now at Caltech, together with his PhD adviser Thorsten Joachims and others at Cornell, did some really nice early work on recommendation ranking systems. So imagine you have two different retrieval functions, and you put in some query: retrieval function A gives you one series of outputs, and retrieval function B gives you the other. And you'd like to learn, because you are Google or Bing or something like that, which of these two is better.
[00:58:36] And so the idea they came up with is: well, we can ask people which one is better. In particular, you could ask people to compare, say, the first item returned, or the second, or the complete ranking, and say which one is better. That's something that might be much easier for people to do than specifying a scalar reward function for how good it is that, say, CS 159 is returned for your query. Is that 17.3, or 2006, or minus 7 billion? It seems very hard to ask humans to do that, but they probably can do the comparison and say, well, this one seems a little bit better.
[00:59:17] So that's one area, and it was one of the early areas where people thought about how you could get feedback on recommendation systems to make them better. But there are lots of other applications, and as you can see, robotics and driving is one that people have thought about a lot. This is work by Dorsa Sadigh, another one of my great colleagues here, and what they were doing is to think about: if you're training a car to have different behaviors on the road, how do you get input from humans about which types of behaviors are going to be acceptable? So, for example, most of us would probably prefer the thing on the left to the thing on the right, because the thing on the left does not involve a car accident.
[00:59:56] But it is hard to write down an objective function for all the things you want to do when you're driving, including if it's hailing, or if a car suddenly stops in front of you; it's pretty subtle. So what Dorsa and her colleagues showed is that people can do these sorts of preference pairs, and in fact she and her lab here have done lots of really interesting work on thinking about which preference pairs to show people so that you can quickly get a sense of what their preferences are, to try to respect this human-effort aspect. So these are just two examples of the places people have thought about this, and of course ChatGPT is another, and we'll see more about that in a second.
[01:00:38] So in general, pairwise comparisons might be in this sweet spot, because making them is often easier than writing down a reward function, and it's much less work than DAgger-style constant teaching. You could imagine in recommendation systems having to say what the perfect response to a query is, like "which courses involve this topic": that might be really hard for people to write down, but it's easy for them to do the comparisons.
[01:01:02] Now, how do we think about this? Well, one way we can mathematically model it is the Bradley-Terry model. As we've seen with trying to model scalar rewards, when we start to think about having people compare preferences, we will often still be really interested in understanding a latent reward model. So the idea will often be that we assume people have some reward function in their head that may be hard for them to articulate, and what they can do is produce preference pairs that are compatible with that underlying reward function. Now, those might be noisy; we all make mistakes, so we're not going to assume that what I say corresponds perfectly to my latent reward function, but it is going to be noisily related to it. The Bradley-Terry model is one of these types of models that tries to relate internal preferences over items to how we might compare them.
[01:02:02] All right, so let's first start off with a simpler setting before we get to RL: K-armed bandits. We're going to see K-armed bandits shortly in the course, but for right now you don't really need to know what they are, except that there are only actions, no states: you just have K different actions, that's all, and they all have different rewards. What we're going to assume is that a human makes noisy pairwise comparisons, where the probability she prefers b_i (item i) to b_j is given by the exponential model we saw before (exponential models come up a lot): P(prefer b_i to b_j) = exp(r(b_i)) / (exp(r(b_i)) + exp(r(b_j))).
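As a concrete illustration, here is a minimal sketch of the Bradley-Terry preference probability; the function name and the reward values are just hypothetical choices for this example:

```python
import math

def bt_prob(r_i: float, r_j: float) -> float:
    """Bradley-Terry probability that item i is preferred to item j,
    given latent scalar rewards r_i and r_j."""
    m = max(r_i, r_j)                # subtract the max for numerical stability
    e_i = math.exp(r_i - m)
    e_j = math.exp(r_j - m)
    return e_i / (e_i + e_j)

print(bt_prob(20.0, 20.0))           # identical rewards -> 0.5
print(round(bt_prob(2.0, 0.0), 3))   # a reward gap of 2 -> 0.881
```

Note that only the gap between the two rewards matters, not their absolute values, which is why the first call returns exactly one half.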
[01:02:47] Okay, so what would the probability be that I prefer b_i to b_j if their rewards are actually identical to me? Say I don't mind whether I have deep dish pizza versus flat (I actually do have preferences, but imagine that I don't). What would the probability be in that case, according to this model? My internal reward for both of them is, say, plus 20, because I really like pizza. So what would the probability be if the two items are identical? Fifty percent, right. So this is automatically normalized: fifty percent if they're equal. If I like one thing much more (I do like deep dish a lot more), the rewards might be, say, 100 versus 10, and in that case my probability would be more like, say, 0.9 or 0.95.
[01:03:38] or 0.95 something like that um so this is
[01:03:42] 0.95 something like that um so this is just a particular model it is noisy uh
[01:03:45] just a particular model it is noisy uh if you read the P if I put a link later
[01:03:47] if you read the P if I put a link later to the reinforcement learning from Human
[01:03:49] to the reinforcement learning from Human feedback paper they make some additional
[01:03:51] feedback paper they make some additional assumptions of how people make
[01:03:53] assumptions of how people make preference um pairs uh on of this model
[01:03:56] preference um pairs uh on of this model but this is the basic model that a lot
[01:03:58] but this is the basic model that a lot of people have been looking at recently
[01:03:59] of people have been looking at recently to understand how internal latent
[01:04:02] to understand how internal latent rewards relate to external preferences
[01:04:06] rewards relate to external preferences one of the important things to note here
[01:04:07] one of the important things to note here is that this model is transitive which
[01:04:09] is that this model is transitive which means that if I want to know what this
[01:04:11] means that if I want to know what this sort of probability is between I and K
[01:04:14] sort of probability is between I and K so those are two particular items I can
[01:04:16] so those are two particular items I can deduce it from my probability for I to J
[01:04:19] deduce it from my probability for I to J and my probability from J to K so you
[01:04:21] and my probability from J to K so you can kind of chain things so this is a a
[01:04:24] can kind of chain things so this is a a transitive probability model
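A small sketch of that chaining property, assuming the Bradley-Terry form above: the log-odds of preferences add, so the odds of i over j times the odds of j over k give the odds of i over k (the reward values are hypothetical):

```python
import math

def bt_prob(r_i, r_j):
    """Bradley-Terry preference probability, written as a logistic in the reward gap."""
    return 1.0 / (1.0 + math.exp(r_j - r_i))

def chain(p_ij, p_jk):
    """Deduce P(i preferred to k) from P(i preferred to j) and P(j preferred to k):
    under Bradley-Terry, log-odds add, so odds multiply."""
    odds = (p_ij / (1.0 - p_ij)) * (p_jk / (1.0 - p_jk))
    return odds / (1.0 + odds)

# Hypothetical latent rewards for three items i, j, k.
r_i, r_j, r_k = 2.0, 1.0, 0.0
direct = bt_prob(r_i, r_k)
chained = chain(bt_prob(r_i, r_j), bt_prob(r_j, r_k))
print(abs(direct - chained) < 1e-9)  # True: chaining recovers the direct probability
```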
[01:04:26] Okay, so this was introduced roughly 70 years ago; it's a very popular model, and it came up early on in recommendation systems and elsewhere. Another thing that's useful to think about is this: in this setting where I just have K different actions I can take, and I want to learn what the reward is for somebody for all of them, if I want to think about finding a maximum (say, what's the best arm, what is the best action), I might want to understand, under these different preference models, what it means for something to be good. In the class so far we've often talked about just maximizing the value function and finding a good policy. Now let's think about, if I have K arms, which of them is best. A Condorcet winner is an item i such that, for every other item, you prefer item i to all the other items.
[01:05:18] item I to all the other items so I like of all the types of
[01:05:20] items so I like of all the types of pizza if I like deep Dr most that's the
[01:05:22] pizza if I like deep Dr most that's the condur winner okay and it it does mean
[01:05:26] condur winner okay and it it does mean that it has to that those probabilities
[01:05:28] that it has to that those probabilities have to be you know probability one it
[01:05:31] have to be you know probability one it just has to be greater than 05 it means
[01:05:32] just has to be greater than 05 it means I have to beat all of the other
[01:05:35] I have to beat all of the other options and and I'm bringing these up
[01:05:37] options and and I'm bringing these up right now because there's also been sort
[01:05:38] right now because there's also been sort of some later discussion of how do sort
[01:05:40] of some later discussion of how do sort of all the recent rlf work relate to
[01:05:43] of all the recent rlf work relate to ideas from social Choice um and
[01:05:45] ideas from social Choice um and computational economics about what are
[01:05:48] computational economics about what are we Computing you know what what is the
[01:05:50] we Computing you know what what is the sort of underlying objective we're
[01:05:51] sort of underlying objective we're Computing and and how are we um uh
[01:05:54] Computing and and how are we um uh distinguishing between different sorts
[01:05:55] distinguishing between different sorts of of responses LMS could give us so at
[01:05:58] of of responses LMS could give us so at the second thing so this is a pretty
[01:05:59] the second thing so this is a pretty high bar right this means that there has
[01:06:01] high bar right this means that there has to be one thing that uh beats everything
[01:06:04] to be one thing that uh beats everything else a cop plan winner is a little bit
[01:06:07] else a cop plan winner is a little bit less it just says it's the winner um if
[01:06:10] less it just says it's the winner um if it has the highest number of par wise
[01:06:12] it has the highest number of par wise victories against everything else so
[01:06:14] victories against everything else so that doesn't mean that you have to
[01:06:16] that doesn't mean that you have to prefer it to everything else it just it
[01:06:19] prefer it to everything else it just it means on average it is beaten it beats
[01:06:21] means on average it is beaten it beats the others
[01:06:24] And an item is a Borda winner if it maximizes the expected score, where the score against item b_j is one if you prefer b_i to b_j, 0.5 if they're equal, and zero otherwise; so it's a discretization of the wins and losses. Typically, algorithms for K-armed or dueling bandits (again, we'll go into what bandits are later) have focused on trying to find this. I don't necessarily need to find an item, say a ranking system, that is always better than everything else; I want to find one that on average beats everything else. They often construct these kinds of pairwise matrices, where you can think about whether these different actions beat those other actions.
[01:07:10] All right, so how would we learn these? The question is: we have all of these noisy pairwise comparisons, and what we want to do now is see if we can extract the underlying reward functions. Why would we want to do that? Well, once we have those underlying reward functions, we can figure out which arm or which action is best, and in the reinforcement learning case we can try to optimize for that reward function. So how do we do that? We're going to assume we have examples of the following form: item i, item j, and mu, where mu is one if you prefer the first item, zero if you prefer the other, and 0.5 if you don't care. So this is just like a classification task; think back to your supervised learning. It should look very much like a logistic regression task, and you can maximize the likelihood with cross-entropy.
[01:08:06] Okay, so we map it back to a standard logistic loss, where these are reward models, and in general we're going to parameterize them as deep neural networks or some other complicated function (it could be linear; it just depends). But once we have that, then we can try to find the set of parameters that maximizes the likelihood. So that's how we can fit a reward model when we are given preference pairs and observed preferences.
[01:08:37] um and observe preferences now you might wonder how do
[01:08:39] preferences now you might wonder how do we do this in RL because in RL we have
[01:08:41] we do this in RL because in RL we have States we have multiple actions we have
[01:08:44] States we have multiple actions we have trajectories the idea is pretty similar
[01:08:47] trajectories the idea is pretty similar in some ways to what we are seeing with
[01:08:48] in some ways to what we are seeing with Max entropy and that what we're going to
[01:08:50] Max entropy and that what we're going to do is to say well we have a trajectory
[01:08:52] do is to say well we have a trajectory if we have a trajectory we can think of
[01:08:53] if we have a trajectory we can think of there being a series of Rewards
[01:08:56] there being a series of Rewards so the reward of that trajectory is just
[01:08:58] so the reward of that trajectory is just the
[01:08:59] the sum so I plug in all of those sums and I
[01:09:03] sum so I plug in all of those sums and I prefer trajectory if I can get higher
[01:09:06] prefer trajectory if I can get higher reward for that trajectory than the
[01:09:08] reward for that trajectory than the other according to the same model so I
[01:09:10] other according to the same model so I essentially just map it back to as if it
[01:09:12] essentially just map it back to as if it was kind of like a bandit okay just now
[01:09:14] was kind of like a bandit okay just now that I have like two different
[01:09:16] that I have like two different trajectories we'll see an example of
[01:09:17] trajectories we'll see an example of this in just a
[01:09:19] Okay, so what do we do? We're now going to ask people to compare trajectories; we'll use that to learn our reward model; and then once we have our learned reward model, we can do PPO or something with it. So this gives us a reward model for our domain, and now we can try to optimize our policy with respect to it.
[01:09:40] So let's look at an example. The reinforcement learning from human feedback paper, more precisely called Deep RL from Human Preferences, came out in 2017, and they wanted to train something to do a backflip. What they noticed is that they needed about 900 bits of human feedback in order to learn to do this. So let's see what it looks like. Okay, so remember, they're trying to train this little MuJoCo-like agent to do a backflip. What they're going to show people is little clips, and they're going to ask: is the thing on the left doing a better job of trying to do a backflip, or is the thing on the right? They're just getting people to click left, right, left, right.
[01:10:26] right and so they're not having to say what is
[01:10:27] and so they're not having to say what is the reward function for doing a back
[01:10:29] the reward function for doing a back flip they're just saying I don't know
[01:10:30] flip they're just saying I don't know this one looks closer to back flip you
[01:10:32] this one looks closer to back flip you know or
[01:10:33] know or better okay and so what you can see here
[01:10:37] better okay and so what you can see here is that um some of them are going to be
[01:10:39] is that um some of them are going to be much better at like getting close to
[01:10:40] much better at like getting close to doing a backflip right so some of those
[01:10:42] doing a backflip right so some of those is actually pretty good and what they
[01:10:44] is actually pretty good and what they are saying is that they only needed
[01:10:46] are saying is that they only needed about
[01:10:47] about 900 examples in order to train it so
[01:10:50] 900 examples in order to train it so that it could actually learn to do a
[01:10:51] that it could actually learn to do a back flip which is pretty good okay
[01:10:55] back flip which is pretty good okay particularly if you think back to sort
[01:10:56] particularly if you think back to sort of like um deep Q learning and like the
[01:10:59] of like um deep Q learning and like the enormous amount of data and training
[01:11:01] enormous amount of data and training that they're often doing for um say
[01:11:03] that they're often doing for um say trying to learn Atari Etc which is
[01:11:05] trying to learn Atari Etc which is literally
[01:11:06] literally millions
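The left-vs-right clicks described above can train a reward model via the Bradley-Terry model, where P(A preferred over B) = sigmoid(r(A) − r(B)). Below is a minimal sketch, not the paper's actual code: it assumes a made-up scalar "feature" per clip (a stand-in for something like backflip progress) and fits a one-parameter reward by gradient ascent on the Bradley-Terry log-likelihood.

```python
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

random.seed(0)
# Synthetic comparisons: the clip with the larger feature is preferred.
pairs = []
for _ in range(200):
    a, b = random.random(), random.random()
    winner, loser = (a, b) if a > b else (b, a)
    pairs.append((winner, loser))

# Reward model r(clip) = theta * feature(clip); fit theta by gradient
# ascent on sum over pairs of log sigmoid(theta * (f_winner - f_loser)).
theta, lr = 0.0, 0.5
for _ in range(100):
    grad = 0.0
    for w, l in pairs:
        p = sigmoid(theta * (w - l))   # P(winner preferred) under model
        grad += (1.0 - p) * (w - l)    # d/dtheta of log p
    theta += lr * grad / len(pairs)

# The learned reward should rank a clearly better clip above a worse one.
print(theta > 0)
```

The same idea scales to the real setting by replacing the scalar feature with a neural network over clip observations.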
[01:11:09] Okay, so this is really cool, this is possible to do, and this is something that you're going to be doing in your homework. So in homework three you're going to be doing both RLHF and DPO. I'm really excited about this assignment; it's the first time we're doing it, so you can actually see how this works, how we can actually learn from human preferences. We're not making you do the human preferences; we're going to give you a data set, so you can see how we can actually train these agents.
[01:11:34] Now I know we're almost out of time, but I'll say just a little bit about this. I'll probably have a bit of time on Monday before we have our guest lecture, but I want to give you at least a little taste. So that paper was in 2017, and there was attention to it, but I feel like in many ways there wasn't a huge amount of work on that until much more recently.
[01:11:56] So I just want to share a couple of slides from Tatsu Hashimoto's NLP class. If we just think back, I think I showed this slide on the very first day of class: how is RLHF being used for ChatGPT? What they're doing there is getting demonstration data and doing supervised learning; this is basically what we would call behavior cloning. Then they're going to get this comparison data and train a reward model. Now in their case they might not just use two; you could actually have people rank between, say, four or something like that, and you can extend these models to do that. So you get labelers to do that, then you train the reward model, and then you use PPO to actually update the large language
[01:12:40] update the large language model now one thing that I think is
[01:12:42] model now one thing that I think is important to note in this case is that
[01:12:44] important to note in this case is that this is all really an instance of kind
[01:12:45] this is all really an instance of kind of meta reinforcement learning in the
[01:12:47] of meta reinforcement learning in the sense that what they're going to be
[01:12:49] sense that what they're going to be trying to do here unlike where we've
[01:12:51] trying to do here unlike where we've seen like you know you want to train
[01:12:52] seen like you know you want to train something to do one task like being to
[01:12:55] something to do one task like being to do a backflip they're trying to learn in
[01:12:57] do a backflip they're trying to learn in general a reward function that covers
[01:12:59] general a reward function that covers kind of all the tasks that people might
[01:13:01] kind of all the tasks that people might want to do with large language models
[01:13:03] want to do with large language models and so it's sort of this multitask
[01:13:05] and so it's sort of this multitask problem right and so when they do this
[01:13:07] problem right and so when they do this they're going to give you sort of a new
[01:13:08] they're going to give you sort of a new prompt like write a story about frogs
[01:13:11] prompt like write a story about frogs and then they will want the agent to do
[01:13:13] and then they will want the agent to do well on that which is likely a task it
[01:13:15] well on that which is likely a task it has maybe never seen before in its data
[01:13:18] has maybe never seen before in its data so I think that's also another important
[01:13:19] so I think that's also another important thing to note here is that the reward
[01:13:21] thing to note here is that the reward models that are being trained now are
[01:13:22] models that are being trained now are things that probably would have been
[01:13:24] things that probably would have been considered UM multitask settings before
[01:13:27] considered UM multitask settings before but now we're sort of lifting them and
[01:13:29] but now we're sort of lifting them and saying your task is just to do whatever
[01:13:31] saying your task is just to do whatever humans want to do with this chat chat
[01:13:33] humans want to do with this chat chat GPT in terms of answering questions and
[01:13:36] GPT in terms of answering questions and so how do you train a reward model that
[01:13:37] so how do you train a reward model that will be good for any of
[01:13:39] will be good for any of those so we'll continue talking about
[01:13:41] those so we'll continue talking about this next week we'll talk proba either
[01:13:43] this next week we'll talk proba either before or after the guest lecture um a
[01:13:45] before or after the guest lecture um a bit about how we actually do this uh but
[01:13:48] bit about how we actually do this uh but it basically just follows exactly along
[01:13:49] it basically just follows exactly along the framework that we've just seen there
[01:13:51] the framework that we've just seen there and I'll see you then thanks
Lecture 009
Stanford CS234 I Guest Lecture on DPO: Rafael Rafailov, Archit Sharma, Eric Mitchell I Lecture 9
Source: https://www.youtube.com/watch?v=Q7rl8ovBWwQ
---
Transcript
[00:00:05] Hi everybody, we're going to go ahead and get started, because we're going to be having a guest lecture today, which will start at 1:45. So welcome back. Just in terms of where we are, a few quick logistics things: the midterm, as everybody probably knows, is on Wednesday. It'll be in class; you're allowed to have one side of a normal sheet of paper in terms of your sheet of notes. All the material through today is going to be eligible for the exam. That was also in the Ed post, and you can see the Ed post for any additional information around midterms and prior exams. Because homework 2 was only due on Friday and a lot of people used late days through yesterday, we won't be able to grade it in time for the midterm, but we will release solutions, so those will be available by the end of today.
[00:00:52] All right, so let's start with a quick "refresh your understanding." This is on the polls, and then I'll do a quick recap of RLHF before we dive into our guest lecture.
[00:01:12] This will be a good reminder of some of the ideas that will be relevant to today's lecture as well.
[00:02:32] All right, we have pretty good consensus on the first one, that this is true: the Bradley-Terry model expresses the probability that someone will select one option over another option, so this is true. And we have pretty good consensus that the last one is false: in RLHF we do not update the model after each PPO rollout. There's a little bit of disagreement, particularly about these two, so why don't you turn to a neighbor and quickly see if you can resolve this.
[00:03:34] As a hint, it's useful to think about whether things can change based on whether or not it's positive or negative.
[00:04:15] All right, I hope everyone got a chance to think about that for a second. So the second one is true, and the third one is also true. Does somebody want to say why the fourth one is false?
[00:04:35] [Student: You multiply by a negative constant.]
[00:04:37] Yeah, exactly, so that is exactly right. If you multiply by a negative, of course, that's exactly flipping all the rewards, and so in general that will not preserve preferences. You can shift it by any constant, and if you go through the math you can see that the exponentials will all cancel, so that part is true.
[00:05:03] Okay, great. So what we talked about last time was maximum entropy inverse reinforcement learning, and we started talking about RLHF, including how you could use the Bradley-Terry model for Markov decision processes. I'm going to do a really quick discussion of RLHF with respect to large language models before we get into our guest lecture today, and then on Wednesday is the midterm.
[00:05:23] So as we talked about last week, while you could do imitation learning, where you get sort of full trajectories and you want to imitate those, that is less information than you might be able to get from pairwise preferences. And we talked about how pairwise preferences might be an interesting intermediary point between humans having to label, like they do in DAgger, at every step what someone should do, or provide really dense rewards, versus just providing demonstrations. And so this has motivated a long line of work, including preference learning. Recently we saw how you could learn the parameters of a Bradley-Terry model; as we saw just now, these are not unique in general, since you can translate the rewards and still preserve the resulting preferences. You can maximize this with cross-entropy, and last time we saw how you could do this for trajectories as well as for bandit-like problems where you only have a finite set of actions. In homework 3 you're going to be implementing both DPO and RLHF for Markov decision processes, so you'll get a chance to play with this, where you're using rollouts from MuJoCo-like problems.
[00:06:35] roll outs from mu joku like problems okay but before we go on to our guest
[00:06:37] okay but before we go on to our guest lecture I wanted to just briefly um go
[00:06:40] lecture I wanted to just briefly um go through how you go from doing this sort
[00:06:43] through how you go from doing this sort of um uh approach to learning uh reward
[00:06:46] of um uh approach to learning uh reward models all the way to chat
[00:06:48] models all the way to chat GPT and so for this I'm going to draw
[00:06:50] GPT and so for this I'm going to draw upon some of Tatsu Hashimoto's really
[00:06:53] upon some of Tatsu Hashimoto's really nice lecture notes from an NLP class so
[00:06:55] nice lecture notes from an NLP class so recall from the start of uh start of the
[00:06:58] recall from the start of uh start of the reinforcement learning course we looked
[00:07:00] reinforcement learning course we looked at this sort of pipeline from chat GPT
[00:07:02] at this sort of pipeline from chat GPT and here we had the demonstration data
[00:07:04] and here we had the demonstration data collecting the comparison data and then
[00:07:06] collecting the comparison data and then optimizing a
[00:07:07] optimizing a policy so now we've seen sort of how
[00:07:10] policy so now we've seen sort of how those last two steps happen so in
[00:07:12] those last two steps happen so in particular you can generate pair wise
[00:07:15] particular you can generate pair wise preferences or in fact you can generate
[00:07:16] preferences or in fact you can generate full rankings um and then use that to
[00:07:19] full rankings um and then use that to learn a reward model and so while we
[00:07:22] learn a reward model and so while we thought before about different ways of
[00:07:23] thought before about different ways of doing this as a particular example
[00:07:25] doing this as a particular example involving language you might say someone
[00:07:27] involving language you might say someone might prefer an earthquake hit San
[00:07:28] might prefer an earthquake hit San Francisco there was mining minor
[00:07:30] Francisco there was mining minor property damage but no injuries versus a
[00:07:33] property damage but no injuries versus a 4.2 magnitude earthquake hit San
[00:07:35] 4.2 magnitude earthquake hit San Francisco resulting in massive damage
[00:07:36] Francisco resulting in massive damage versus Barry has good weather but it sun
[00:07:39] versus Barry has good weather but it sun has Wildfire wildfires and
[00:07:41] has Wildfire wildfires and earthquakes so you can see in this case
[00:07:43] earthquakes so you can see in this case that these are places where someone
[00:07:45] that these are places where someone might be able to provide different
[00:07:46] might be able to provide different rankings in response to prompts so now
[00:07:48] rankings in response to prompts so now you can think of the context as being a
[00:07:50] you can think of the context as being a prompt and the output as being um all
[00:07:53] prompt and the output as being um all the actions or all the different
[00:07:54] the actions or all the different responses you can have and people are
[00:07:55] responses you can have and people are going to rank them
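For full rankings of responses to a prompt, the standard generalization of Bradley-Terry is the Plackett-Luce model (a sketch of that model, not necessarily the exact one used in any particular system): the probability of a ranking is a product of softmax picks, removing each chosen response in turn. Rewards below are illustrative values for one fixed prompt.

```python
import math
from itertools import permutations

# Plackett-Luce probability of a full ranking (best first) given
# per-response rewards for a single prompt.
def plackett_luce(ranking, rewards):
    prob, remaining = 1.0, list(ranking)
    for item in ranking:
        z = sum(math.exp(rewards[r]) for r in remaining)
        prob *= math.exp(rewards[item]) / z
        remaining.remove(item)
    return prob

rewards = {"A": 1.2, "B": 0.3, "C": -0.5}  # illustrative rewards
total = sum(plackett_luce(p, rewards) for p in permutations(rewards))
print(abs(total - 1.0) < 1e-9)  # probabilities over all rankings sum to 1
```

With a two-item "ranking" this reduces exactly to the Bradley-Terry pairwise probability.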
[00:07:59] Now, building on that, before you actually do PPO or something, you may want to try to check the quality of your reward model, and this is something that you can also think about for homework 3. In general, depending on the amount of data you have and the complexity of your reward model, you're going to be able to do a better or worse job of capturing the underlying latent reward model of people. So in this case, this is looking at different model sizes, and these are big models: a lot of the models that people have thought about historically are things like linear models or neural network models, but these can be extremely large models; they can be on the same order as large language models. It's not uncommon to see, you know, seven-billion-parameter reward models. And what they're looking at here is validation accuracy, and what you can see here is that when you start to get enough data and you have a big enough model, then you can start to capture really complex reward models. So that's a useful thing to think about when you're thinking about your projects or your homeworks: what is the complexity we need in order to start to capture human preferences?
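The validation accuracy just mentioned is simply the fraction of held-out preference pairs on which the reward model scores the human-preferred response higher. A toy sketch, with a made-up length-based scorer standing in for a trained reward model:

```python
# Toy stand-in for a trained reward model: prefers longer responses.
def reward_model(response):
    return len(response)

# Held-out (preferred, rejected) pairs; strings are illustrative.
held_out_pairs = [
    ("a detailed helpful answer", "meh"),
    ("thorough explanation", "short"),
    ("ok", "a long but dispreferred response"),  # model gets this one wrong
]

# Validation accuracy: how often r(preferred) > r(rejected).
correct = sum(reward_model(w) > reward_model(l) for w, l in held_out_pairs)
accuracy = correct / len(held_out_pairs)
print(accuracy)
```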
[00:09:02] Okay, and then once you have that, we have everything we need to do that pipeline. So you've gotten a lot of preferences; now again the question is how many of those preferences you need, and it might be a lot. So if you look back here, this is quite a lot of preference data. Now, it's not the same amount of data that we generally need to train an LLM, but it's not like one or two either, and in fact there's a lot of ongoing, interesting work in trying to think about how we reduce the amount of online preference data that we need in order to train these. By "online" I just mean additional data compared to the historical data. So in reinforcement learning from human feedback, what we can do is, once we have that learned reward model, use it with PPO.
[00:09:48] And one of the important things to note here is that, just like how we saw for PPO before, in general we're going to need some sort of reference decision policy, maybe one we get from behavior cloning or supervised fine-tuning, and we want to regularize so we don't get too far from that when we're doing PPO. And so that sort of divergence is going to be just as important as what we've seen in
[00:10:15] the previous work. And one of the things that's been noted is that, perhaps not surprisingly given the huge success of ChatGPT, this type of approach can make a significant difference. So by leveraging rewards and doing RLHF, there really was a substantial gain over previous approaches, even when you fix the model size. So that suggests that changing the optimization function we're using, and using these reward functions, really can lead to substantial gains in performance.
[00:10:43] So I think something that's important to notice here is what we are doing the reinforcement learning over, and how we are training the reward model, in comparison to what we've talked about mostly in this class. This is really where you're trying to do something almost like meta reinforcement learning or multitask reinforcement learning. So instead of training an agent to do one task, like do a backflip or, you know, solve a grid world, we're really trying to train a large language model here to do any possible task a user might want. And so then when we're collecting data and we're doing these comparisons, you might have an enormous number of different tasks, from writing a thank-you letter to making a website to lots of different things; all things that used to be considered different tasks will likely be involved in this.
[00:11:32] So another thing that I think is useful to note: this is a comparison from 2023, also from Stanford; there's also been a lot of other work, and this is a very important ongoing area, to understand how good these approaches are. And one thing that's useful to know is that best-of-n is an alternative, where you could, for example, just generate n samples from your original model and then use your reward model to pick the best one according to that reward model. So that doesn't use any reinforcement learning, doesn't use PPO; it's just using your reward model as sort of an external expert to pick among all of your generations.
[00:12:11] And what you can see here is that that also does pretty well relative to PPO. Now, in general it doesn't do quite as well, but I think it's really useful to think about some of these alternative baselines, particularly depending on whether or not you have access to actually training the model again, versus access to training a reward model, and you might have access to an off-the-shelf LLM, and you might be able to combine these. It's a very active ongoing area to figure out what's the best way to train and refine these sorts of models.
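The best-of-n baseline just described can be sketched in a few lines. The sampler and reward function here are toy stand-ins (no real model is involved): sample n candidates from a frozen model, score each with the reward model, and return the highest-scoring one, with no policy update at all.

```python
# Best-of-n: sample n responses, keep the one the reward model likes most.
def best_of_n(prompt, sample_fn, reward_fn, n=4):
    candidates = [sample_fn(prompt) for _ in range(n)]
    return max(candidates, key=reward_fn)

# Toy demo: a "sampler" that cycles through canned responses, and a toy
# reward model that simply prefers longer responses.
canned = iter(["ok answer", "great detailed answer", "bad", "fine"])
pick = best_of_n("write a story about frogs",
                 sample_fn=lambda p: next(canned),
                 reward_fn=len)
print(pick)
```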
[00:12:42] Okay, so that was a five-minute overview of how people use RLHF to train ChatGPT, and now I'm really excited to have our guest lecture on direct preference optimization.
[00:12:56] All right, well, I'm super delighted to have Rafael, Archit, and Eric here today to talk about direct preference optimization. I really appreciate you coming; I know you've done this rodeo before, at NeurIPS, in terms of balancing between three people. For those of you that don't know, direct preference optimization got an Outstanding Paper runner-up award at NeurIPS this year, which is the premier machine learning conference. It's also already had a huge impact, really quite broadly, on the LLM community as an alternative to RLHF, so I think it's extremely exciting. You'll get to do, to my knowledge, the first homework that incorporates RLHF and DPO, which will be really great. What they're going to talk about today is this, and they also just had a new paper drop on arXiv a few days ago about some extensions, so I think it's very timely. Thanks so much.
[00:13:53] think it time that thanks so much y u well yeah thanks um so much Emma for
[00:13:57] y u well yeah thanks um so much Emma for having us um
[00:13:59] having us um it's funny when when you talk about sort
[00:14:01] it's funny when when you talk about sort of the the impact of the paper you sort
[00:14:03] of the the impact of the paper you sort of want to say RL but I guess it's like
[00:14:05] of want to say RL but I guess it's like llm Community or like what what even is
[00:14:08] llm Community or like what what even is the community anymore it's it's hard to
[00:14:09] the community anymore it's it's hard to draw the boundaries between things but I
[00:14:11] draw the boundaries between things but I think it's so cool like to see how the
[00:14:13] think it's so cool like to see how the boundaries are kind of breaking down
[00:14:15] boundaries are kind of breaking down between these areas um so yeah as Emma
[00:14:18] between these areas um so yeah as Emma said we're going to talk a bit about rhf
[00:14:20] said we're going to talk a bit about rhf and DPO and we have a little bit of
[00:14:22] and DPO and we have a little bit of background that um I'll I'll do to set
[00:14:24] background that um I'll I'll do to set things up for um these guys to to uh
[00:14:26] things up for um these guys to to uh bring things home and some of this is
[00:14:28] bring things home and some of this is probably going to be reviewed um from
[00:14:29] probably going to be reviewed um from things Emma has already covered but just
[00:14:30] things Emma has already covered but just to kind of make sure we're all on the
[00:14:32] to kind of make sure we're all on the same page um we are in fact talking
[00:14:35] same page um we are in fact talking about this setting of reinforcement
[00:14:37] about this setting of reinforcement learning from Human feedback and um as a
[00:14:41] learning from Human feedback and um as a like small piece of sort of background
[00:14:43] like small piece of sort of background or setup here you know why are we
[00:14:45] or setup here you know why are we talking about rhf why are we doing RL on
[00:14:48] talking about rhf why are we doing RL on language models why are we talking about
[00:14:50] language models why are we talking about it now um people did not start doing RL
[00:14:52] it now um people did not start doing RL on language models a few years ago when
[00:14:54] on language models a few years ago when chat GPT came out people have been doing
[00:14:56] chat GPT came out people have been doing RL on language models for a long time
[00:14:58] RL on language models for a long time but
[00:14:59] but um you know this this sort of chat gbt
[00:15:01] um you know this this sort of chat gbt moment so to speak um is something that
[00:15:02] moment so to speak um is something that I think really brought um these these RL
[00:15:05] I think really brought um these these RL methods to language models into the
[00:15:06] methods to language models into the Forefront of people's minds because
[00:15:08] Forefront of people's minds because there's sort of a sense in which things
[00:15:09] there's sort of a sense in which things really started working you know for the
[00:15:11] really started working you know for the for the first time in a way that maybe
[00:15:12] for the first time in a way that maybe they they didn't before and a lot of
[00:15:14] they they didn't before and a lot of this um comes from being able to start
[00:15:16] this um comes from being able to start from a really strong pre-trained model
[00:15:18] from a really strong pre-trained model that that already has a lot of
[00:15:19] that that already has a lot of interesting kind of um sort of skills um
[00:15:22] interesting kind of um sort of skills um and pre pre-learned behaviors that we
[00:15:24] and pre pre-learned behaviors that we can that we can fine-tune and so we
[00:15:26] can that we can fine-tune and so we don't have to start from scratch when
[00:15:28] don't have to start from scratch when we're doing RL typically on these
[00:15:29] we're doing RL typically on these language models and that that makes it a
[00:15:31] language models and that that makes it a lot more kind of access to to get some
[00:15:33] lot more kind of access to to get some benefits from these
[00:15:34] Okay, so in RLHF we have this three-stage pipeline that has been popularized by ChatGPT. In the first stage, and I think Emma actually showed this same figure in her slides a minute ago, so this isn't totally new, there's really a step zero, which is the unsupervised pre-training: we just fit a big generative model to a ton of text, and in some sense we meta-learn some skills we're going to select from. Then we collect some supervised demonstrations from humans: we'll have some dataset of prompts, you know, "explain the moon landing to a six-year-old," and a human is going to write a sensible, good demonstration response to that prompt, and we're just going to do supervised fine-tuning on those. This is going to serve as that reference policy Emma was talking about a few minutes ago, the thing we're going to constrain our model to, to make learning a little easier and also to avoid over-optimization of the approximate proxy reward function we learn in the second stage.
[00:16:37] The second stage is when we learn the reward model. Here's when we collect preference data: we sample responses, typically from the supervised fine-tuned model we learned in the first stage, and we ask a human, or often a collection of humans, to provide ranking annotations over multiple draws from that supervised fine-tuned model. We use those preferences to learn a reward model: a mapping from a prompt, or a dialogue history, and a potential response to a scalar reward. Then in the third stage we do policy learning: we try to fine-tune that supervised fine-tuned model to generate responses that receive high reward from the reward model.
[00:17:18] Okay, so the first step is pretty straightforward supervised fine-tuning; we don't really need to talk about it very much, and again, Emma already covered some of these things, so hopefully this is mostly review, but of course ask questions if anything seems funny. Like I said, the feedback here, the thing we're going to get, is preferences over responses from our model. So we're going to end up with some dataset of a prompt, and this could be a single prompt or an entire dialogue history, so multiple turns and then the most recent user message, and then typically two responses over which we give a binary preference. You can do rankings over more responses, but the returns can plateau relatively quickly, and it's typically better to have more prompts and fewer responses per prompt.
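Concretely, each annotation in the kind of dataset being described boils down to a record like the following. The field names here are made up for illustration, not a standard schema:

```python
# One preference record: a prompt (or a whole dialogue history) plus the
# response the labeler preferred and the one they rejected. Note there is
# no absolute score anywhere -- only the binary comparison.
preference_example = {
    "prompt": "Write me a recipe for making a really good cake.",
    "chosen": "Preheat the oven to 350F, then cream the butter and sugar...",
    "rejected": "Cake is a baked dessert. Good luck.",
}

# A preference dataset is just a list of such records.
dataset = [preference_example]
print(len(dataset), sorted(preference_example))
```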
[00:18:04] I think it's worth mentioning briefly why we're talking about preferences over responses instead of directly asking for reward annotations. You could take your prompt and your response and just ask the human, "one to ten, how good of a response is this?" There are a couple of reasons for this. First of all, we actually have another set of slides with an example of this which I think makes it quite clear; I don't think we have it in this deck. But if you take two different humans, and you say, here's a prompt that says "write me a recipe for making a really good cake," and you have two different responses from your model, and you ask human A what reward they give to this response and that response, and you ask another human the same question, you can end up with the same ranking over responses but a lot of disagreement in the actual rewards. So people are not really calibrated to each other in terms of the absolute rewards they assign, and it's also just more cognitively difficult to assign an absolute number, in contrast to anchoring to one response and then just deciding whether another one is better or worse. So in some sense, I think of gathering preferences, as opposed to asking humans to write high-quality demonstrations or asking humans to assign the reward directly, as a way to get a higher return of annotation information per unit of cognitive effort of the human labeler.
[00:19:29] So we're going to get these preferences, and now we use the Bradley-Terry model, which is a very simple model of discrete choice in humans. It relates a scoring function, in this case a reward function, that's this r(x, y) for a prompt x and a response y, to a probabilistic decision over two discrete choices. Here our discrete choices are the thing that was labeled preferred in the dataset and the thing that was labeled dispreferred in the dataset, and we want to train a probabilistic model, again our reward model, to maximize the likelihood of this observed data. We need to decide on some model that relates a scoring function to these choices in order to do maximum likelihood, and that is this Bradley-Terry model, in which we can then simply do maximum likelihood, or, you know, minimize this negative log-likelihood loss. So we're using this Bradley-Terry conceptual model of choices, it turns into a maximum-likelihood loss, and we're simply solving a binary classification problem. We have a binary classifier where the logit is just the difference in reward: the reward our model assigns to the chosen response minus the reward it assigns to the dispreferred, rejected response. We treat that as the logit of a binary classifier and do maximum likelihood.
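The negative log-likelihood loss just described can be written out numerically. A sketch under the simplifying assumption that the reward model's outputs are just two scalars here; in practice they come from a neural network, and this loss would be minimized over its parameters:

```python
import math

def bt_loss(r_chosen, r_rejected):
    # Bradley-Terry negative log-likelihood for one preference pair:
    #   -log sigmoid(r(x, y_w) - r(x, y_l))
    # The reward difference is the logit of a binary classifier that
    # predicts which of the two responses the human chose.
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(bt_loss(2.0, 0.0))  # correct ranking by a wide margin: small loss
print(bt_loss(0.0, 2.0))  # wrong ranking by a wide margin: large loss
print(bt_loss(1.0, 1.0))  # tie: loss is log 2, the chance-level value
```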
[00:20:44] Once we do that, we get a reward model; we've finished step two. Now we need to find a policy that actually optimizes this reward, and really this is the RL bit, so to speak. Here we want to learn pi-theta, the policy we're actually fine-tuning, and the objective is: we have some dataset of prompts or conversation histories, and in expectation, for responses sampled from our policy, we want to achieve high reward. But that's not the full story here. If we just optimize to maximize this reward, what can happen? I'm not sure if you've talked about this already. Okay, perfect. Anybody have any worries about just optimizing this objective, or are we good? [A student suggests the model could exploit the learned reward and forget the rest of the objective.]
[00:21:44] Perfect. Okay, so one thing that can happen here is, remember, this is not a true reward function; it's something we learned from a finite dataset. So there's going to be some distribution on which it gives us accurate or meaningful reward, and outside of that distribution there's no guarantee this thing generalizes meaningfully. So what we typically end up doing is adding an additional constraint, a KL penalty from our starting model, that SFT model or reference model, to say: I want you to maximize reward, but I don't want you to drift too far from the starting model. Because again, our reward model was trained on preferences over samples from that reference model, so if we drift far from the reference model, we're out of distribution for the data our reward model was trained on, and we can basically start getting bogus reward scores if our policy changes too much from that reference model.
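Per sample, the KL-penalized objective being described is r(x, y) − β(log π_θ(y|x) − log π_ref(y|x)). A numeric sketch with made-up log-probabilities and a made-up β, just to show the drift penalty at work:

```python
def penalized_reward(r, logp_policy, logp_ref, beta=0.1):
    # KL-penalized RLHF objective for a single sampled response:
    # reward minus beta times the log-ratio between the fine-tuned policy
    # and the frozen reference (SFT) model. Larger beta keeps the policy
    # closer to the distribution the reward model was trained on.
    return r - beta * (logp_policy - logp_ref)

# No drift: the policy matches the reference, so the penalty vanishes.
print(penalized_reward(r=1.0, logp_policy=-3.0, logp_ref=-3.0))
# Drift: the policy now strongly favors this response over the reference,
# so part of the raw reward is eaten by the penalty.
print(penalized_reward(r=1.0, logp_policy=-2.0, logp_ref=-5.0))
```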
[00:22:34] [Student] Is the reference model ever changing? — It depends on the algorithm. In the original, canonical version of RLHF, no, it was a fixed reference model, but there has been a lot of work since then showing ways to update the reference model, using a moving reference model over time.
[00:22:52] [Student] The data is coming from that reference model? You mean both y_w and y_l are coming from the same model? — Again, in the original canonical form of RLHF, yes. Since then people have proposed a wide variety of different sampling schemes, ways to select which pair of responses you show the human to get a preference over, but in the original vanilla version of RLHF, yeah, you typically sample two responses from the reference model, get a preference over them, and use that to learn your reward model.
[00:23:19] [Student] I've heard that in practice responses might come from the same model with different temperatures, or from different models. Does that in theory mean we're doing something wrong? — Well, I'm not sure in theory it means you're doing something wrong. One way to think about this is, again, we want our reward model to perform well across the state-action space. If you think of our state space, our context space, as the conversational history so far, and our action space as the response, you want good coverage over this space so that you get meaningful rewards out when you actually update your policy. So in principle, yes, we would like to be able to cover this space; assuming we have a model with high enough capacity to model all of it, we'd like to cover as much of the space as possible. So a more diverse preference dataset is very helpful, and there's some trade-off between concentrating our preference dataset on the things that are high quality, while also making sure we cover a wide variety so we don't overestimate rewards for bad stuff.
[00:24:24] Okay, one more, and then I'll hand it off to these guys.
[00:24:27] [Student] In learning, we know that even if you have a limited dataset, if you have a large enough network and train it to near-zero error on the limited training set, it can generalize well on the test set too. Why is that not applicable here for the reward model? — Well, it is applicable, in the sense that the same sorts of phenomena, like double descent and things like that, are still applicable in this case. You will typically get better performance from using a larger reward model, but there are limits to this. There's only a certain amount of information content in a finite dataset of preferences, and so there are limits to how far you can push that model to generalize to new things. If my preference dataset only has questions about what types of pets someone likes, it's just not going to tell you anything about quantum field theory, no matter how big you make your model. So there are limits, but yes, you would expect some level of generalization.
[00:25:25] you you you would expect some level of generalization Okay cool so so that that
[00:25:27] generalization Okay cool so so that that is a primary on r and um basically uh
[00:25:30] is a primary on r and um basically uh unfortunately what we end up with is
[00:25:32] unfortunately what we end up with is this you know if we're doing PO for
[00:25:34] this you know if we're doing PO for example this ends up being really really
[00:25:35] example this ends up being really really really complicated uh in the policy
[00:25:37] really complicated uh in the policy learning stage so there are a lot of
[00:25:38] learning stage so there are a lot of moving pieces here and I guess you all
[00:25:39] moving pieces here and I guess you all will have the distinct pleasure of
[00:25:41] will have the distinct pleasure of implementing this spere homework um
[00:25:45] implementing this spere homework um congratulations um but there are a lot
[00:25:48] congratulations um but there are a lot of moving pieces here and that was sort
[00:25:49] of moving pieces here and that was sort of one of the motivating reasons for for
[00:25:51] of one of the motivating reasons for for why um DPO came to be basically was that
[00:25:55] why um DPO came to be basically was that PO turns out it was a little bit tricky
[00:25:57] PO turns out it was a little bit tricky to to get it to work for particular
[00:25:58] to to get it to work for particular problem that that Raphael was um uh
[00:26:01] problem that that Raphael was um uh initiating some research on so um anyway
[00:26:04] initiating some research on so um anyway that that's sort of the the background
[00:26:05] that that's sort of the the background on rhf and I'm going to leave it to
[00:26:06] on rhf and I'm going to leave it to archit now to to give an overview of
[00:26:09] archit now to to give an overview of thepo all right thanks Eric uh is this
[00:26:12] thepo all right thanks Eric uh is this working okay cool just all right who's
[00:26:15] working okay cool just all right who's ready for the fun ma stuff
[00:26:18] ready for the fun ma stuff um so you saw the scary picture here and
[00:26:22] um so you saw the scary picture here and really the question we wanted to start
[00:26:24] really the question we wanted to start with is like do we need to do all this
[00:26:25] with is like do we need to do all this just to like find iner model according
[00:26:27] just to like find iner model according to human preferences
[00:26:29] to human preferences and unsurprisingly the answer is going
[00:26:30] and unsurprisingly the answer is going to be no so like be prepared for the
[00:26:32] to be no so like be prepared for the ride um and yeah um we saw this
[00:26:35] ride um and yeah um we saw this objective earlier and kind of before we
[00:26:39] objective earlier and kind of before we go into the math like I want to just
[00:26:40] go into the math like I want to just give a high level picture of what is
[00:26:42] give a high level picture of what is going to happen here um we had some
[00:26:44] going to happen here um we had some reward function which kind of told us
[00:26:46] reward function which kind of told us what humans like and humans do not like
[00:26:48] what humans like and humans do not like um and right now we're parameterizing
[00:26:50] um and right now we're parameterizing that as a separate Network saying that
[00:26:52] that as a separate Network saying that this will give us a score for which
[00:26:53] this will give us a score for which answer is good and which answer is bad
[00:26:56] answer is good and which answer is bad now really like a can we like sort of
[00:26:59] now really like a can we like sort of Leverage the idea that our language
[00:27:01] Leverage the idea that our language models have these probabilities over
[00:27:03] models have these probabilities over completions and the completions right
[00:27:05] completions and the completions right now represent any distribution over the
[00:27:07] now represent any distribution over the internet but can we like overload it
[00:27:09] internet but can we like overload it somehow to basically represent oh can we
[00:27:12] somehow to basically represent oh can we only put probability on things that
[00:27:13] only put probability on things that humans like and that's roughly the idea
[00:27:16] humans like and that's roughly the idea we're going to try to exploit is that
[00:27:17] we're going to try to exploit is that there's essentially a mapping between
[00:27:19] there's essentially a mapping between the language model and the reward model
[00:27:22] the language model and the reward model itself one-on-one mapping that you can
[00:27:24] itself one-on-one mapping that you can use to directly train the policy on
[00:27:26] use to directly train the policy on preferences themselves and towards the
[00:27:29] preferences themselves and towards the end of this what you're going to have is
[00:27:30] end of this what you're going to have is a distribution of responses that are not
[00:27:32] a distribution of responses that are not just arbitrary text responses on the
[00:27:34] just arbitrary text responses on the internet but responses that humans like
[00:27:37] internet but responses that humans like and that's where direct preference
[00:27:38] and that's where direct preference optimization will come in um how do we
[00:27:41] optimization will come in um how do we do that that's where the math is going
[00:27:42] do that that's where the math is going to be so we saw the rhf objective which
[00:27:45] to be so we saw the rhf objective which is essentially we want to maximize the
[00:27:47] is essentially we want to maximize the expected reward over completions and we
[00:27:49] expected reward over completions and we have a k constraint to the reference
[00:27:52] have a k constraint to the reference distribution Um this can for now we're
[00:27:54] distribution Um this can for now we're just assume it's any reward function um
[00:27:56] just assume it's any reward function um the math is going to hold for any
[00:27:58] the math is going to hold for any function but in general it's the Learned
[00:27:59] function but in general it's the Learned reward function now I don't know if this
[00:28:02] reward function now I don't know if this was covered in the class or not but like
[00:28:04] was covered in the class or not but like it turns out that this equation or this
[00:28:06] it turns out that this equation or this problem has a closed form solution
[00:28:10] problem has a closed form solution um okay um great I'm not going to derive
[00:28:13] um okay um great I'm not going to derive it maybe I'll leave it as an exercise
[00:28:14] it maybe I'll leave it as an exercise for people to
[00:28:16] for people to like um but it's a fun derivation and
[00:28:18] like um but it's a fun derivation and it's not too hard so hopefully you
[00:28:19] it's not too hard so hopefully you should like um find the time to do it um
[00:28:23] should like um find the time to do it um but really like if you've ever heard of
[00:28:24] but really like if you've ever heard of like boltzman distribution or something
[00:28:26] like boltzman distribution or something of this form this is really just atate
[00:28:29] of this form this is really just atate um and this is not what we're
[00:28:30] um and this is not what we're contributing this is a known result for
[00:28:32] contributing this is a known result for a while and it's very intuitive like it
[00:28:34] a while and it's very intuitive like it might look scary for a second but it's
[00:28:35] might look scary for a second but it's really what it's saying is that we had
[00:28:37] really what it's saying is that we had the reference distribution that we
[00:28:38] the reference distribution that we started with and we had some reward
[00:28:39] started with and we had some reward function and really what we're doing is
[00:28:41] function and really what we're doing is we're upgrading the responses by the
[00:28:43] we're upgrading the responses by the exponentiated reward so things which
[00:28:45] exponentiated reward so things which have a higher reward will have a higher
[00:28:47] have a higher reward will have a higher probability according to the
[00:28:48] probability according to the exponentiated reward now if you just
[00:28:51] exponentiated reward now if you just look at this this is very simple but
[00:28:52] look at this this is very simple but this won't be a probability distribution
[00:28:54] this won't be a probability distribution and the thing on the left hand side is a
[00:28:55] and the thing on the left hand side is a probability distribution so we normalize
[00:28:58] probability distribution so we normalize it by this partition function which is
[00:28:59] it by this partition function which is the
[00:29:00] the zfx um think of it as just like summing
[00:29:03] zfx um think of it as just like summing over every completion for a given
[00:29:05] over every completion for a given question
[00:29:06] question X um now you can imagine that that's a
[00:29:09] X um now you can imagine that that's a very very intractable quantity like if I
[00:29:11] very very intractable quantity like if I start Computing every sentence and try
[00:29:14] start Computing every sentence and try to measure the probability and then
[00:29:15] to measure the probability and then multiply it by an exponential or what
[00:29:17] multiply it by an exponential or what that's just basically not tractable so
[00:29:20] that's just basically not tractable so this equation by itself is not very
[00:29:24] this equation by itself is not very useful um and yeah I went over this this
[00:29:27] useful um and yeah I went over this this is exactly the definition of the
[00:29:28] is exactly the definition of the partition function we're summing over
[00:29:30] partition function we're summing over every response y uh the PF is a
[00:29:33] every response y uh the PF is a distribution we started with and the
[00:29:34] distribution we started with and the exponentiated reward and beta is the
[00:29:37] exponentiated reward and beta is the temperature term trading of the reward
[00:29:39] temperature term trading of the reward and the KL uh
[00:29:41] and the KL uh constraint so this is
[00:29:44] constraint so this is intractable we'll we'll hold on to the
[00:29:46] intractable we'll we'll hold on to the partition function for a second and
[00:29:47] partition function for a second and we'll see what happens to it but really
[00:29:50] we'll see what happens to it but really this result is a relationship between P
[00:29:51] this result is a relationship between P star and the reward R but now we can do
[00:29:53] star and the reward R but now we can do a little bit of algebra and Shuffle it
[00:29:55] a little bit of algebra and Shuffle it around and rewrite the reward in ter
[00:29:58] around and rewrite the reward in ter terms of the optimal policy itself um so
[00:30:01] terms of the optimal policy itself um so what does this equation say we're
[00:30:03] what does this equation say we're writing the reward in terms of the beta
[00:30:05] writing the reward in terms of the beta log ratio where the ratio is between the
[00:30:07] log ratio where the ratio is between the optimal policy Pi star and the reference
[00:30:10] optimal policy Pi star and the reference distribution we started with and then
[00:30:11] distribution we started with and then there's this PES key partition function
[00:30:13] there's this PES key partition function that just continues to stay on there um
[00:30:16] that just continues to stay on there um I'm going to try to like develop some
[00:30:18] I'm going to try to like develop some intuition here like this is important is
[00:30:19] intuition here like this is important is that what it is saying is that if an
[00:30:21] that what it is saying is that if an optimal policy way puts more probability
[00:30:23] optimal policy way puts more probability distribution on a response than a
[00:30:25] distribution on a response than a reference distribution the reward is
[00:30:27] reference distribution the reward is higher
[00:30:29] higher does that come through and like if a
[00:30:30] does that come through and like if a probability is lower then the reward is
[00:30:32] probability is lower then the reward is lower and this is intuitively correct
[00:30:34] lower and this is intuitively correct right this is how a reward function
[00:30:36] right this is how a reward function should also be if a response is
[00:30:38] should also be if a response is preferred then it should have a higher
[00:30:39] preferred then it should have a higher probability and a higher reward so this
[00:30:41] probability and a higher reward so this is you can see like we are starting to
[00:30:42] is you can see like we are starting to develop a relationship between a reward
[00:30:44] develop a relationship between a reward function and the probability
[00:30:45] function and the probability distribution
[00:30:47] itself cool
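Rearranging the closed-form solution for the reward gives the expression being described: the beta-scaled log ratio plus a term that depends only on x.

```latex
% Invert the closed form to express the reward via the optimal policy
r(x, y) \;=\; \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
\;+\; \beta \log Z(x)
```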
[00:30:50] itself cool so but the main problem here is that
[00:30:53] so but the main problem here is that this is by itself not very tractable
[00:30:54] this is by itself not very tractable because the partition function as we
[00:30:56] because the partition function as we said is like just completely intractable
[00:30:58] said is like just completely intractable so maybe let's go back to what we were
[00:31:00] so maybe let's go back to what we were doing in the rlf
[00:31:02] doing in the rlf process um the high level idea is that
[00:31:05] process um the high level idea is that like we mean uh we are we have a loss
[00:31:08] like we mean uh we are we have a loss function on reward functions and we're
[00:31:09] function on reward functions and we're going to use this transformation and
[00:31:11] going to use this transformation and once we plug it all together we're going
[00:31:12] once we plug it all together we're going to get a loss function on the policies
[00:31:14] to get a loss function on the policies themselves um and if we go back to our L
[00:31:17] themselves um and if we go back to our L function for the reward
[00:31:18] function for the reward bit if you remember the logit is the
[00:31:21] bit if you remember the logit is the difference between the rewards of the
[00:31:23] difference between the rewards of the preferred response and the dispreferred
[00:31:25] preferred response and the dispreferred response this is what Eric covered just
[00:31:27] response this is what Eric covered just a little a little bit back now if you
[00:31:29] a little a little bit back now if you look at this this difference is not
[00:31:31] look at this this difference is not going to depend on the input itself or
[00:31:34] going to depend on the input itself or the partition function
[00:31:36] the partition function itself if we look at it explicitly so
[00:31:38] itself if we look at it explicitly so this is exactly what is going to happen
[00:31:40] this is exactly what is going to happen here is we're going to take the equation
[00:31:42] here is we're going to take the equation that we took earlier Express the reward
[00:31:43] that we took earlier Express the reward in terms of the policy we're going to
[00:31:45] in terms of the policy we're going to learn and we're going to plug it into
[00:31:47] learn and we're going to plug it into the reward modeling loss and once you
[00:31:52] the reward modeling loss and once you compute this difference this partition
[00:31:54] compute this difference this partition function is going to cancel out because
[00:31:55] function is going to cancel out because it only depends on the input X and it
[00:31:58] it only depends on the input X and it does not depend on the output we're
[00:32:01] does not depend on the output we're Computing it over
[00:32:04] Computing it over and when we do this we get our final
[00:32:07] and when we do this we get our final beautiful loss function which we call um
[00:32:10] beautiful loss function which we call um the DPO loss function and really is just
[00:32:12] the DPO loss function and really is just a reward modeling loss and let's take a
[00:32:15] a reward modeling loss and let's take a second to like see like what it is doing
[00:32:17] second to like see like what it is doing what we're trying to do is is like we
[00:32:19] what we're trying to do is is like we have a preferred response YW and a dis
[00:32:21] have a preferred response YW and a dis preferred response y l for a given
[00:32:23] preferred response y l for a given question X and we're trying to maximize
[00:32:27] question X and we're trying to maximize this
[00:32:28] this difference um that's how we would
[00:32:30] difference um that's how we would minimize this loss and maximizing this
[00:32:32] minimize this loss and maximizing this difference means that our log
[00:32:34] difference means that our log probability on the preferred response
[00:32:36] probability on the preferred response should be higher than the probability
[00:32:38] should be higher than the probability that the reference distribution puts on
[00:32:40] that the reference distribution puts on it and the log probability of the dis
[00:32:42] it and the log probability of the dis preferred response should be lower than
[00:32:44] preferred response should be lower than the probability that the reference
[00:32:45] the probability that the reference distribution puts on
[00:32:47] distribution puts on it does this make intuitive sense why
[00:32:50] it does this make intuitive sense why this would like sort of change the
[00:32:51] this would like sort of change the probabilities in the right
[00:32:56] way
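As a minimal numerical sketch of the per-example DPO loss (pure Python; the log probabilities below are made-up toy numbers, and a real implementation would sum token log-probs under the policy and the frozen reference model):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * log-ratio margin).

    logp_*     : log prob of the chosen/rejected response under the policy
    ref_logp_* : log prob of the same responses under the frozen reference
    """
    # Implicit rewards are beta * log(pi / pi_ref); Z(x) has already cancelled
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Bradley-Terry negative log-likelihood of preferring y_w over y_l
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy numbers: the policy prefers y_w more strongly than the reference does,
# so the margin is positive and the loss falls below log(2) (the zero-margin value).
loss = dpo_loss(logp_w=-12.0, logp_l=-15.0, ref_logp_w=-13.0, ref_logp_l=-14.0)
print(loss)
```

Minimizing this pushes the policy's log-ratio on the chosen response up and on the rejected response down, exactly the behavior described above.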
[00:32:59] Cool. And yeah, the log partition function basically just cancels out. Eric, did you want to add anything?

Eric: You can think of this as a benefit of the fact that you can shift the rewards by a constant. Often that's considered a bad thing, but here they're leveraging it: you can just cancel out the partition function.

Archit: All right, I'll hand it to Rafael and he can go over the results.
[00:33:24] all right can you guys hear me so this is sort of like the first um
[00:33:27] me so this is sort of like the first um sort of control experiment we we run on
[00:33:29] sort of control experiment we we run on this project and basically we took this
[00:33:31] this project and basically we took this IMDb reviews data set which is sort of
[00:33:34] IMDb reviews data set which is sort of like movie reviews and um we wanted to
[00:33:37] like movie reviews and um we wanted to train the model to generate positive
[00:33:38] train the model to generate positive movie reviews so we use the
[00:33:40] movie reviews so we use the pre-trained sentiment classifier as a go
[00:33:43] pre-trained sentiment classifier as a go reward function this case we do know we
[00:33:44] reward function this case we do know we have access to the underlying um reward
[00:33:47] have access to the underlying um reward score and then we generated a bunch of
[00:33:49] score and then we generated a bunch of data from the base model which was
[00:33:51] data from the base model which was pretty trained with sft ranked it based
[00:33:53] pretty trained with sft ranked it based on the sentiment classifier and create
[00:33:55] on the sentiment classifier and create like synthetic preferences and then
[00:33:58] like synthetic preferences and then basically we just took a bunch of
[00:33:59] basically we just took a bunch of baselines um across that data and we
[00:34:02] baselines um across that data and we fundamentally were interested in
[00:34:04] fundamentally were interested in comparing to what degree is DPO an
[00:34:07] comparing to what degree is DPO an actual good Optimizer of the core
[00:34:09] actual good Optimizer of the core objective essentially there's this like
[00:34:11] objective essentially there's this like reward k tradeoff um underlying all of
[00:34:14] reward k tradeoff um underlying all of this and we basically wanted to see how
[00:34:17] this and we basically wanted to see how good like how good of a parto curve can
[00:34:19] good like how good of a parto curve can we um extract from that we kind of see
[00:34:22] we um extract from that we kind of see essentially DPO sort of the optimal
[00:34:25] essentially DPO sort of the optimal trade-off here in this simple Road
[00:34:27] trade-off here in this simple Road problem
[00:34:28] problem curve oh I see yeah um well Paro curve
[00:34:31] curve oh I see yeah um well Paro curve is is a general Concept in in economics
[00:34:33] is is a general Concept in in economics and sort of decision analysis and things
[00:34:35] and sort of decision analysis and things like that where we have tradeoffs uh
[00:34:38] like that where we have tradeoffs uh between several things for example this
[00:34:40] between several things for example this case reward versus KL and we're
[00:34:42] case reward versus KL and we're interested in the optimal tradeoff that
[00:34:44] interested in the optimal tradeoff that we can get and we say for example one
[00:34:46] we can get and we say for example one method par dominates another method if
[00:34:49] method par dominates another method if essentially we can get something get
[00:34:52] essentially we can get something get more without giving up on something else
[00:34:54] more without giving up on something else so in this case for the same K we can
[00:34:56] so in this case for the same K we can get more reward using DPO than than
[00:34:59] get more reward using DPO than than another another method and we actually
[00:35:02] another another method and we actually played quite a bit with the baselines
[00:35:04] played quite a bit with the baselines here I mean we probably I probably spent
[00:35:05] here I mean we probably I probably spent like a couple months trying to push
[00:35:07] like a couple months trying to push these po numbers and um essentially it
[00:35:11] these po numbers and um essentially it works like poo kind of works and you get
[00:35:13] works like poo kind of works and you get some results there but um it's it can
[00:35:17] some results there but um it's it can quite catch up with with the DP
[00:35:19] quite catch up with with the DP objective and what I kind of want to
[00:35:21] objective and what I kind of want to include this curve here in this in this
[00:35:23] include this curve here in this in this talk is um essentially I think even now
[00:35:26] talk is um essentially I think even now basically almost all of the RF paper
[00:35:28] basically almost all of the RF paper that you read are actually doing um
[00:35:31] that you read are actually doing um evaluation potentially wrong because you
[00:35:34] evaluation potentially wrong because you go read these papers and you kind of get
[00:35:35] go read these papers and you kind of get the win rates or you got like get the
[00:35:37] the win rates or you got like get the comparisons Etc but none of them really
[00:35:39] comparisons Etc but none of them really like plot this curves you for none of
[00:35:41] like plot this curves you for none of them you really don't know where along
[00:35:43] them you really don't know where along this like tradeoff you are and that that
[00:35:46] this like tradeoff you are and that that number in of itself doesn't really tell
[00:35:47] number in of itself doesn't really tell you much uh
[00:35:49] you much uh because um you know it's a question of
[00:35:51] because um you know it's a question of optimization and you don't know how well
[00:35:54] optimization and you don't know how well that optimization worked or didn't work
[00:35:56] that optimization worked or didn't work just by extracting one position position
[00:35:57] just by extracting one position position on this curve so I think that's kind an
[00:36:00] on this curve so I think that's kind an important point that uh the community is
[00:36:02] important point that uh the community is still not quite making as much but but I
[00:36:05] still not quite making as much but but I think when any of these new things come
[00:36:07] think when any of these new things come up I think this sort of the the
[00:36:08] up I think this sort of the the fundamental question that should be
[00:36:10] fundamental question that should be asked do you think it was because the
[00:36:12] asked do you think it was because the reward model is is specif or do you
[00:36:14] reward model is is specif or do you think it's like which part do you think
[00:36:16] think it's like which part do you think it's where is in this Cas do you think
[00:36:20] it's where is in this Cas do you think model that good or the P
[00:36:22] model that good or the P optimizations so basically if you look
[00:36:24] optimizations so basically if you look at the purple thing that's kind of like
[00:36:27] at the purple thing that's kind of like the how the Box po uh TRL um and if you
[00:36:30] the how the Box po uh TRL um and if you look at some of these our implementation
[00:36:33] look at some of these our implementation things to do a lot better
[00:36:36] things to do a lot better um so the core difference there and
[00:36:38] um so the core difference there and surprising to me people have Wroten like
[00:36:40] surprising to me people have Wroten like numerous papers about the same thing now
[00:36:42] numerous papers about the same thing now and to me was sort of a footnote how we
[00:36:44] and to me was sort of a footnote how we got this to work better we just sampled
[00:36:45] got this to work better we just sampled more answers per uh per prompt that was
[00:36:48] more answers per uh per prompt that was a question of variance right um and in
[00:36:51] a question of variance right um and in the RF setting the variance problem is
[00:36:53] the RF setting the variance problem is even higher because of the constant
[00:36:55] even higher because of the constant shift so we actually did some analysis
[00:36:57] shift so we actually did some analysis around this one writing in the paper
[00:36:58] around this one writing in the paper about like 60% of the reward scores are
[00:37:00] about like 60% of the reward scores are like noise essentially so signal to
[00:37:02] like noise essentially so signal to noise in like regular PO is about 40%
[00:37:05] noise in like regular PO is about 40% and when you mixing that like in the
[00:37:06] and when you mixing that like in the whole process like the variance like
[00:37:08] whole process like the variance like completely explode so it's like very
[00:37:09] completely explode so it's like very kind of like sparse signal to learn from
[00:37:11] kind of like sparse signal to learn from there yeah sorry um how do we I'm just
[00:37:14] there yeah sorry um how do we I'm just sorry this is just picture question but
[00:37:16] sorry this is just picture question but like I'm having a hard time knowing how
[00:37:18] like I'm having a hard time knowing how do we read the graph like like is it
[00:37:21] do we read the graph like like is it better to have higher reward or like
[00:37:22] better to have higher reward or like what is this graph actually telling us
[00:37:23] what is this graph actually telling us about each
[00:37:24] about each metric yeah so I mean obviously it's
[00:37:26] metric yeah so I mean obviously it's better to have higher right this is the
[00:37:28] better to have higher right this is the cor concept of reinforcement learning
[00:37:29] cor concept of reinforcement learning you want to maximize reward but
[00:37:31] you want to maximize reward but essentially from thef setup we maximize
[00:37:34] essentially from thef setup we maximize reward subject to a k constraint subject
[00:37:36] reward subject to a k constraint subject to some K cost what this graph is saying
[00:37:38] to some K cost what this graph is saying like for basically is plotting me for a
[00:37:40] like for basically is plotting me for a level of K using each of these baselines
[00:37:43] level of K using each of these baselines how much reward can I get and you want
[00:37:45] how much reward can I get and you want that to be basically said PR optimal in
[00:37:48] that to be basically said PR optimal in the sense that like you want to get the
[00:37:49] the sense that like you want to get the most reward for a certain level of K and
[00:37:51] most reward for a certain level of K and the other point I made is basically
[00:37:52] the other point I made is basically people compare only like win rates or
[00:37:54] people compare only like win rates or essentially reward but they don't tell
[00:37:55] essentially reward but they don't tell you at what K so you can like compare
[00:37:58] you at what K so you can like compare for example you know oops sorry you can
[00:38:00] for example you know oops sorry you can compare this DPO point to this PO point
[00:38:04] compare this DPO point to this PO point and like this PO point will will appear
[00:38:06] and like this PO point will will appear better right because it has more reward
[00:38:08] better right because it has more reward but fundamentally as an optimization
[00:38:10] but fundamentally as an optimization algorithm that's not the
[00:38:13] case this model is interpretable or you
[00:38:17] case this model is interpretable or you can canot can you explain what's going
[00:38:18] can canot can you explain what's going on under the hood or just optimization
[00:38:22] on under the hood or just optimization M what do you mean by inter so if you
[00:38:25] M what do you mean by inter so if you provide feedback right the behavior will
[00:38:26] provide feedback right the behavior will change can you explain the whole process
[00:38:29] change can you explain the whole process or you cannot explain the whole
[00:38:31] or you cannot explain the whole situation if you put in noisy data uh
[00:38:34] situation if you put in noisy data uh for example can you deug it explain the
[00:38:37] for example can you deug it explain the whole process yeah I mean I think that's
[00:38:40] whole process yeah I mean I think that's more complicated question than than it
[00:38:41] more complicated question than than it seems on on the surface like there's Co
[00:38:43] seems on on the surface like there's Co lines of research on basically I with
[00:38:45] lines of research on basically I with noisy feedback I multimod feedback
[00:38:48] noisy feedback I multimod feedback plurality of alignment so it's not quite
[00:38:50] plurality of alignment so it's not quite like answer I can give like in one
[00:38:51] like answer I can give like in one sentence
[00:38:53] sentence right it's a lot yeah expl again why we
[00:38:56] right it's a lot yeah expl again why we have a bad signal to noise ratio in
[00:38:58] have a bad signal to noise ratio in normal
[00:39:00] normal PP uh it's a it's a long question
[00:39:02] PP uh it's a it's a long question there's like a whole section in the
[00:39:03] there's like a whole section in the paper it's about half a page it's not
[00:39:05] paper it's about half a page it's not like one like s kind of line yeah but
[00:39:09] like one like s kind of line yeah but essentially by samply more answers per
[00:39:12] essentially by samply more answers per response kind of like goes
[00:39:15] response kind of like goes away um can you explain what the reward
[00:39:17] away um can you explain what the reward means for sentiment generation yeah it's
[00:39:20] means for sentiment generation yeah it's basically the sentiment of the sentence
[00:39:22] basically the sentiment of the sentence and one is very good sentiment zero is
[00:39:24] and one is very good sentiment zero is very bad
[00:39:26] very bad sentiment so like
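To make the reward concrete: for the sentiment task described above, the reward is just a sentiment score in [0, 1]. The sketch below is a toy; `classify_positive` stands in for a real sentiment classifier (e.g. a fine-tuned classification head) and its word lists are purely illustrative.

```python
# Toy stand-in for a sentiment classifier's P(positive | text).
# In the real experiment this would be a learned model; the word
# lists here are illustrative only.
def classify_positive(text: str) -> float:
    positive_words = {"great", "good", "love", "wonderful"}
    negative_words = {"bad", "terrible", "hate", "awful"}
    words = text.lower().split()
    pos = sum(w in positive_words for w in words)
    neg = sum(w in negative_words for w in words)
    if pos + neg == 0:
        return 0.5  # neutral text gets a middling score
    return pos / (pos + neg)

def sentiment_reward(text: str) -> float:
    # Reward = classifier confidence: 1.0 is very positive, 0.0 very negative.
    return classify_positive(text)
```

Any scalar scorer in [0, 1] would play the same role; the RLHF machinery only sees the number.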
[00:39:29] sentiment so like move this on hopefully a One sensor um
[00:39:32] move this on hopefully a One sensor um just to for our appreciation of the
[00:39:33] just to for our appreciation of the graph about what kale Divergence
[00:39:36] graph about what kale Divergence trade-off would you choose in a real
[00:39:37] trade-off would you choose in a real model here like is is it that like you
[00:39:40] model here like is is it that like you might choose something like 10 so we're
[00:39:42] might choose something like 10 so we're really in that region or is it somewhere
[00:39:44] really in that region or is it somewhere much farther it's very much moral and
[00:39:46] much farther it's very much moral and data dependent okay yeah I mean this
[00:39:48] data dependent okay yeah I mean this graph means absolutely nothing in a
[00:39:50] graph means absolutely nothing in a summarization set got I think maybe like
[00:39:53] summarization set got I think maybe like just to like I think it's very hard to
[00:39:56] just to like I think it's very hard to choose a specific Gale but usually what
[00:39:58] choose a specific Gale but usually what people do is measure performance on
[00:39:59] people do is measure performance on other benchmarks yeah they care about
[00:40:01] other benchmarks yeah they care about and usually they find that if the K is
[00:40:04] and usually they find that if the K is smaller the performance on other
[00:40:07] smaller the performance on other benchmarks is preserved so you typically
[00:40:09] benchmarks is preserved so you typically try to like air on the side of like
[00:40:11] try to like air on the side of like lower K and yeah I mean there's no
[00:40:14] lower K and yeah I mean there's no specific number but like wherever you
[00:40:16] specific number but like wherever you find like your MML performance is great
[00:40:19] find like your MML performance is great that's where you like
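As a side note on those x-axes: the KL numbers in plots like this are typically Monte Carlo estimates rather than exact quantities. A minimal sketch of such an estimator (my construction, not from the lecture): sample completions from the current policy, score each under both the policy and the frozen reference model, and average the log-ratio.

```python
def kl_estimate(logp_policy, logp_ref):
    """Monte Carlo estimate of KL(pi || pi_ref).

    Both arguments are per-sample sequence log-probabilities for the
    SAME completions, which must have been sampled from pi; the KL is
    then the mean of log pi(y|x) - log pi_ref(y|x).
    """
    assert len(logp_policy) == len(logp_ref) and logp_policy
    return sum(p - r for p, r in zip(logp_policy, logp_ref)) / len(logp_policy)
```

In practice the per-sequence log-probs come from summing token log-probs under each model.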
[00:40:21] that's where you like stop yeah in time um we had a bunch of
[00:40:24] stop yeah in time um we had a bunch of other experiments in the paper um which
[00:40:26] other experiments in the paper um which kind of like show up B basically DPO
[00:40:28] kind of like show up B basically DPO works but I think really really like the
[00:40:29] works but I think really really like the Testament to to the algorithm is it's
[00:40:31] Testament to to the algorithm is it's kind of be like widely adopted B the
[00:40:33] kind of be like widely adopted B the community and in larger scale um this
[00:40:37] community and in larger scale um this was maybe a little updated haven't like
[00:40:38] was maybe a little updated haven't like looked at this recently but a couple of
[00:40:40] looked at this recently but a couple of months ago this was basically the open a
[00:40:42] months ago this was basically the open a Leaderboard on on huging face basically
[00:40:44] Leaderboard on on huging face basically leaderboard of open language models and
[00:40:47] leaderboard of open language models and I think nine out of the top 10 models
[00:40:49] I think nine out of the top 10 models were trained with DPO and is kind of the
[00:40:51] were trained with DPO and is kind of the open source
[00:40:52] open source community and since then you know even
[00:40:55] community and since then you know even institutions have taken this up in
[00:40:57] institutions have taken this up in particular this is taken from the mistro
[00:40:58] particular this is taken from the mistro paper that basically they used um DPO
[00:41:01] paper that basically they used um DPO exclusively as their rhf algorithm um
[00:41:04] exclusively as their rhf algorithm um and as you kind of know basically you
[00:41:05] and as you kind of know basically you know some of the the mistro strong
[00:41:06] know some of the the mistro strong mistro models are somewhat competitive
[00:41:08] mistro models are somewhat competitive with gbd4 for example so we do
[00:41:10] with gbd4 for example so we do definitely have evidence that this works
[00:41:11] definitely have evidence that this works at very large scales and for basically
[00:41:13] at very large scales and for basically like last week we now know even llama 3
[00:41:16] like last week we now know even llama 3 is using DPO as part of its optimization
[00:41:18] is using DPO as part of its optimization pipel planine interesting enough it's
[00:41:19] pipel planine interesting enough it's actually using it with mixed with other
[00:41:21] actually using it with mixed with other things so basically the the tldr is this
[00:41:24] things so basically the the tldr is this kind of algorithm sort of works um and
[00:41:27] kind of algorithm sort of works um and you know we're seeing it kind of like
[00:41:29] you know we're seeing it kind of like taken up more and being used for more
[00:41:31] taken up more and being used for more and more
[00:41:32] This is where the paper ends. Since then there's been a ton of other work, by us and by other people, and I thought a lot about what to cover from those works. For example, I heard you've learned maximum-entropy inverse reinforcement learning; you can actually derive DPO as an inverse Q-learning algorithm in a max-entropy RL setting. It's obviously not trivial, but it is possible, and that paper is called "Your Language Model Is Secretly a Q-Function". I've also heard you're going to use RLHF on control problems. I haven't talked with the TAs, but actually DPO does not work for control under the classical formulation; you need a formulation of preferences under regret rather than under reward functions, so I'm hoping they've taken that into account. That's a whole separate other line of work.
[00:42:26] into account that's like whole separate other work but um I guess what I decided
[00:42:29] other work but um I guess what I decided to kind of like focus on is this sort of
[00:42:31] to kind of like focus on is this sort of DPO versus poo debate which is going to
[00:42:32] DPO versus poo debate which is going to be like raging a lot on in the community
[00:42:35] be like raging a lot on in the community in industry very much on Twitter um and
[00:42:39] in industry very much on Twitter um and kind of like want to give you my
[00:42:40] kind of like want to give you my perspective for this because and I don't
[00:42:42] perspective for this because and I don't want to sound like entric but I think
[00:42:44] want to sound like entric but I think pretty much the entire debate is wrong
[00:42:47] pretty much the entire debate is wrong uh I think there's like let's skip that
[00:42:49] uh I think there's like let's skip that from now um but basically there's two
[00:42:53] from now um but basically there's two things DPO fits this implicit reward
[00:42:55] things DPO fits this implicit reward function which Arch show you can think
[00:42:57] function which Arch show you can think about this as fitting a particular
[00:43:00] about this as fitting a particular reward model and there are two questions
[00:43:02] reward model and there are two questions there the first question is is this
[00:43:03] there the first question is is this implicit reward function as good as an
[00:43:05] implicit reward function as good as an explicitly parameterized reward
[00:43:07] explicitly parameterized reward function a second question is like for
[00:43:09] function a second question is like for this implicit reward model the DPO fits
[00:43:11] this implicit reward model the DPO fits you can analytically extract the optimal
[00:43:14] you can analytically extract the optimal policy so basically what I can do is you
[00:43:17] policy so basically what I can do is you know I can get the DPO policy or I can
[00:43:20] know I can get the DPO policy or I can take the DPO implicit War function put
[00:43:22] take the DPO implicit War function put it into Po and run that optimization
[00:43:24] it into Po and run that optimization Loop under perfect optimization
[00:43:27] Loop under perfect optimization absolutely perfect optimization I'll get
[00:43:28] absolutely perfect optimization I'll get back the DPO policy directly if my PO is
[00:43:31] back the DPO policy directly if my PO is perfect but that is RAR the case with
[00:43:34] perfect but that is RAR the case with any sort of machine learning
[00:43:35] any sort of machine learning optimization so get something that's
[00:43:36] optimization so get something that's like suboptimal and does that like
[00:43:38] like suboptimal and does that like suboptimality induce some sort of
[00:43:40] suboptimality induce some sort of regularization effect that makes my
[00:43:41] regularization effect that makes my model like stronger um so these are kind
[00:43:44] model like stronger um so these are kind of the two big questions I think in this
[00:43:47] of the two big questions I think in this debate um so kind of they've been kind
[00:43:50] debate um so kind of they've been kind of tackled recently um there's this
[00:43:52] of tackled recently um there's this thing callede came out reward bench
[00:43:54] thing callede came out reward bench which is a large scale evaluation of
[00:43:56] which is a large scale evaluation of reward models and as DPO is both the
[00:43:59] reward models and as DPO is both the generative and a reward model
[00:44:01] generative and a reward model discriminative model we can
[00:44:02] discriminative model we can evaluate DPO models as rewards and
[00:44:06] evaluate DPO models as rewards and basically on several scores here we have
[00:44:09] basically on several scores here we have this chat safety reasoning type of task
[00:44:12] this chat safety reasoning type of task so this for example shows uh scoring
[00:44:15] so this for example shows uh scoring reward scoring preferences based on like
[00:44:18] reward scoring preferences based on like dialogue and chat uh you can see the top
[00:44:20] dialogue and chat uh you can see the top four models are all DPO models and
[00:44:22] four models are all DPO models and outperform for example proprietary
[00:44:24] outperform for example proprietary models much bigger and sort of close
[00:44:26] models much bigger and sort of close Source ones
[00:44:28] Source ones and on reasoning the Top Model is this
[00:44:30] and on reasoning the Top Model is this proprietary coher model uh but the next
[00:44:33] proprietary coher model uh but the next like five or old DPO
[00:44:35] like five or old DPO models so and and you know obviously
[00:44:38] models so and and you know obviously like there's always more work to be done
[00:44:40] like there's always more work to be done more ADV to be done but in my mind this
[00:44:42] more ADV to be done but in my mind this sort of work kind of like solidified
[00:44:44] sort of work kind of like solidified this that the DPO Poli reward is about
[00:44:46] this that the DPO Poli reward is about as good as like the classic AR reward
[00:44:48] as good as like the classic AR reward like there's not you know we're not
[00:44:48] like there's not you know we're not losing generality we're not losing
[00:44:50] losing generality we're not losing capability for considering this implicit
[00:44:52] capability for considering this implicit model versus an explicit parameterized
[00:44:54] model versus an explicit parameterized one so the other big question is then
[00:44:56] one so the other big question is then does using a weaker Optimizer so po
[00:44:59] does using a weaker Optimizer so po provide a better solution gives you some
[00:45:00] provide a better solution gives you some sort of
[00:45:02] regularization um and basically started
[00:45:07] regularization um and basically started to look more into this recently some of
[00:45:10] to look more into this recently some of the first feedback we got on DP was like
[00:45:13] the first feedback we got on DP was like someone tried to train like a very large
[00:45:14] someone tried to train like a very large scale DPO model and what they said was
[00:45:17] scale DPO model and what they said was like oh you know it does well and then
[00:45:19] like oh you know it does well and then sort of like it becomes more and more of
[00:45:21] sort of like it becomes more and more of a Bose and then like starts speaking
[00:45:22] a Bose and then like starts speaking more and more and at some point like
[00:45:23] more and more and at some point like reaches a point where just one stop and
[00:45:25] reaches a point where just one stop and just kind of like goes off the rail
[00:45:27] just kind of like goes off the rail becomes like just can't stop talking and
[00:45:30] becomes like just can't stop talking and and we kind of looked at this on two
[00:45:31] and we kind of looked at this on two data sets one on summarization one on
[00:45:33] data sets one on summarization one on dialogue and what you can see how here
[00:45:35] dialogue and what you can see how here is like the distribution of of lengths
[00:45:38] is like the distribution of of lengths of answers and the blue distribution is
[00:45:41] of answers and the blue distribution is the preferred answer and the red
[00:45:42] the preferred answer and the red distribution is the dis preferred answer
[00:45:44] distribution is the dis preferred answer so you can see there's like a very
[00:45:46] so you can see there's like a very slight bias towards longer responses
[00:45:48] slight bias towards longer responses like people have biases they prefer more
[00:45:50] like people have biases they prefer more like verbos sers they prefer more like
[00:45:52] like verbos sers they prefer more like verbos more like longer summaries etc
[00:45:55] verbos more like longer summaries etc etc but once we train with DPO under
[00:45:58] etc but once we train with DPO under every column is like a separate level of
[00:46:00] every column is like a separate level of regularization under any level of
[00:46:02] regularization under any level of regularization this is blown way out of
[00:46:04] regularization this is blown way out of proportion it's not only DPO is
[00:46:07] proportion it's not only DPO is allocating probability Mass within the
[00:46:09] allocating probability Mass within the distribution it's pushing basically like
[00:46:11] distribution it's pushing basically like this green histogram is a DPO length
[00:46:14] this green histogram is a DPO length it's pushing things way out of
[00:46:15] it's pushing things way out of distribution and you see like now we
[00:46:17] distribution and you see like now we have answers which are significantly
[00:46:19] have answers which are significantly outside of the distribution that's
[00:46:20] outside of the distribution that's covered in our data
[00:46:21] So what is happening there? There's this concept of reward hacking; I don't know if you've covered reward hacking, but there's a very famous paper from OpenAI called "Scaling Laws for Reward Model Overoptimization". What they did there is essentially the sentiment experiment, but at a larger scale: they got some real human preferences, trained a very good, very strong reward model, then used that reward model to annotate some synthetic preferences, and then repeated the whole RLHF process on top of the synthetic preferences. And this is what they discovered. This graph is the same kind of graph I showed earlier for sentiment, except the x-axis is the KL constraint and the y-axis is reward. The dashed lines are the learned reward functions used in PPO, basically the expected reward from your model during training, and the solid lines are the actual gold reward models. So from a reinforcement-learning perspective it looks like the model is doing really well, it's maximizing reward quite a bit, but actually its quality is either stagnating or going down. This concept of reward hacking has become quite prominent since then, both for practical purposes and, for example, in the AI safety community, which is very worried about it: the whole paper-clipping thing, if you've heard about it, in the sense that the model can find ways to exploit these reward functions such that it thinks it's doing something good while it's actually doing something very bad. These things are well understood; this paper has something like 200 citations, and a ton of work has been done on mitigating them. The thinking there is: in classic RLHF I'm learning a reward function, I have a proxy reward, and I'm continuously querying that reward with new data, which might take it out of distribution, which might kick it off course, etc., so it's not surprising that this happens. But I think by and large the community has not yet realized that this happens in direct alignment as well, because (a) there's no proxy reward function, you're directly optimizing the model on the data, and (b) there's no new data, no synthetic data being sampled; it's all within the dataset.
[00:48:27] within the data set but what we have discovered ENT this is
[00:48:29] but what we have discovered ENT this is a new result that we are currently still
[00:48:30] a new result that we are currently still developing is that actually War hacking
[00:48:33] developing is that actually War hacking seems to be quite prominent in in DPO
[00:48:36] seems to be quite prominent in in DPO and actually all of the DPO varment
[00:48:37] and actually all of the DPO varment things like IPO and slick as well do
[00:48:39] things like IPO and slick as well do this if you've heard of
[00:48:41] this if you've heard of those um and actually might even be more
[00:48:43] those um and actually might even be more prominent than than in po because PO is
[00:48:47] prominent than than in po because PO is a weaker Optimizer so you have to like
[00:48:49] a weaker Optimizer so you have to like push really hard to like really hit
[00:48:52] push really hard to like really hit those tail of their War function uh but
[00:48:54] those tail of their War function uh but DPO gives you the exactly optimal anal
[00:48:56] DPO gives you the exactly optimal anal iCal function so in a sense it's sort of
[00:48:59] iCal function so in a sense it's sort of almost hacks in like in in an absolute
[00:49:02] almost hacks in like in in an absolute way um so yeah this is currently I think
[00:49:06] way um so yeah this is currently I think part of the dialogue and and the kind of
[00:49:08] part of the dialogue and and the kind of the research that the community is not
[00:49:10] the research that the community is not quite figuring out yet and you know
[00:49:12] quite figuring out yet and you know that's my goal to put these things out
[00:49:14] that's my goal to put these things out there that yes the same War hacking
[00:49:16] there that yes the same War hacking phenomena very surprisingly because it
[00:49:17] phenomena very surprisingly because it sort of goes against all the intuition
[00:49:19] sort of goes against all the intuition we've had from before happens in these
[00:49:21] we've had from before happens in these sort of algorithms as
[00:49:23] sort of algorithms as well um and
[00:49:27] well um and oh oh right right so it's kind of the
[00:49:29] oh oh right right so it's kind of the same type of plot you see on the left
[00:49:31] same type of plot you see on the left the x-axis a KO Divergence y- axis here
[00:49:33] the x-axis a KO Divergence y- axis here is GT4 win rate so basically judgments
[00:49:35] is GT4 win rate so basically judgments by gp4 and each checkpoint each data is
[00:49:38] by gp4 and each checkpoint each data is like a different checkpoint evaluated
[00:49:40] like a different checkpoint evaluated trained with DPO um and kind of similar
[00:49:42] trained with DPO um and kind of similar to before you see that basically it's
[00:49:45] to before you see that basically it's like different model sizes and and these
[00:49:46] like different model sizes and and these are different data but sort of what I'm
[00:49:48] are different data but sort of what I'm pointing out here is the pattern this
[00:49:50] pointing out here is the pattern this shape pattern um you kind of see like
[00:49:52] shape pattern um you kind of see like the more you train you sort of like the
[00:49:54] the more you train you sort of like the higher K you go actually your
[00:49:55] higher K you go actually your performance doesn't improve goes down so
[00:49:58] performance doesn't improve goes down so it's sort of the same reward hacking
[00:49:59] it's sort of the same reward hacking phenomena um the theory tells you this
[00:50:01] phenomena um the theory tells you this thing should be monotone like you give
[00:50:03] thing should be monotone like you give up some K you get some reward but that's
[00:50:05] up some K you get some reward but that's not the case um and kind of the point
[00:50:08] not the case um and kind of the point here is like this seems to be more
[00:50:09] here is like this seems to be more prevalent in this
[00:50:16] Technically the DPO reward function is just as good as any other reward function, but if you're optimizing it too hard you might be in this reward hacking regime, and this is where a PPO-style optimization could potentially be more stable or more beneficial, because being a weaker optimizer is essentially a form of regularization.
[00:50:35] weaker Optimizer essentially form of so yeah I think this is sort of like
[00:50:38] of so yeah I think this is sort of like where we are with this type of
[00:50:39] where we are with this type of algorithms right now um and I think this
[00:50:42] algorithms right now um and I think this kind of like exciting work to be to be
[00:50:44] kind of like exciting work to be to be done again um you know in conclusion
[00:50:46] done again um you know in conclusion kind of
[00:50:47] kind of like yeah I mean we saw all of these
[00:50:49] like yeah I mean we saw all of these things um but I think it's kind of like
[00:50:52] things um but I think it's kind of like interesting what the next steps are
[00:50:54] interesting what the next steps are think like a ton of work has gone B to
[00:50:57] think like a ton of work has gone B to making RF robust um as we basically
[00:51:00] making RF robust um as we basically we're showing that like these alignment
[00:51:02] we're showing that like these alignment algorithms are very prone to ro hacking
[00:51:04] algorithms are very prone to ro hacking as well so I think a lot of work will
[00:51:06] as well so I think a lot of work will need to be done to make direct L
[00:51:07] need to be done to make direct L algorithms robust as well uh there's a
[00:51:10] algorithms robust as well uh there's a lot more interest as as Professor Bruce
[00:51:12] lot more interest as as Professor Bruce you mentioned on now online fine tuning
[00:51:14] you mentioned on now online fine tuning algorithms how do we like elicit
[00:51:16] algorithms how do we like elicit preferences how do we actually like f
[00:51:17] preferences how do we actually like f tune these things efficiently there's
[00:51:20] tune these things efficiently there's been explosion of RF across modalities
[00:51:22] been explosion of RF across modalities not just language models we've done
[00:51:24] not just language models we've done Vision language models we've done
[00:51:25] Vision language models we've done diffusion models in particular stable
[00:51:27] diffusion models in particular stable diffusion 3 for example is also trained
[00:51:29] diffusion 3 for example is also trained with DPO we've done text to image
[00:51:31] with DPO we've done text to image there's text to video work being done um
[00:51:34] there's text to video work being done um potential like speech and musics our
[00:51:35] potential like speech and musics our next Frontier to be tackled in a couple
[00:51:38] next Frontier to be tackled in a couple weeks we'll be releasing a paper on
[00:51:40] weeks we'll be releasing a paper on protein synthesis with feedback um and
[00:51:44] protein synthesis with feedback um and actively working on things like robot
[00:51:45] actively working on things like robot safety for things like large scale
[00:51:47] safety for things like large scale robotics Foundation models we're trying
[00:51:49] robotics Foundation models we're trying to do multi turn interactions which fast
[00:51:51] to do multi turn interactions which fast car jef cannot do um and things like
[00:51:54] car jef cannot do um and things like agents to use um and and all those
[00:51:56] agents to use um and and all those things are all kind of like basically
[00:51:59] things are all kind of like basically things that in the pipeline and we're
[00:52:00] things that in the pipeline and we're looking to so I think there's kind of
[00:52:02] looking to so I think there's kind of like a lot of exciting things that that
[00:52:03] like a lot of exciting things that that are happening in this field and still
[00:52:04] are happening in this field and still like it's been on for a while but I
[00:52:06] like it's been on for a while but I think only now we're just starting to
[00:52:08] think only now we're just starting to get deeper into and store of like
[00:52:09] get deeper into and store of like understand a lot of the the finer points
[00:52:11] understand a lot of the the finer points of this
[00:52:13] of this I'm I'm sorry if we r a little bit over
[00:52:17] I'm I'm sorry if we r a little bit over time
[00:52:23] yeah over that's we got some time for
[00:52:25] yeah over that's we got some time for questions so
[00:52:30] Isn't reward hacking implicitly induced by the model itself?
[00:52:37] It's a finite-data type of issue, right? If you had uniform data coverage over everything, hacking would go away, but it's a finite-data thing.
[00:52:50] Because you have exponentiated ratios there in the reward formulation, and you're using that everywhere — because your model will try to maximize it, it will essentially try to skew that ratio — I'm wondering if you had some other reward function, then maybe...
[00:53:12] Yeah, it still happens. If you use a hinge objective it still happens; if you use a square-type objective it still happens. And basically, think about why this happens. You see your half-cheetah running — imagine your half-cheetah has a target speed of 10, and you see that running at speed 8 is better than running at speed 7, running at speed 9 is better than running at speed 8, etc. Then you think, well, probably running at speed 11 is better than speed 10, but you've never seen anything run at speed 11, so you just extrapolate away. It's basically like this picture: you think long things are better, so longer things are always better. Good question.
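The half-cheetah story can be made concrete with a toy sketch. All the numbers and the linear reward model here are illustrative assumptions, not from the lecture: a reward model fit only on in-distribution speeds learns "faster is better" and, when optimized over a wider range, confidently prefers speeds it has never seen.

```python
import numpy as np

rng = np.random.default_rng(0)

# True reward: peaks at the target speed of 10 and falls off beyond it.
def true_reward(speed):
    return -(speed - 10.0) ** 2

# Finite data only covers speeds the agent has actually seen (0..10),
# where "faster is better" genuinely holds.
speeds = rng.uniform(0.0, 10.0, size=50)
labels = true_reward(speeds)

# Learned reward model: a simple linear fit, so it slopes upward
# everywhere and happily extrapolates past the data.
w, b = np.polyfit(speeds, labels, 1)
def learned_reward(speed):
    return w * speed + b

# "Policy optimization": search over a wider range than the data covers.
candidates = np.linspace(0.0, 20.0, 201)
best = candidates[np.argmax(learned_reward(candidates))]
print(best, true_reward(best))  # picks speed 20.0, where the true reward is -100.0
```

Because the true reward is monotonically increasing on the data range, any reasonable fit slopes upward, and the argmax runs to the edge of whatever range the optimizer is allowed to search — the "longer is always better" failure in miniature.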
[00:53:57] It's kind of a niche question, but I'm wondering: what if, for a particular prompt, all of the samples aren't that great? Obviously whoever is ranking them has to rank all of them, and doesn't have any way of indicating that even the best sample isn't that great. I was wondering if there's a way to account for that — some sort of weighting that could be applied to the rankings that would indicate the rankings are more or less confident overall.
[00:54:26] No, I don't have one — but, I mean, feel free to interject — that's a great question. I think the general problem around this is almost the exploration problem in RL: if you never see good trajectories, what are you going to learn without them? I don't have any easy answers, frankly, but some things that work: there are other forms of feedback as well. This is comparative feedback, where you're comparing two things, but you can also give thumbs up / thumbs down, and then if all of the responses are just bad you can indicate that and optimize in a different way, such that you're down-weighting most of the responses. But yeah, this is a good open problem to look at.
[00:55:11] I think the exploration point is important too, in that one thing people ask a lot is, oh, how can DPO work? Because with PPO you get to sample from your policy during training, so you can explore, and that has to be helpful, right? DPO just learns from your fixed preference dataset and you're never sampling during training. But I think your question actually points out the fact that, in some sense, because we have this issue that we're optimizing only a proxy reward — we don't get to optimize the real reward — the important exploration is actually the exploration we do when we gather the data that we're getting preferences over, which we're going to learn our reward function from. Because if we do good exploration at policy training time, but we sample some great trajectory that our reward model doesn't correctly label as good, it doesn't help us. So yeah, in that sense it's basically an exploration problem.
[00:55:55] It's very important — that's why I think a sort of multi-turn, iterative process could really help.
[00:56:04] Yeah, do you think a similar idea could be applied for, say, a multi-step kind of reward, where you get a reward after multiple steps and you have a preference at the final step, but the reward function was explicitly comparing exactly two references?
[00:56:27] Can you repeat that? I didn't quite catch the question.
[00:56:30] I was saying, if you have a multi-step kind of reasoning process and a reward which comes at the end of that, would this idea apply?
[00:56:38] Yeah, it does work. As I said, you can think of this as a Q-learning problem, actually — that is, however, not trivial to show — but it does work. If you have a problem where you basically have a sparse reward at the end, the model does end up doing some sort of credit assignment on the intermediate tokens. If you think of this as a per-token MDP, you will end up with something that does something interesting for those intermediate steps. It's not doing explicit bootstrapping, obviously, but you do end up with some sort of credit assignment, and there are, I think, several results now showing that if you have sequence-level rewards, you can end up doing something interesting even though you don't have these intermediate rewards.
[00:57:27] You have a question?
[00:57:34] Oh yeah, can you go a few slides ahead, to where you talk about synthetic data? Can you just explain again what the difference is between the real and the synthetic, what they're doing in both?
[00:57:46] Yeah, it's sort of the same sentiment problem I was talking about before. They had real human data and trained a reward function on it. They want to be able to measure the real reward function, so they get this gold reward model, which is trained on those real human comparisons, and they generate data from their base model and rank that data using the gold reward function. Essentially they can query the gold reward function and know what the actual score of these synthetic generations is, so they can essentially create these ground truths. The reason we're getting reward hacking here is that we're not using the actual reward function; we're using this synthetic reward function. If you train any reward function with a finite amount of data — in the limit of infinite data you would probably not see this phenomenon, but you're training on finite data — there will be errors outside the distribution it was trained on. Some errors will skew positive, overestimating the reward, and some will underestimate it, but because you're optimizing against it, you'll end up giving responses where the errors skew positive. That's why you start seeing phenomena where your learned reward is increasing but your true reward is actually decreasing. Think back to DAgger, where they saw this kind of propagating error with supervised learning.
[00:59:11] And here's another interesting tidbit of information: all these checkpoints, even the very high-KL ones that have quite low success rates, actually have very low losses and very high accuracies as reward functions. So basically, the quality of the reward function is not necessarily connected to the performance of the downstream policy — quite a surprising result.
[00:59:36] you know quite surprising result of you have a question yeah so back to
[00:59:40] of you have a question yeah so back to the um a pairwise comparison so if your
[00:59:45] the um a pairwise comparison so if your object St you are comparing canot
[00:59:47] object St you are comparing canot perfectly do this kind of pair wise
[00:59:49] perfectly do this kind of pair wise comparison with so for example like say
[00:59:51] comparison with so for example like say the game uh rock paper stos right say
[00:59:54] the game uh rock paper stos right say rock is bigger
[00:59:56] rock is bigger like prefer against but it's not like a
[01:00:00] like prefer against but it's not like a perfect partially order SI then what can
[01:00:02] perfect partially order SI then what can we do you mean like if the reward
[01:00:04] we do you mean like if the reward function is not transitive it's a great
[01:00:07] function is not transitive it's a great question yeah so there I mean there
[01:00:09] question yeah so there I mean there there's an interesting like outcropping
[01:00:10] there's an interesting like outcropping of work that are that is basically
[01:00:13] of work that are that is basically trying to get away from the reward
[01:00:16] trying to get away from the reward maximization uh framework and think of
[01:00:18] maximization uh framework and think of this as a game um where instead of
[01:00:20] this as a game um where instead of saying I want to generate um responses
[01:00:23] saying I want to generate um responses that are the highest reward responses we
[01:00:25] that are the highest reward responses we should think of this at like the policy
[01:00:27] should think of this at like the policy optimization level and I should search
[01:00:28] optimization level and I should search for a policy where the average win rate
[01:00:31] for a policy where the average win rate if I take the expectation of the win
[01:00:33] if I take the expectation of the win rate of sampling a an action from the
[01:00:36] rate of sampling a an action from the policy that I'm optimizing and then I
[01:00:38] policy that I'm optimizing and then I have some comparison or adversary policy
[01:00:40] have some comparison or adversary policy I'm going to sample an action from that
[01:00:42] I'm going to sample an action from that adversary policy what is the expected
[01:00:44] adversary policy what is the expected win rate of the action sampled from my
[01:00:46] win rate of the action sampled from my policy compared to the action sampled
[01:00:48] policy compared to the action sampled from that adversary and so now we have
[01:00:51] from that adversary and so now we have to pick like an adversary policy class
[01:00:53] to pick like an adversary policy class which kind of makes sense in your um uh
[01:00:55] which kind of makes sense in your um uh Rec pris example right because like yeah
[01:00:57] Rec pris example right because like yeah there's not like an optimal action to
[01:00:59] there's not like an optimal action to take here it depends on what the policy
[01:01:00] take here it depends on what the policy of your adversary is here to know what's
[01:01:02] of your adversary is here to know what's good and what's bad so um in this case
[01:01:05] good and what's bad so um in this case it it exactly does address this issue of
[01:01:07] it it exactly does address this issue of if you have only a partial ordering you
[01:01:09] if you have only a partial ordering you you can't necessarily compare all pairs
[01:01:10] you can't necessarily compare all pairs of responses we can still use that kind
[01:01:12] of responses we can still use that kind of data we don't have to like uh uh be
[01:01:15] of data we don't have to like uh uh be bottlenecked by fitting a reward
[01:01:17] bottlenecked by fitting a reward function first so there the me methods
[01:01:20] function first so there the me methods like um sort of like a Nash yeah like
[01:01:23] like um sort of like a Nash yeah like direct Nash optimization or Nash
[01:01:24] direct Nash optimization or Nash learning from Human feedback are like
[01:01:27] learning from Human feedback are like this like other kind of newer I guess
[01:01:30] this like other kind of newer I guess like family of of algorithms that really
[01:01:31] like family of of algorithms that really interesting interesting that frog paper
[01:01:33] interesting interesting that frog paper scissors doesn't actually have a
[01:01:34] scissors doesn't actually have a deterministic Nash Point um it has like
[01:01:37] deterministic Nash Point um it has like a stochastic one but the stochastic one
[01:01:39] a stochastic one but the stochastic one is just like equal probability over
[01:01:40] is just like equal probability over everything um that has related to some
[01:01:42] everything um that has related to some deeper results they actually say that
[01:01:43] deeper results they actually say that like examples like this sort of
[01:01:44] like examples like this sort of plurality of preferences are actually
[01:01:46] plurality of preferences are actually unsatisfiable so you cannot actually
[01:01:48] unsatisfiable so you cannot actually theoretically train and in practice
[01:01:50] theoretically train and in practice train a model that will satisfy that set
[01:01:52] train a model that will satisfy that set of
[01:01:53] of references I think consider the pizza
[01:01:55] references I think consider the pizza one way that that isn't motivated to is
[01:01:57] one way that that isn't motivated to is if you have different distributions of
[01:01:58] if you have different distributions of populations with different um
[01:02:00] populations with different um preferences even if each of them are
[01:02:01] preferences even if each of them are internally trans you know consist with
[01:02:03] internally trans you know consist with transs to be they not
[01:02:07] Yeah, so, assuming that reward hacking is not happening, what in DPO prevents it from taking large steps in the optimization?
[01:02:24] What do you mean by that?
[01:02:26] Like, assuming reward hacking is not happening, is there something DPO is doing that's preventing it from taking too large of a step in optimization?
[01:02:37] I think the KL regularization. If you look at the beta term: the sigmoid essentially saturates after a point, right? And if the beta term is higher, you have to increase the differences less to satisfy the loss. So roughly, beta controls how quickly you change under the loss function — but there are other parameters as well, like learning rates and so on, which also affect this.
[01:03:10] Yeah, I think for this reward hacking problem, one of the methods people usually try to use to address the issue is ensemble models, right? Is that something that could be done with direct methods like DPO — like an ensemble of DPOs, or something like that?
[01:03:27] You could. The problem with that is you then have to keep all the models around, but there are smarter ensembling things you can do.
[01:03:36] Yeah, you don't have to have complete copies of your entire model to have an ensemble, for example; you can ensemble sub-pieces of your model, or even represent your reward model as a distribution instead of a single scalar. And this starts tying back into those situations where we have a variety of preferences in our data that aren't always consistent with each other. One way of modeling that data better is to say I have a non-deterministic, or multimodal, reward function instead, and if you have a way of representing this with a generative model architecture, then you can still stick it into a DPO-looking loss.
[01:04:26] Your answer kind of answered my question, but I just wanted to ask in general: what are the promising directions for addressing reward hacking in DPO?
[01:04:38] Well, I mean, there's a number of reward hacking works in classic RL — there was a huge number of those — and some of them transfer pretty straightforwardly. Here's something I'm kind of excited about, an interesting thing that came from the open-source community, in a way where they didn't actually understand what they were doing. They kind of stumbled across this very randomly — a very questionable group of researchers on Twitter. What they discovered is basically that if you just take a bunch of random RLHF models and literally weight-average them, they just become better: take the weights, take the average, and it just becomes better. And it turns out there's a ton of work on this from around 2018 on the optimization landscape of these things, and, you know, they very randomly stumbled across it, but it seems to work. There's a paper called WARM, weight averaged reward models, which kind of makes that point for reward models: if you train an ensemble of reward models, you don't keep the ensemble but you average them — we average the ensemble — and that significantly improves your robustness as a reward function. And the same seems to actually be happening with DPO: if you train an ensemble of DPO models, or take your pretrained models, and weight-average them, that seems to significantly improve robustness as well. Twitter randomly stumbled across this without really understanding it, but it seemed to work for them, and it turns out there are really deep reasons behind it. So that's one thing I'm kind of excited about, and actually, after we get this paper out — we have right now something on the order of 400 checkpoints — the next thing we're probably going to do is try to see how much robustness we can squeeze out of some sort of averaging strategy. People do smart things now, like evolutionary merging and things like that.
[01:06:32] Maybe one thing that's also sort of interesting: we're starting with this KL-penalized reward-maximization objective. That was the original policy-learning objective: maximize reward subject to a KL penalty, and the intuition is that we want to keep the KL small so we don't over-optimize our reward function. But this is kind of a crude way of encoding that desideratum. Something closer to what we really want is to say: in the places where my reward model has high uncertainty, those are the places where I want to be conservative. But if I have something that's out of distribution yet my reward model is really confident, where I have low uncertainty over what the reward should be, or the lower percentiles of reward are still quite high, then it's okay to change my model a lot in those places. So one direction I think is also interesting here is getting away from the KL-regularized policy optimization objective. That objective is nice because it gives us this one-to-one correspondence between policies and reward models, but it's possible it's a bit too crude and leaves some performance on the table, because we're over-constraining our policy.
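The KL-penalized objective being discussed, maximize E[r(x, y)] minus beta times KL(pi || pi_ref), is commonly implemented by shaping the reward with a per-sample log-ratio penalty. A minimal sketch under that assumption; the function name and beta value are illustrative, not from any specific codebase:

```python
import torch

def kl_penalized_reward(reward, logp_policy, logp_ref, beta=0.1):
    """Shaped reward: r(x, y) - beta * (log pi(y|x) - log pi_ref(y|x)).
    The log-ratio is a single-sample Monte Carlo estimate of the KL term,
    so samples that drift far from the reference policy are penalized."""
    return reward - beta * (logp_policy - logp_ref)

# toy usage: equal raw reward, but the second sample has drifted far from
# the reference policy and ends up with a much lower shaped reward
reward = torch.tensor([1.0, 1.0])
logp_policy = torch.tensor([-2.0, -2.0])
logp_ref = torch.tensor([-2.1, -6.0])
shaped = kl_penalized_reward(reward, logp_policy, logp_ref, beta=0.5)
```

This is exactly the "crude" global penalty the speaker is critiquing: beta applies uniformly everywhere, regardless of how uncertain the reward model is at a given sample.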
[01:07:52] A quick point: as I said, you can think about all of this in a Q-function framework, essentially. One thing that I think is interesting to pursue is that initially DQNs were really hard to get working, and after a couple of years they worked great, because a lot of tricks were found to make them stable, make them perform, keep the bootstrapping from blowing up, keep them from overfitting, and so on. I think a lot of those things could potentially transfer to this setting. In particular, the weight-averaging idea is very much inspired by results in DQN, where you have a target function. I don't know if you've done the homework already, but we have a target Q-function that is actually a weight average, some sort of Polyak averaging, so it's kind of staggered, and that seems to improve stability a lot. I think similar results carry over here, so to speak, but again, that's sort of in the pipeline of experiments.
[01:08:48] I'm not sure if this was already touched upon, but are there any risks with overfitting? And in certain domains, like medical domains, where there are very, very small datasets, is there scope for this kind of approach?
[01:09:01] This is essentially an overfitting problem, right? You have limited data coverage and you're extrapolating in the wrong way. It's a little trickier than that, though. People have actually found that in DPO and other settings this overfitting is somewhat beneficial: you can do multiple epochs on small datasets, and for some of our experiments you can use a very tiny preference dataset and it still sort of works. People do multiple epochs and are very clearly overfitting, but the performance still keeps improving. Then again, a lot depends on how you evaluate these models; you're probably losing somewhere else, so it really depends on how you're going to use them.
[01:09:41] One thing to keep in mind is that we've been talking a lot about reward over-optimization, or reward hacking, which is the discrepancy between the proxy reward that we're actually optimizing, the thing we learn from feedback, and the true reward that we don't actually get to observe. But there's another discrepancy, just mentioned, that we haven't really talked about: when we evaluate these models in practice, we're typically not evaluating average reward. We're typically evaluating something more like a win rate, comparing to some baseline policy, so the setup is almost satisficing rather than maximizing. That's another layer of disconnect between the thing we're using as our training objective and the thing that is actually providing utility for the human, the person who's actually building this. So there's another layer where we can get overfitting, in a sense, to the objective.
[01:10:38] So my understanding is that it's kind of two stages: the first is the normal supervised training of the language model, and the second is the DPO training, where you use the KL divergence to make sure you're not moving too far away from your original supervised model. Since it's two stages, is it possible to combine the preference tuning with the normal supervised training, so that as you're training the model from the start you're also folding in these preferences? It seems like the KL is kind of a proxy for making sure the model doesn't move too far away, but if you do them at the same time, maybe that would help.
[01:11:12] So you're talking about merging the supervised instruction tuning and the preference tuning parts? They're one after another right now, right? Yeah.
[01:11:23] There are a few works that have tried to do that, and I think it's still an active area of research, but maybe it's useful to understand why we even do instruction tuning before RLHF. When you start with a pretrained model, it will give you gibberish responses that aren't even aligned with the instruction you're giving. The instruction tuning helps us generate the right preference dataset, where the model is starting to follow the question being asked. In a very typical RLHF pipeline you don't even have a preference dataset to begin with; that's why you do the instruction-tuning step. But if you already have preference datasets, people are coming up with methods that combine the instruction tuning and the preference learning into the same optimization algorithm. They're not very different; they're usually combinations of loss functions you've already seen. But it's still somewhat of an active area of research.
[01:12:16] um but it's still somewhat of an active area of research but like could you do
[01:12:18] area of research but like could you do like maybe a dages kind of thing where
[01:12:20] like maybe a dages kind of thing where you train the model and then like you do
[01:12:22] you train the model and then like you do find tun at one there there's methods
[01:12:24] find tun at one there there's methods which do that as well yeah um in my
[01:12:27] which do that as well yeah um in my personal experience it didn't work very
[01:12:29] personal experience it didn't work very well but like um there are papers that
[01:12:31] well but like um there are papers that claim that works really well so like
[01:12:34] claim that works really well so like um we also have problems trying to get
[01:12:38] um we also have problems trying to get it doesn't mean it's impossible yeah it
[01:12:39] it doesn't mean it's impossible yeah it doesn't mean yeah there's a lot of
[01:12:40] doesn't mean yeah there's a lot of details that I'll go into this like I
[01:12:41] details that I'll go into this like I mean
[01:12:42] I'm also personally somewhat suspicious of these approaches because, as you're seeing from the over-optimization results, the optimization landscape is so complicated, with so many different pitfalls, that trying to combine the stages and navigate all of that in a single-shot optimization seems pretty hard. Probably not impossible, but pretty hard, and to me it's not really clear what the benefits would be.
[01:13:07] I also think this goes back to the exploration question, to how much exploration there is. On the first day, so to speak, OpenAI said "let there be an SFT model," and there was no preference dataset yet. So in order to actually get the preferences, you needed a source of exploration to generate the trajectories to collect preferences over; you had to do one stage and then the other. That's the original way to think about it. But if we're trying to do this in a single offline stage, then we're stuck with whatever data we have in terms of exploration, and there's only going to be so much you can do with purely offline data. So doing this iteratively, being able to sample and then get new preferences over those samples, is useful.
[01:14:01] So you mentioned the discrepancy where we're training to maximize the reward function, but then during evaluation we're evaluating based on win rate. Could we just use a different objective function to optimize directly for win rate? Is that a possibility?
[01:14:15] Yeah, these Nash algorithms are basically doing that. Instead of deriving this as "we have some reward function and we're maximizing reward," it's literally: suppose I can only evaluate the preference function, not the reward function, meaning a function that takes two responses and says which one is better. Then one objective I could come up with is the expectation, under my policy, of that preference function computed on one response from my policy and one response from some baseline policy, or from an adversarial, worst-case policy. Now you're explicitly optimizing for either average-case or worst-case win rate against some reference or comparison.
[01:15:02] against some reference or comparison so how do that compared to
[01:15:04] how do that compared to DPO depends on who you ask I mean the
[01:15:08] DPO depends on who you ask I mean the the the paper introducing these methods
[01:15:10] the the paper introducing these methods like show improvements I mean I think
[01:15:12] like show improvements I mean I think one of the ways that it's helpful is
[01:15:15] one of the ways that it's helpful is that you're not again like going through
[01:15:18] that you're not again like going through a reward function and so you're not
[01:15:20] a reward function and so you're not requiring um you're not explicitly
[01:15:22] requiring um you're not explicitly training to have some uh complete total
[01:15:25] training to have some uh complete total ordering over all of your responses and
[01:15:27] ordering over all of your responses and so like this can be helpful it's it's
[01:15:29] so like this can be helpful it's it's not as like constraining of a um um of a
[01:15:33] not as like constraining of a um um of a sort of framework to think about um at
[01:15:35] sort of framework to think about um at the same time like any policy we end up
[01:15:37] the same time like any policy we end up with compared to some reference model we
[01:15:39] with compared to some reference model we can interpret as a reward function so
[01:15:41] can interpret as a reward function so like I'm not exactly sure how to think
[01:15:44] like I'm not exactly sure how to think about the advantages there but yes like
[01:15:46] about the advantages there but yes like if you look at the experiments in the
[01:15:47] if you look at the experiments in the papers they will say like yeah we have
[01:15:48] papers they will say like yeah we have improvements in like win rate which kind
[01:15:50] improvements in like win rate which kind of makes sense right you're you're
[01:15:51] of makes sense right you're you're Ealing with win rate and now we're
[01:15:53] Ealing with win rate and now we're training for win rate instead of
[01:15:54] training for win rate instead of training forward maximization is not
[01:15:55] training forward maximization is not that surprising you can see improvements
[01:15:58] that surprising you can see improvements there's also another point that if you
[01:15:59] there's also another point that if you do consider Bradley Terry model to be
[01:16:01] do consider Bradley Terry model to be true that this is your preference model
[01:16:03] true that this is your preference model this is the data generation model
[01:16:05] this is the data generation model maximizing reward and maximizing
[01:16:07] maximizing reward and maximizing probability of win rate are actually
[01:16:08] probability of win rate are actually like evental so as I said like this
[01:16:11] like evental so as I said like this reward Max meion thing because of the
[01:16:12] reward Max meion thing because of the free parameter has very high variance so
[01:16:14] free parameter has very high variance so what open AI does in other papers but
[01:16:16] what open AI does in other papers but it's always like a footnote in like a
[01:16:18] it's always like a footnote in like a you know 100 page paper is they actually
[01:16:20] you know 100 page paper is they actually normalize their reward functions so they
[01:16:22] normalize their reward functions so they substract some human Baseline so like
[01:16:24] substract some human Baseline so like the reward of the human completion or
[01:16:27] the reward of the human completion or the human data is like zero and what
[01:16:29] the human data is like zero and what this gives you is essentially actually
[01:16:31] this gives you is essentially actually the lock probability like then the
[01:16:33] the lock probability like then the reward function they optimize with p is
[01:16:34] reward function they optimize with p is the lock probability that generation is
[01:16:37] the lock probability that generation is preferred over the human generation
[01:16:39] preferred over the human generation under Brad so that these things are very
[01:16:42] under Brad so that these things are very tightly coupled and the normalization
[01:16:44] tightly coupled and the normalization part you know from our perspective
[01:16:46] part you know from our perspective actually doesn't change the optimal
[01:16:48] actually doesn't change the optimal policy it just you know in things I've
[01:16:50] policy it just you know in things I've kind of seen and kind of experimented it
[01:16:52] kind of seen and kind of experimented it actually significantly reduces the
[01:16:53] actually significantly reduces the variance which is kind of the intuition
[01:16:55] variance which is kind of the intuition there but actually there's very direct
[01:16:57] there but actually there's very direct way to tie that with essentially
[01:16:59] way to tie that with essentially maximizing probability
[01:17:01] maximizing probability of of winning essentially it's like a
[01:17:04] of of winning essentially it's like a baseline yeah it's exactly a baseline
[01:17:07] baseline yeah it's exactly a baseline essentially and you know this Baseline
[01:17:09] essentially and you know this Baseline actually works the variance actually
[01:17:10] actually works the variance actually significantly goes down why don't we do
[01:17:13] significantly goes down why don't we do one
[01:17:14] one more somebody hasn't
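Under the Bradley-Terry model mentioned here, the probability that a generation beats the human baseline is a sigmoid of the reward gap, so subtracting the human-completion reward ties the reward directly to win probability while leaving the optimal policy unchanged. A small illustrative sketch of that relationship, not OpenAI's actual implementation:

```python
import math

def bt_win_prob(r_y, r_baseline):
    """Bradley-Terry: p(y beats baseline) = sigmoid(r(y) - r(baseline))."""
    return 1.0 / (1.0 + math.exp(-(r_y - r_baseline)))

# subtracting the human baseline reward shifts every reward by a constant,
# so rankings between responses (and hence the optimal policy) are unchanged
r_human = 3.0
normalized = lambda r: r - r_human   # the human completion now scores exactly 0
```

Because the shift is constant across responses to the same prompt, it acts purely as a variance-reducing baseline, exactly as discussed above.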
[01:17:22] So is this applicable to multi-objective RL?
[01:17:32] There's a paper called MODPO, which stands for multi-objective DPO, and yes, you can do DPO in this setting, where you basically condition on a scalarization of your multiple objectives, a particular weighting. Correct me if I'm wrong, but I think you're still learning a weighting-conditioned policy where you can pick the mixture: you have all of your different objectives and you can pick what weighting over those objectives you want to use, which policy you actually want to end up with, how you trade them off, and you don't have to retrain for every single different scalarization. There are others that do this with uncertainty over the reward model as well.
[01:18:31] Again, thank you very much for having us. Thanks for coming. All right, good luck with the midterm, everybody. See you Wednesday.
Lecture 010
Stanford CS234 Reinforcement Learning I Offline RL 3 I 2024 I Lecture 10
Source: https://www.youtube.com/watch?v=F6APGIAm5fw
---
Transcript
[00:00:05] ...here asking you about DPO and
[00:00:28] RLHF for
[00:01:12] Okay, great. Why don't you turn to somebody and compare your answers?
[00:02:26] Okay, so these are still pretty good; there's a lot of disagreement on one of them. The first one is false. Does somebody want to tell me which method does not learn an explicit representation of the reward function? They do not both learn one: one of them does and one of them doesn't. Which one does? Yeah, RLHF learns one and DPO doesn't, that's exactly right. So RLHF does learn one, and this statement is false.
[00:03:00] Now, it's true that DPO assumes a particular parametric representation for the reward model; both methods do. But DPO then inverts that, so you can directly do policy learning; it never has to explicitly learn a reward function in the way that RLHF does. What about the second one? What do you think: is the learned policy constrained to be only as good as the best examples in the pairwise preference data?
[00:03:27] So I think this is false. Does somebody who also said false want to say why?
[00:03:37] Yeah, maybe because we're using a function to approximate the policy, so a gradient step could move it somewhere strictly better than the examples.
[00:03:55] Yeah, exactly. At least if we think about the RLHF case, we are using this information to learn a reward model. If that reward model is good, and can extrapolate and generalize beyond the samples that we have, then when you do PPO using that reward model you can learn a policy that's better than your demonstrations. So this can in fact go beyond the best performance that's inside your data. Or, if you think of it in terms of the reward: maybe some of the examples you're showing aren't that great, but you can still use them to get a better policy. In fact, you might think that's probably exactly what's happening with ChatGPT. They initially got the fine-tuned model from supervised learning, then showed those examples to people, who would pick between them; then it learned a reward model, and then they got a policy that was better at generating those sorts of responses. So you could argue that ChatGPT is an example suggesting that, yes, this can often be true: we can learn a good enough reward model such that if we do PPO, at least a little bit of it, we can actually outperform the training
[00:05:04] actually outperform the um the training examples um po DPO does use a reference
[00:05:07] examples um po DPO does use a reference policy both of them
[00:05:10] policy both of them do and this idea will come up we've seen
[00:05:13] do and this idea will come up we've seen it a few times already and it'll
[00:05:14] it a few times already and it'll continue to come up today this idea of
[00:05:16] continue to come up today this idea of thinking is essentially of how far can
[00:05:18] thinking is essentially of how far can we extrapolate or how far can we
[00:05:19] we extrapolate or how far can we interpolate from our data um and when do
[00:05:22] interpolate from our data um and when do we need to sort of constrain ourselves
[00:05:24] we need to sort of constrain ourselves to be fairly close either in the policy
[00:05:26] to be fairly close either in the policy space or something else um so that we
[00:05:29] space or something else um so that we don't generalize to parts of the domain
[00:05:31] don't generalize to parts of the domain where we might have really bad
[00:05:32] where we might have really bad performance we saw that in imitation
[00:05:34] performance we saw that in imitation learning we saw that in DPO we've seen
[00:05:37] learning we saw that in DPO we've seen that in PPO in all of these cases where
[00:05:39] that in PPO in all of these cases where we're thinking given the data that we
[00:05:40] we're thinking given the data that we have how can we um sort of generalize as
[00:05:43] have how can we um sort of generalize as much as possible but not
[00:05:47] further
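That extrapolate-but-stay-close idea is exactly what the reference policy implements. As a rough sketch (toy numbers, not from the lecture; the value of `beta` and the log-probabilities here are illustrative assumptions): RLHF-style training subtracts a KL-style penalty for drifting away from the reference policy, while DPO folds the same reference policy into a single classification-style loss on each preference pair.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def kl_penalized_reward(r: float, logp_pi: float, logp_ref: float,
                        beta: float = 0.1) -> float:
    """RLHF-style per-sample reward: task reward minus a penalty that grows
    as the policy's log-prob drifts above the reference policy's."""
    return r - beta * (logp_pi - logp_ref)

def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair (w preferred over l).

    Each response's implicit reward is beta * log(pi / pi_ref); the loss is
    -log sigmoid of the winner-minus-loser reward margin."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(sigmoid(margin))
```

When the learned policy still equals the reference, the margin is zero and the DPO loss sits at log 2; raising the preferred response's log-probability relative to the reference drives it down, but only through that reference-anchored ratio.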
[00:05:49] Right, okay. So we're getting into a part of the class which is probably my favorite part, though of course I'm biased, I like all of it. We've been talking about learning from past human data: we first saw learning from past human demonstrations, then we saw learning from past human preferences, and today we're going to think just generally about learning from past data. That data could be generated by humans, or it could be generated by your robot, or something else. And then next time we're going to start talking about fast, or data-efficient, learning, and that's going to be useful for doing Homework 3 as well, because the theory question for Homework 3 is focused on data-efficient learning.
[00:06:30] Right, so we'll focus on that now. In particular, for today we're going to discuss, like we often do, separating things into a policy evaluation question and then a policy learning question, because we've seen repeatedly that if we think about whether we can evaluate how good a particular policy is, we can often use that as a way to bootstrap policy improvement or policy optimization. But I want to start with just a question, which is: can we do better than imitation learning? And of course this relates to the question I just asked you in the Refresh Your Understanding. So I'm just going to give an example. In my lab we often think about education data or healthcare data, or other cases where decisions are being generated by humans or automated systems, where you might have, say, a series of patients (you could think of this as medical record data), and each of those people is getting a series of interventions, maybe a medication, maybe a medical checkup, maybe a vaccine, and then we observe some sort of outcome.
[00:07:30] And in imitation learning we saw the idea of saying, well, could we try to mimic the best human, or could we try to mimic expert data? So an important question is whether or not we can go beyond that. We just thought about one example where we might be able to go beyond that, but I think there's a huge number of places where we'd love to be able to go beyond the limits of at least the average human performance. Healthcare is certainly one of them: in America we pay a lot for our healthcare, and we don't have particularly good outcomes compared to how much we pay. So you would hope that maybe we could learn, through reinforcement learning or other methods, whether there are better sequences of decisions we could make in order to better assist, say, a new patient.
[00:08:12] Okay, so I'll just give a little bit of backstory on why I started thinking about this question. Maybe about a decade ago, my grad students and I were collaborating with Zoran Popović and his lab at the University of Washington, and he had this game called Refraction. Refraction helps teach kids about fractions, one of the concepts kids typically find really challenging when they start to learn math. In it you have spaceships, and you're trying to fuel a spaceship by splitting laser beams in certain ways, so that you create fractions, or sub-parts of laser beams, to fuel the spaceships and save the agents.
[00:08:51] In this case, roughly around 500,000 kids have played this game, and what we were thinking about is how we could customize it to make it more personalized and adaptive to students. In particular, there are all these different game activities and game levels, and we wanted to understand how we could use information about how the student was doing in one of the activities to adaptively select which activity to do next. So this is a decision policy, and you can imagine conditioning on all sorts of state features. State features could be things like how long they took, but also things like where they put down laser beams, or what series of mistakes they made; you can imagine it being a really, really rich context or state space. And then there were lots of different next levels we could do. Okay.
[00:09:40] So that was the question we were interested in. And in particular, in this case we had access to about 11,000 learners who had been given activities in a random order. That was because there was a human designer who had designed a specific sequence through the game, but we weren't sure whether that was actually optimal or close to optimal, and what we wanted to do was see whether or not we could find, using reinforcement learning, an adaptive policy to help students persist at the game for longer. This game was offered on something called BrainPOP, which some of you might have seen before; it offers lots of educational games for kids, and a lot of kids use it for a little while and then they stop, so it's an optional game. And we had some evidence that suggested that if kids played the game they were likely to learn things, but if they don't play the game they are not. So we wanted to think about increasing student persistence, in terms of the number of levels, and we really wanted to go beyond expert performance in this case, beyond what the experts had done. So what we did is we used reinforcement learning, and we wanted to see if we could outperform, essentially, behavior cloning. And to give a spoiler of the types of ideas we're going to see today: in this case we found we could learn a policy that increased persistence by about 30%. And so that suggests that in some domains there may essentially be enough data and evidence to find new decision policies that are substantially better than what is currently being done. And that's what inspires me and my lab a lot: to think about where we can use natural variation in the decisions that are being made, or past experiments that were run, in order to find decision policies substantially better than the ones currently being used.
[00:11:24] Yeah? Not a super relevant question to the subject matter, but just out of curiosity: was that 30% distributed uniformly, or was it that the kids who already played longer were the ones affected, or that the ones who stopped early would actually continue?
[00:11:41] This is a great question. A really big challenge, often, is understanding who you are moving inside the distribution. This is just an expectation, like most of what we've been estimating, and we did not analyze that too much in this case. In another, much more recent paper, which I think came out in January or something, we did exactly that, an analysis to try to see who was actually impacted, and there we were really excited that it was the lowest performers that were most impacted. That was exciting because one of the big concerns is that a lot of these systems just increase the inequity gap, and this is particularly a problem in these optional ones, because it's normally the kids that are furthest ahead that have the highest usage. So a great question, and it also raises, on the technical side, questions around getting estimates for sub-parts of the population and doing heterogeneous treatment effect analysis, to figure out which groupings of contexts have different forms of Q values.
[00:12:37] Yeah? A question: in terms of the policy, what was actually being changed? At the simplest level, is it the difficulty of the fractions, or how hard it gets as they're going up?
[00:12:47] Yeah, it's a great question. In this case I can't remember what the final exact policy they were using was, but the types of things we were varying in this case were things on the fractions, like changing the numbers, as well as different aspects of how tricky it is graphically to do. So there were a couple of different things that we could manipulate as well.
[00:13:07] As you can see just visually here, these look quite different. One thing that we found in some other work, in a game called Battleship Numberline (which I was excited about because recently my son was using BrainPOP and it just popped up, and I thought, we worked on that, so that was exciting), which is another game to do with fractions: we found there that variability was incredibly important for persistence. So just changing how things look, in that case how big the battleships were, also makes a very big difference to persistence and engagement. I think that's actually an interesting question too, in terms of including the history in the state features, to try to capture things like people caring about variability.
[00:13:49] So, great questions. We'll talk a little bit more about this example later, about some things that we tried that didn't work in this domain. But this is just to highlight that I guess we shouldn't set our expectations too low. I think that imitation learning is amazing, and of course if you're trying to imitate the best surgeons in the world, that's incredible, but there are many cases where we think we can go beyond human performance, particularly in cases where our high-level principles don't inform what we should do at a more micro level. So for example, here we might have general principles of learning science, but they don't say which activity to do exactly when, and that's where being data-driven can be really helpful.
[00:14:30] Okay, let me give you another example. Another area we think about a lot is healthcare. We've collaborated a lot with Finale Doshi-Velez at Harvard and her lab. This is an example thinking about hypotension and trying to optimize different policies for that. There's a really amazing dataset called MIMIC that comes out of, I think, MIT and MGH (Mass General Hospital), which has lots and lots of electronic medical record data. What they did in this particular paper is look at the behavior policy, that's this flat line, and see if they could learn policies, using a method called POPCORN, that they thought would be much better. And again, here the results depend on the method and some of the hyperparameters they're looking at, but the important thing just to notice here is that a number of these policies are substantially better than the baseline, suggesting again that there may be domains where we can leverage the intrinsic variability in the data and identify, in a systematic way, things that are working much more successfully.
[00:15:29] So when we think about doing this generally, I would call this offline, or batch, or counterfactual RL, and it's counterfactual because what we're trying to do is estimate or learn policies that don't exist in the actual data collection strategy. So we have the setting now where we'll assume, like in imitation learning, that we have a dataset of n trajectories. We're going to assume now that we're going back to the standard MDP setting, and it's not pairwise preferences; we're just back to having sequences of states and actions and rewards.
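To make that setup concrete, here is a minimal sketch (the toy trajectories, state and action names, and the discount factor are made up for illustration): the batch is just a list of trajectories, each a sequence of (state, action, reward) tuples, and the simplest quantity we can estimate from it is the value of the data-collecting behavior policy itself, by averaging discounted returns.

```python
# Toy offline batch: n trajectories, each a list of (state, action, reward)
# steps, all generated by whatever behavior policy produced the data.
D = [
    [("s0", "a1", 0.0), ("s1", "a0", 1.0)],
    [("s0", "a0", 0.0), ("s2", "a1", 0.0), ("s1", "a0", 1.0)],
]

def discounted_return(traj, gamma=0.9):
    """Sum of gamma^t * r_t along a single trajectory."""
    return sum(gamma ** t * r for t, (_, _, r) in enumerate(traj))

def behavior_value_estimate(batch, gamma=0.9):
    """Monte Carlo estimate of the behavior policy's value from the start
    state: the average discounted return over the batch."""
    return sum(discounted_return(traj, gamma) for traj in batch) / len(batch)
```

Estimating the value of a *different* policy from this same fixed batch, without collecting any new data, is the counterfactual question the lecture turns to.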
[00:16:00] Okay, all right. So in particular we may have things like this, where we have data from one policy and data from another policy, and we want to think about how we can learn from that, thinking about the state distributions, which policy is actually best.
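One classical tool for that kind of counterfactual evaluation is importance sampling. The sketch below is a hedged illustration, not the lecture's formulation, and it assumes we know the behavior policy's action probabilities (which real logs may not record): it reweights each logged return by how much more or less likely the evaluation policy would have been to take the logged actions.

```python
def is_estimate(batch, pi_e, pi_b, gamma=1.0):
    """Ordinary (trajectory-wise) importance sampling estimate of the value
    of evaluation policy pi_e, using data collected under behavior policy
    pi_b. Both policies map (state, action) -> probability of that action."""
    total = 0.0
    for traj in batch:
        weight, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            weight *= pi_e(s, a) / pi_b(s, a)  # likelihood ratio of actions
            ret += gamma ** t * r
        total += weight * ret
    return total / len(batch)

# Toy one-step example: the behavior policy picks actions uniformly, while
# the evaluation policy always picks action "a", which happens to pay off.
batch = [[("s", "a", 1.0)], [("s", "b", 0.0)]]
pi_b = lambda s, a: 0.5
pi_e = lambda s, a: 1.0 if a == "a" else 0.0
```

Here `is_estimate(batch, pi_e, pi_b)` returns 1.0: the trajectory where the logged action matches the evaluation policy gets weight 2, the other gets weight 0, so the estimate recovers the value of always choosing "a" even though the data came from a different policy.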
[00:16:14] best okay now I'll just highlight here two
[00:16:16] okay now I'll just highlight here two reasons why this is hard um so we're
[00:16:21] reasons why this is hard um so we're always trying to estimate a
[00:16:22] always trying to estimate a counterfactual here over what might have
[00:16:25] counterfactual here over what might have happened for that wasn't tried so in
[00:16:28] happened for that wasn't tried so in this case we we don't know for this
[00:16:29] this case we we don't know for this patient group what would have happened
[00:16:30] patient group what would have happened if we gave them that treatment or vice
[00:16:33] if we gave them that treatment or vice versa so just a reminder this is the
[00:16:35] versa so just a reminder this is the fundamental problem of sort of um causal
[00:16:38] fundamental problem of sort of um causal inference and this is going to be a big
[00:16:39] inference and this is going to be a big challenge for us here particularly when
[00:16:41] challenge for us here particularly when we try to go beyond the performance of
[00:16:43] we try to go beyond the performance of the policy we saw in the
[00:16:45] the policy we saw in the past okay um so data is censored and of
[00:16:49] past okay um so data is censored and of course in general again we're going to
[00:16:50] course in general again we're going to need generalization because we don't
[00:16:51] need generalization because we don't want to have to enumerate all the
[00:16:54] want to have to enumerate all the possible
[00:16:55] possible policies okay and I do just want to High
[00:16:59] policies okay and I do just want to High here that in addition to education
[00:17:00] here that in addition to education Healthcare you know if you want to think
[00:17:01] Healthcare you know if you want to think about climate change or many other areas
[00:17:03] about climate change or many other areas there's just a huge number of scenarios
[00:17:06] there's just a huge number of scenarios including robotics because it's often
[00:17:07] including robotics because it's often really expensive to do robotics
[00:17:09] really expensive to do robotics experiments where these types of ideas
[00:17:11] experiments where these types of ideas are
[00:17:12] are helpful okay now one thing you might be
[00:17:12] Okay, now one thing you might be wondering about: when I talk about trying to understand the performance of a new decision policy that was not used to gather the data, you might start to think back to Q-learning. There's been a lot of work on off-policy reinforcement learning from really the very beginning of the field, so you might say, why don't we already have the tools we need to tackle this problem of learning better policies? And in fact, as we saw with ChatGPT, we learn a reward function and then do PPO, which is doing off-policy learning, so that's one example.
[00:17:48] So why can't we do this? Why can't we just do Q-learning or some of the other methods we've seen? One thing to remember is that a little while ago I said we sometimes have this deadly triad of bootstrapping, function approximation, and off-policy learning: when we combine all three of these, things can fail. And that was part of the motivation for PPO, that we don't want to go too far from the data distribution.
[00:18:12] Let me just talk a little bit about what can happen here in the context of Q-learning, so model-free learning. This is BCQ, batch-constrained Q-learning, from Scott Fujimoto. What this shows is a bunch of different methods: this is DQN, deep Q-learning; this is behavior cloning; this is the behavioral policy; and this is DDPG. What they did is gather some data and then try to use different methods to learn a policy from it. And what they found is that some of the methods did really badly even given the behavior data; DQN does about the same as the behavior data. But by being a bit more careful and using methods that were explicitly designed to handle this offline data, in this case BCQ, they could do substantially better.
[00:19:11] And so that suggests that if we know our data is fixed and we know we're not going to get additional data, it may be worth it for us to use different types of algorithms, to handle the fact that our data is constrained and we're not going to keep getting fresh data. So that motivates why we're going to need new methods.
[00:19:32] All right, so now what we're going to do is dive into policy evaluation, and then we'll talk about policy optimization afterwards. Okay, so in batch policy evaluation, what we're thinking about is: we have a particular policy of interest and we have a data set, and we'd like to use that data set to estimate how good that policy is, either for one state or on average over a set of starting states. So it's similar to what we've seen for policy evaluation before.
[00:20:02] Okay, one thing I want to highlight: this is by Phil Thomas, who I had the privilege of having as my postdoc a few years ago; he is a professor at UMass Amherst. We generally want to think about sample-efficient methods for doing this. In this case he was working with Adobe, and they have, you know, 10 to 20 million trajectories. It doesn't matter too much what these lines are; the key thing is that this is the behavior policy, and you want to be learning policies which you're confident are better than your behavior policy. This is just to highlight that depending on the methods you use, you may be confident at very different points, meaning that data efficiency and having a good algorithm are going to matter a lot.
[00:20:45] [Student] By "behavior" you mean the policy that was observed in the current data set? Exactly, the behavior policy. Great clarification question. When I say behavior policy today, what I mean is the policy that was used to gather the data set that you have. So I'll just write that out on here: we're going to assume the behavior policy is the one that was used to gather your data.
[00:21:14] Okay, all right, let's first think about using models. This is actually the first thing we tried to do with Refraction. Travis Mandel was the grad student leading the project, and we thought, okay, great, we have all this historical data, let's just try to learn models from it. So we're going to represent the state space in some way, and then under different actions, which in this case are just different levels and activities (there's only a finite number of them), let's learn a dynamics model.
[00:21:42] So in this case the idea is that we have that existing data set, and we're going to learn an explicit dynamics model, and we can learn an explicit reward model. Now in our case the reward model was known, because the reward is persistence: we essentially got a reward every time the student didn't quit the game. But we didn't know the dynamics model, and so that's what we're using the data to learn.
[00:22:03] Now, as you might imagine, we had to make a lot of choices here about what state representations we would use, and so we thought about lots and lots of different state representations. Okay, but once you have that, you can treat this as a simulator. Now you have your simulator of the world, because you have a dynamics model and a reward model. You can either do this analytically, like in some of the methods we saw in the first few classes, or you can use dynamic programming, or Q-learning / Q-evaluation, to explicitly learn what the value is. But really you can use anything, even Monte Carlo methods, because you can learn from this simulator. So you can either evaluate a specific policy, or learn a new, potentially optimal policy with any other RL method, because now you have a simulator.
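As a concrete sketch of that pipeline (a minimal tabular version I'm writing out, not code from the course): fit a dynamics and reward model by counting over the logged (s, a, r, s') tuples, then run iterative policy evaluation inside the learned model.

```python
import numpy as np

def fit_tabular_model(dataset, n_states, n_actions):
    """Estimate P(s'|s,a) and R(s,a) by counting over logged (s,a,r,s') tuples."""
    counts = np.zeros((n_states, n_actions, n_states))
    rew_sum = np.zeros((n_states, n_actions))
    for s, a, r, s_next in dataset:
        counts[s, a, s_next] += 1.0
        rew_sum[s, a] += r
    n_sa = counts.sum(axis=2, keepdims=True)            # visits to each (s, a)
    # Unvisited (s,a) pairs fall back to a uniform next-state guess / zero reward.
    P = np.divide(counts, n_sa, out=np.full_like(counts, 1.0 / n_states),
                  where=n_sa > 0)
    R = np.divide(rew_sum, n_sa[:, :, 0], out=np.zeros_like(rew_sum),
                  where=n_sa[:, :, 0] > 0)
    return P, R

def evaluate_policy(P, R, policy, gamma=0.95, tol=1e-8):
    """Iterative Bellman backups for V^pi inside the *estimated* model."""
    V = np.zeros(P.shape[0])
    while True:
        # V(s) = sum_a pi(a|s) [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
        V_new = (policy * (R + gamma * P @ V)).sum(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```

Swapping the evaluation step for a max over actions (value iteration) turns the same learned model into a policy optimizer, which is the "learn a new policy" branch the lecture mentions.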
[00:23:15] Okay, let me show you what happens. The first thing I'm going to show you is the following. What we have on the x-axis here is different state representations of this environment. These are obviously really small state spaces; we don't actually think that human learning is encapsulated in five or ten states, but you can just imagine sweeping this, right? These are some of the state spaces we considered, ranging from really, really condensed state spaces to much more complicated ones. What this is showing is normalized score, the held-out log likelihood.
[00:23:49] What this is saying is that, as you might expect, as you increase your state-space complexity you get a better fit on the data: you can better predict the next state of the student if you use a more complex state space. And that's not totally surprising, right, because we think that human learning is complicated, so we really think we are getting a better dynamics model. And again, just to emphasize, this is cross-validation, so it's on a held-out set; it's not training error. Okay, so we're doing better in terms of this. Now, what are we doing with these once we have them? Oh, yeah, go ahead.
[00:24:25] once we have these oh yeah go ahead number yes yeah so the data set is fixed
[00:24:28] number yes yeah so the data set is fixed here what we're trying to do is given
[00:24:30] here what we're trying to do is given the data that you've seen before like
[00:24:31] the data that you've seen before like there are all different ways we just
[00:24:33] there are all different ways we just have clickstream data there's tons of
[00:24:34] have clickstream data there's tons of ways to model that as state spaces we're
[00:24:36] ways to model that as state spaces we're just doing model
[00:24:38] just doing model selection now what we were doing here
[00:24:40] selection now what we were doing here then is once we had that simulator we
[00:24:41] then is once we had that simulator we were trying to learn a good
[00:24:43] were trying to learn a good policy um and then we were evaluating
[00:24:46] policy um and then we were evaluating the performance of that actual policy
[00:24:48] the performance of that actual policy now I'll tell you how we actually
[00:24:49] now I'll tell you how we actually evaluated that policy shortly but this
[00:24:52] evaluated that policy shortly but this is the important thing so this is Shing
[00:24:55] is the important thing so this is Shing that the models that we're getting are
[00:24:56] that the models that we're getting are actually better
[00:24:58] actually better okay but here's the problem if I take
[00:25:02] okay but here's the problem if I take this policy which really or if I take
[00:25:05] this policy which really or if I take this model which really is a better
[00:25:06] this model which really is a better model it really does fit the data better
[00:25:08] model it really does fit the data better and then I do say dynamic programming
[00:25:10] and then I do say dynamic programming with it and I extract an optimal piie
[00:25:12] with it and I extract an optimal piie star so that's the procedure I take my
[00:25:14] star so that's the procedure I take my model I learn an optimal policy and now
[00:25:17] model I learn an optimal policy and now I want to know how good that actually is
[00:25:18] I want to know how good that actually is in the real world if I evaluate that in
[00:25:21] in the real world if I evaluate that in the real world even though the model
[00:25:23] the real world even though the model itself was actually better what you can
[00:25:25] itself was actually better what you can see is the actual value of that policy
[00:25:28] see is the actual value of that policy is is getting
[00:25:29] is is getting worse okay so I've got a better
[00:25:31] worse okay so I've got a better simulator but the policy I get by
[00:25:34] simulator but the policy I get by optimizing for that better simulator is
[00:25:37] optimizing for that better simulator is worse okay so this is the actual
[00:25:41] worse okay so this is the actual unbiased reward estimator and I'll tell
[00:25:42] unbiased reward estimator and I'll tell you shortly how we do that because of
[00:25:44] you shortly how we do that because of course under the model's opinion the
[00:25:47] course under the model's opinion the model thinks it's you know the policy
[00:25:49] model thinks it's you know the policy it's helping produce is great let me
[00:25:51] it's helping produce is great let me just make sure that sort of the pipeline
[00:25:52] just make sure that sort of the pipeline of what we're doing there is clear so
[00:25:54] of what we're doing there is clear so what we
[00:25:56] what we do is we are getting we're going from
[00:26:00] do is we are getting we're going from data to a model of the Dynamics
[00:26:06] model and then we add in a reward
[00:26:08] model and then we add in a reward function and we extract a p star for
[00:26:12] function and we extract a p star for that estimated Dynamics
[00:26:14] that estimated Dynamics model but that's just under the
[00:26:16] model but that's just under the simulator and then what I want to know
[00:26:18] simulator and then what I want to know actually is what the true value is of
[00:26:22] actually is what the true value is of that policy I've
[00:26:23] that policy I've computed okay and what this Mo this
[00:26:26] computed okay and what this Mo this graph is showing is is that even though
[00:26:29] graph is showing is is that even though under my model is getting better the
[00:26:32] under my model is getting better the actual performance of the value I'm
[00:26:33] actual performance of the value I'm getting out is is getting
[00:26:35] getting out is is getting worse now when we first saw this we were
[00:26:37] worse now when we first saw this we were kind of confused we weren't quite sure
[00:26:38] kind of confused we weren't quite sure why this was happening and in fact there
[00:26:40] why this was happening and in fact there had been some work um a few years prior
[00:26:43] had been some work um a few years prior to this in sort of the educational data
[00:26:45] to this in sort of the educational data mining community that suggested doing
[00:26:47] mining community that suggested doing exactly what we were doing here which
[00:26:49] exactly what we were doing here which was build a model then use it to
[00:26:51] was build a model then use it to simulate and learn a good policy and
[00:26:53] simulate and learn a good policy and then deploy the policy that looked best
[00:26:57] then deploy the policy that looked best okay but what are what our work here
[00:26:58] okay but what are what our work here suggested is that was not a good idea
[00:27:01] suggested is that was not a good idea now the reason for that is because the
[00:27:03] now the reason for that is because the model is misspecified
[00:27:07] now that means that under this model
[00:27:10] now that means that under this model missp specification the value it's
[00:27:12] missp specification the value it's getting when it computes the optimal
[00:27:14] getting when it computes the optimal policy so when I
[00:27:17] policy so when I get so you can think of there being two
[00:27:19] get so you can think of there being two things here there is one thing which is
[00:27:22] things here there is one thing which is V hat of Pi hat star which is its own
[00:27:25] V hat of Pi hat star which is its own estimate of how good its value is and
[00:27:27] estimate of how good its value is and then there is the
[00:27:29] then there is the true value of it and these in general
[00:27:32] true value of it and these in general are going to be different and these in
[00:27:35] are going to be different and these in particular are going to be different if
[00:27:36] particular are going to be different if your estimated model is bad so it's
[00:27:39] your estimated model is bad so it's going to think I'm doing great this is
[00:27:40] going to think I'm doing great this is going to you know help students persist
[00:27:42] going to you know help students persist till the Rend s of time but if the model
[00:27:44] till the Rend s of time but if the model is misspecified meaning that even with
[00:27:47] is misspecified meaning that even with infinite data it will not converge to
[00:27:49] infinite data it will not converge to the true model of student learning then
[00:27:52] the true model of student learning then that estimate will be
[00:27:54] that estimate will be wrong and as you might imagine here 20
[00:27:57] wrong and as you might imagine here 20 State model of learning is not that
[00:27:59] State model of learning is not that great yeah you are improving St
[00:28:03] great yeah you are improving St right yeah yeah so it's not saying that
[00:28:06] right yeah yeah so it's not saying that um it's not saying that some of these
[00:28:09] um it's not saying that some of these policies might not be good policies what
[00:28:11] policies might not be good policies what this was arguing to us so it's a great
[00:28:14] this was arguing to us so it's a great question it's not that inside of these
[00:28:16] question it's not that inside of these there might not be pretty decent policy
[00:28:17] there might not be pretty decent policy classes um you know you could argue that
[00:28:19] classes um you know you could argue that Education Works because there's
[00:28:22] Education Works because there's decentish policies I mean I don't have
[00:28:23] decentish policies I mean I don't have perfect models of all of you guys'
[00:28:25] perfect models of all of you guys' learning um but they're still sufficient
[00:28:27] learning um but they're still sufficient for us to be be a to learn and
[00:28:29] for us to be be a to learn and communicate what is AR what I'm arguing
[00:28:31] communicate what is AR what I'm arguing here is that we should not just use um
[00:28:34] here is that we should not just use um the accuracy of the Dynamics model as um
[00:28:38] the accuracy of the Dynamics model as um uh like a proxy for which of the values
[00:28:41] uh like a proxy for which of the values or which of the policies to
[00:28:43] or which of the policies to pick this is arguing that we need
[00:28:45] pick this is arguing that we need separate independent estimates of really
[00:28:47] separate independent estimates of really we want to basically in some ways you
[00:28:48] we want to basically in some ways you know kind of like what we saw with Po
[00:28:50] know kind of like what we saw with Po and policy learning we would like to
[00:28:52] and policy learning we would like to directly evaluate the performance of a
[00:28:54] directly evaluate the performance of a policy instead of using as a proxy sort
[00:28:57] policy instead of using as a proxy sort of um how much our Q function is
[00:28:59] of um how much our Q function is changing or how accurate we think our
[00:29:01] changing or how accurate we think our Dynamics model
[00:29:02] Dynamics model is yeah so when we evaluate policy we
[00:29:07] [Student] So when we evaluate a policy, do we execute it in a real environment, or estimate the policy performance using our model? So there are two options: we can either evaluate it under our simulated model, or under the real environment. We don't want to have to do it in the real environment, because we want to know which policy to deploy before we actually deploy it; otherwise we could kind of be doing online RL. So what I'll shortly be giving you is a way to get an accurate estimate of how good the policy is before we deploy it. I haven't said how to do that yet; I've just argued that using models alone might not be good for your problem.
[00:29:44] [Student] About model misspecification: is one way to think about this that you're kind of overfitting your dynamics model by increasing the number of states you use to represent it? Great question, but we're not overfitting here, because it really is a better fit; it's just still not a perfect fit. In other words, you might say it's not realizable: this is not the real model of student learning, and that means there's still essentially significant bias when we do this learning.
[00:30:18] Now, one thing I just want to note is that model-based learning can still be helpful. One thing we may want to do in this case is explicitly build different models when we know we want to evaluate different policies. Normally, when we fit a model, we try to minimize the loss under the data distribution of the behavior policy: if you have a bunch of data and you fit your dynamics model, you're essentially optimizing for accuracy over your behavior policy. But if you know that the policy you want to evaluate is different, you can weight your errors differently. This is a paper we did a few years ago with several collaborators which highlighted exactly this: you can change your loss function and essentially upweight your accuracy over the state and action pairs that you think you will encounter under a different policy, and that can help a lot.
[00:31:10] What you can see here is that this was for a medical domain; this green here is ground truth, this was our model, and this one is just fit for the behavior policy. What you can see is that by essentially reweighting your data, you can fit dynamics models that much better match the type of dynamics you'd see in the future.
[00:31:33] Dynamics you'd see in the future okay but now I'm going to
[00:31:35] future okay but now I'm going to introduce um sort of model-free methods
[00:31:37] introduce um sort of model-free methods and then we're going to get into
[00:31:38] and then we're going to get into importance sampling there's other ways to
[00:31:40] importance sampling there's other ways to try to do this policy evaluation that
[00:31:42] try to do this policy evaluation that hopefully have different limitations or
[00:31:44] hopefully have different limitations or less limitations compared to the model
[00:31:45] less limitations compared to the model based
[00:31:48] method so one of the first methods um
[00:31:51] method so one of the first methods um that I'll talk about here is fitted Q
[00:31:53] that I'll talk about here is fitted Q evaluation so fitted Q evaluation is
[00:31:55] evaluation so fitted Q evaluation is going to look pretty similar to deep Q
[00:31:57] going to look pretty similar to deep Q learning but there's just going to be a
[00:31:59] learning but there's just going to be a couple important
[00:32:01] couple important differences so our data set here is a
[00:32:03] differences so our data set here is a bunch of just different tuples of State
[00:32:05] bunch of just different tuples of State action reward next
[00:32:07] action reward next state recall that our Q function Q Pi is
[00:32:12] state recall that our Q function Q Pi is just going to be the immediate reward we
[00:32:14] just going to be the immediate reward we got from being in that (s_i, a_i) tuple so
[00:32:17] got from being in that (s_i, a_i) tuple so whatever we saw in our data set and then
[00:32:19] whatever we saw in our data set and then we'll put in plus gamma * V Pi of Si +
[00:32:24] we'll put in plus gamma * V Pi of Si + 1 and then what we do is we try to
[00:32:26] 1 and then what we do is we try to minimize the difference between this
[00:32:28] minimize the difference between this under a parameterized function just like
[00:32:31] under a parameterized function just like what we saw with deep Q learning versus
[00:32:34] what we saw with deep Q learning versus the observed data tuples so you can think
[00:32:36] the observed data tuples so you can think of this as our
[00:32:41] Target and this is called fitted Q
[00:32:43] Target and this is called fitted Q evaluation it's closely related to
[00:32:45] evaluation it's closely related to something called FQI which is fitted Q
[00:32:51] iteration which I think was
[00:32:55] around
[00:33:00] 2005ish
[00:33:02] 2005ish um and so this is very similar to what
[00:33:05] um and so this is very similar to what we've seen with uh deep Q learning
[00:33:07] we've seen with uh deep Q learning before we just fit this function the key
[00:33:09] before we just fit this function the key thing here is that we want it to be for
[00:33:11] thing here is that we want it to be for just a single policy
[00:33:12] just a single policy Pi so we're not doing an
[00:33:15] Pi so we're not doing an argmax
[00:33:17] argmax okay so this is how the algorithm works
[00:33:19] okay so this is how the algorithm works for fitted Q evaluation we sort of
[00:33:21] for fitted Q evaluation we sort of initialize our Q function randomly it
[00:33:23] initialize our Q function randomly it could be a deep Q you know a deep Q
[00:33:25] could be a deep Q you know a deep Q Network it could be something else we
[00:33:27] Network it could be something else we compute the
[00:33:28] compute the targets where when we put in the next q
[00:33:31] targets where when we put in the next q we have to use the policy we're
[00:33:33] we have to use the policy we're interested in
[00:33:36] interested in evaluating so we're we're only doing
[00:33:38] evaluating so we're we're only doing this for the actions we would take under
[00:33:40] this for the actions we would take under the policy we care about we build our
[00:33:43] the policy we care about we build our training set of sort of x's actions and
[00:33:46] training set of sort of x's actions and our output Q Target and then we fit our
[00:33:48] our output Q Target and then we fit our Q
[00:33:51] function and so again the key difference
[00:33:54] function and so again the key difference here compared to dqn is there's no Max
[00:33:58] here compared to dqn is there's no Max we are
[00:34:02] fixing this
[00:34:06] part only for a fixed
[00:34:12] Pi but aside from that it should look
[00:34:14] Pi but aside from that it should look really
[00:34:15] really similar okay and so one of the so this
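Those steps can be sketched on a tiny toy problem. Everything below (the two-state MDP, gamma = 0.5, the exact tabular "fit" step) is invented for illustration; real fitted Q evaluation would fit a function approximator to the same targets.

```python
import numpy as np

# Tiny deterministic MDP: next state equals the chosen action,
# reward is 1 in state 0 and 0 in state 1. Evaluate pi(s) = action 0.
gamma = 0.5
transitions = [(s, a, 1.0 if s == 0 else 0.0, a)   # (s, a, r, s')
               for s in (0, 1) for a in (0, 1)]

Q = np.zeros((2, 2))                     # initialize Q arbitrarily
for _ in range(100):
    targets = {}
    for s, a, r, s_next in transitions:
        # Key difference from DQN: use pi's action at s', never the max.
        targets[(s, a)] = r + gamma * Q[s_next, 0]   # pi(s') = action 0
    for (s, a), y in targets.items():
        Q[s, a] = y                      # the "fit" step is exact in the tabular case

print(Q)  # converges to the analytic Q^pi: [[2.0, 1.5], [1.0, 0.5]]
```

With one-hot features the fit is exact, so the iteration converges geometrically to Q^pi; with a neural network the fit step becomes a regression on the same targets.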
[00:34:19] similar okay and so one of the so this was something that was very closely
[00:34:21] was something that was very closely related to a common algorithm for doing
[00:34:23] related to a common algorithm for doing off policy learning which is fitted Q
[00:34:25] off policy learning which is fitted Q iteration excuse me which is very related
[00:34:27] iteration excuse me which is very related to deep Q
[00:34:29] to deep Q learning and one of the things people
[00:34:30] learning and one of the things people wanted to understand is whether this
[00:34:32] wanted to understand is whether this thing that was working in practice um
[00:34:34] thing that was working in practice um actually had some theoretical grounding
[00:34:36] actually had some theoretical grounding behind it like could we say anything
[00:34:38] behind it like could we say anything formal about how good this approach
[00:34:42] was so just to give you an illustration
[00:34:45] was so just to give you an illustration of the types of guarantees that we can
[00:34:46] of the types of guarantees that we can get in this case what we want to look at
[00:34:49] get in this case what we want to look at in this situation um is to think about
[00:34:53] in this situation um is to think about sort of what is the generalization error
[00:34:56] sort of what is the generalization error okay let me put this in here
[00:34:59] okay let me put this in here okay so I won't go through the whole
[00:35:01] okay so I won't go through the whole paper I just want to give you an
[00:35:02] paper I just want to give you an illustration of the types of guarantees
[00:35:04] illustration of the types of guarantees that you might get in this setting what
[00:35:06] that you might get in this setting what they would like to know in this case is
[00:35:08] they would like to know in this case is to compare the difference between the
[00:35:10] to compare the difference between the value that you will compute under this
[00:35:12] value that you will compute under this procedure versus the true value of the
[00:35:15] procedure versus the true value of the policy this is your normal discount
[00:35:17] policy this is your normal discount factor and then there's a whole bunch of
[00:35:18] factor and then there's a whole bunch of things that are additional let me
[00:35:21] things that are additional let me highlight some important things here n
[00:35:23] highlight some important things here n here is the number of samples you need
[00:35:25] here is the number of samples you need so n tells you about how much data
[00:35:27] so n tells you about how much data you're going to need in order to do
[00:35:28] you're going to need in order to do this so this is sort of how much
[00:35:32] data
[00:35:35] data okay Epsilon is um sort of your
[00:35:41] data okay Epsilon is um sort of your desired Target act
[00:35:46] accuracy okay this is one of the really
[00:35:49] accuracy okay this is one of the really important things so we're going to have
[00:35:52] important things so we're going to have something called a concentr
[00:35:56] something called a concentr ability Co
[00:36:01] coefficient concentrate ability
[00:36:03] coefficient concentrate ability coefficient is going to be the
[00:36:04] coefficient is going to be the difference essentially between the
[00:36:06] difference essentially between the distribution of State action pairs that
[00:36:08] distribution of State action pairs that you have in your data set and the
[00:36:09] you have in your data set and the distribution of State action pairs you
[00:36:11] distribution of State action pairs you would get under your desired policy so
[00:36:13] would get under your desired policy so we saw this before with PPO thinking
[00:36:15] we saw this before with PPO thinking about sort of these Divergences in the
[00:36:17] about sort of these Divergence in the state action
[00:36:19] state action distributions State action
[00:36:23] distributions okay and it's also related
[00:36:26] distributions okay and it's also related to what we'll call overlap
[00:36:29] to what we'll call overlap later so I won't go through all the
[00:36:31] later so I won't go through all the details in this case but I want to just
[00:36:33] details in this case but I want to just give an illustration that people often
[00:36:34] give an illustration that people often think about trying to understand if you
[00:36:36] think about trying to understand if you have a data set of some Behavior data
[00:36:40] have a data set of some Behavior data how accurate you can hope to be of
[00:36:41] how accurate you can hope to be of evaluating the performance and of a
[00:36:43] evaluating the performance and of a policy depends on your discount Factor
[00:36:46] policy depends on your discount Factor because that says sort of how accurate
[00:36:47] because that says sort of how accurate you want to be and how much you care
[00:36:49] you want to be and how much you care about long-term rewards how much data
[00:36:51] about long-term rewards how much data you have in terms of your target error
[00:36:53] you have in terms of your target error and how closely related your state
[00:36:55] and how closely related your state action distributions are from your
[00:36:57] action distributions are from your training set to your test set or your
[00:36:59] training set to your test set or your desired
[00:37:00] desired policy
[00:37:02] policy okay now one of the challenges about
[00:37:04] okay now one of the challenges about this approach is that it generally still
[00:37:06] this approach is that it generally still relies on the Markov Assumption so we're
[00:37:08] relies on the Markov Assumption so we're still assuming our data is all Markov
[00:37:10] still assuming our data is all Markov and it relies on our models in this
[00:37:12] and it relies on our models in this case the sort of Q functions being well
[00:37:15] case the sort of Q functions being well specified so what do I mean by that it
[00:37:17] specified so what do I mean by that it means that we really can fit the Q
[00:37:18] means that we really can fit the Q function like there's some existing Q
[00:37:21] function like there's some existing Q function in the world for our policy and
[00:37:22] function in the world for our policy and we can really fit it and if you say for
[00:37:26] we can really fit it and if you say for example
[00:37:29] example let's say that this is your state space
[00:37:31] let's say that this is your state space it's just one
[00:37:34] dimensional and this is what like your
[00:37:36] dimensional and this is what like your true function looks like you know you
[00:37:37] true function looks like you know you could imagine that look something like
[00:37:40] could imagine that look something like this and let's say that you are
[00:37:42] this and let's say that you are restricting yourself to fit a line like
[00:37:45] restricting yourself to fit a line like that with just two parameters so in that
[00:37:49] that with just two parameters so in that case even if you had infinite amounts of
[00:37:50] case even if you had infinite amounts of data you're still going to have a lot of
[00:37:51] data you're still going to have a lot of error you're not going to be able to fit
[00:37:53] error you're not going to be able to fit the Q function so these methods assume
[00:37:56] the Q function so these methods assume typically realizability that if you had
[00:37:58] typically realizability that if you had infinite data you could fit the function
[00:37:59] infinite data you could fit the function the problem is that you don't have
[00:38:00] the problem is that you don't have infinite data okay all right so now
[00:38:03] infinite data okay all right so now we're going to see a really beautiful
[00:38:04] we're going to see a really beautiful method called importance sampling which
[00:38:06] method called importance sampling which allows us to deal with this we've seen
[00:38:08] allows us to deal with this we've seen sort of brief ideas about this before
[00:38:10] sort of brief ideas about this before but I'm curious if anybody who's seen
[00:38:12] but I'm curious if anybody who's seen this in other classes who's seen
[00:38:13] this in other classes who's seen importance sampling before okay so just a
[00:38:16] importance sampling before okay so just a couple
[00:38:17] couple people this is one of the favorite ideas
[00:38:20] people this is one of the favorite ideas in cs234 According to some past people
[00:38:23] in cs234 According to some past people um all right so the idea what is the
[00:38:25] um all right so the idea what is the motivation so importance sampling is an
[00:38:27] motivation so importance sampling is an idea from statistics um that we have
[00:38:29] idea from statistics um that we have imported over into reinforcement
[00:38:31] imported over into reinforcement learning why would we like to do this
[00:38:33] learning why would we like to do this well we want a method that doesn't rely
[00:38:34] well we want a method that doesn't rely on the models being correct meaning that
[00:42:36] on the models being correct meaning that we can actually you know fit things
[00:42:38] we can actually you know fit things with a two-layer you know uh deep
[00:38:40] with a two- layer you know uh deep neural network or stuff and that we
[00:42:42] neural network or stuff and that we don't have to rely on the Markov
[00:42:43] don't have to rely on the Markov Assumption in the state space we're
[00:38:45] Assumption in the state space we're using we saw before that we could use
[00:38:47] using we saw before that we could use Monte Carlo methods to accomplish this
[00:38:49] Monte Carlo methods to accomplish this for online policy evaluation and now we
[00:38:52] for online policy evaluation and now we want to do this for offline data meaning
[00:38:53] want to do this for offline data meaning that we have data from a different
[00:38:55] that we have data from a different distribution from the policy we want to
[00:38:56] distribution from the policy we want to evaluate and the key challenge as has
[00:38:59] evaluate and the key challenge as has often been is data distribution
[00:39:02] often been is data distribution mismatch okay so here's how important
[00:39:04] mismatch okay so here's how important sampling works let me just specify what
[00:39:06] sampling works let me just specify what this means let's say we want to try to
[00:39:09] this means let's say we want to try to understand the expected reward over a
[00:39:12] understand the expected reward over a distribution of States so for this part
[00:39:13] distribution of States so for this part you can just think of X is equal to
[00:39:18] States and R of x is equal to the reward
[00:39:21] States and R of x is equal to the reward of that
[00:39:22] of that State this works for very very general
[00:39:25] State this works for very very general distributions but you could think of
[00:39:27] distributions but you could think of that here is just being
[00:39:29] that here is just being rewards all right what we're going to do
[00:39:31] rewards all right what we're going to do is the following this is what we would
[00:39:33] is the following this is what we would like to evaluate so you could think of
[00:39:35] like to evaluate so you could think of this here is maybe being P of X could be
[00:39:39] this here is maybe being P of X could be equal to the probability of
[00:39:44] equal to the probability of reaching X
[00:39:47] reaching X under
[00:39:49] under policy so you might really want this you
[00:39:51] policy so you might really want this you might want to know what is the expected
[00:39:53] might want to know what is the expected reward I'm going to get under this
[00:39:54] reward I'm going to get under this policy where I know what my reward is
[00:39:56] policy where I know what my reward is for each state or I have samples of it and
[00:39:58] for each state or I have samples of it and then I have this probability
[00:39:59] then I have this probability distribution the problem is that you
[00:40:01] distribution the problem is that you don't have data from that so we but no
[00:40:06] don't have data from that so we but no data from P of X so that's the the
[00:40:09] data from P of X so that's the the general challenge we're in we want to
[00:40:11] general challenge we're in we want to see how well like our alternative policy
[00:40:13] see how well like our alternative policy would work for like helping students
[00:40:15] would work for like helping students persist but we have no data from that so
[00:40:18] persist but we have no data from that so here's the
[00:40:19] here's the trick let's multiply and divide by the
[00:40:22] trick let's multiply and divide by the same
[00:40:23] same thing I'm going to introduce a new
[00:40:26] thing I'm going to introduce a new policy and its distribution
[00:40:30] Q okay so Q of X is a different
[00:40:41] policy this is a different policy maybe
[00:40:43] policy this is a different policy maybe it's going to end up in different states
[00:40:44] it's going to end up in different states with different probabilities okay so
[00:40:47] with different probabilities okay so let's rewrite this this is going to be
[00:40:49] let's rewrite this this is going to be equal to Q of
[00:40:51] equal to Q of X times P of X over Q of X times R of X
[00:40:59] R of X okay right I haven't changed anything
[00:41:03] RX okay right I haven't changed anything yet this is exactly equal but if I have
[00:41:05] yet this is exactly equal but if I have data from Q ofx I can approximate this
[00:41:08] data from Q ofx I can approximate this expectation with
[00:41:09] expectation with samples so this is approximately equal
[00:41:12] samples so this is approximately equal to 1/n sum over i = 1 to n of x_i sampled
[00:41:19] according to Q of x
[00:41:21] of P of x_i over Q of x_i
[00:41:24] times R of x_i
[00:41:31] this is super beautiful what we've
[00:41:33] this is super beautiful what we've said here is that I really want to
[00:41:35] said here is that I really want to estimate the expectation of something
[00:41:37] estimate the expectation of something over say policy this policy P I don't
[00:41:40] over say policy this policy P I don't have any samples from P what I can do is
[00:41:42] have any samples from P what I can do is I can just take samples from my policy q
[00:41:45] I can just take samples from my policy q and I can reway
[00:41:47] and I can reway them so it says if I was you know really
[00:41:51] them so it says if I was you know really likely to take to reach a particular X
[00:41:53] likely to take to reach a particular X under policy q but less likely under
[00:41:55] under policy q but less likely under this one I'll weigh that data less if
[00:41:58] this one I'll weigh that data less if I'm much more likely to get to a state
[00:42:00] I'm much more likely to get to a state XI than I was under here I'm going to
[00:42:02] XI than I was under here I'm going to upweight those
[00:42:04] samples so this is beautiful and it's
[00:42:07] samples so this is beautiful and it's unbiased so this is an unbiased
[00:42:11] unbiased so this is an unbiased estimate and we'll extend it in a second
[00:42:14] estimate and we'll extend it in a second to think about multi-time steps but just
[00:42:16] to think about multi-time steps but just for single time step right now this is
[00:42:18] for single time step right now this is how we can do this gives us an unbiased
[00:42:20] how we can do this gives us an unbiased estimate and as we'll see shortly we can
[00:42:22] estimate and as we'll see shortly we can extend this to multi-time steps and we
[00:42:24] extend this to multi-time steps and we don't have to make a Markov assumption
[00:42:27] don't have to make a Markov assumption right so this is a really lovely
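For the single-time-step case, the estimator can be checked numerically. The two-point distributions p, q and the reward R(x) = x below are made up for the check:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two discrete "states" x in {0, 1} with reward R(x) = x.
# Target distribution p reaches x=1 often; behavior q rarely does.
p = np.array([0.2, 0.8])   # p(x): state probabilities under the target policy
q = np.array([0.7, 0.3])   # q(x): behavior distribution we actually sampled

n = 100_000
x = rng.choice(2, size=n, p=q)          # samples from q only
r = x.astype(float)                     # observed rewards R(x_i)

# Importance-sampling estimate of E_{x~p}[R(x)]:
#   (1/n) * sum_i p(x_i)/q(x_i) * R(x_i)
est = np.mean(p[x] / q[x] * r)
print(est)  # ~0.8, the true E_{x~p}[R(x)], despite never sampling from p
```

The reweighting is exactly the upweight/downweight intuition above: samples at x = 1 carry weight 0.8/0.3 because q under-visits them relative to p.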
[00:42:30] right so this is a really lovely idea so we can compute this expected
[00:42:32] idea so we can compute this expected value under an alternative
[00:42:37] distribution and it is generally an
[00:42:39] distribution and it is generally an unbiased estimator under a couple
[00:42:41] unbiased estimator under a couple assumptions the first is that the
[00:42:43] assumptions the first is that the sampling distribution Q so like our
[00:42:45] sampling distribution Q so like our alternative policy has to be greater
[00:42:48] alternative policy has to be greater than zero for all X such
[00:42:50] than zero for all X such that P of X would be greater than zero
[00:42:53] that P of X would be greater than zero what does that mean in practice that
[00:42:55] what does that mean in practice that means that if you could reach a state
[00:42:57] means that if you could reach a state under your policy you care about with a
[00:43:00] under your policy you care about with a non-zero probability so let's say I
[00:43:02] non-zero probability so let's say I don't know your student could get to
[00:43:03] don't know your student could get to this particular level with non-zero
[00:43:04] this particular level with non-zero probability under your target policy
[00:43:07] probability under your target policy then there has to be some probability
[00:43:09] then there has to be some probability you'd also get there under your um
[00:43:12] you'd also get there under your um training data
[00:43:13] training data set this is sort of reasonable right so
[00:43:15] set this is sort of reasonable right so this says that like if I want to think
[00:43:18] this says that like if I want to think about I don't know a policy that like um
[00:43:21] about I don't know a policy that like um uh recommends restaurants versus coffee
[00:43:24] uh recommends restaurants versus coffee shops I can't use that data to then
[00:43:27] shops I can't use that data to then estimate how good it would be to go to
[00:43:29] estimate how good it would be to go to the movies I just I've never done that
[00:43:31] the movies I just I've never done that there has to be for anything that we're
[00:43:33] there has to be for anything that we're trying to estimate here we have to have
[00:43:34] trying to estimate here we have to have non-zero probability for that
[00:43:37] non-zero probability for that X the second thing is a little bit more
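What goes wrong when that support condition fails can be seen directly. The numbers below are toy values, with x = 2 standing in for "going to the movies," a state the behavior data never visits:

```python
import numpy as np

rng = np.random.default_rng(0)

# Violating coverage: p puts mass on x=2, but the behavior
# distribution q never visits it, so q(2) = 0.
p = np.array([0.3, 0.3, 0.4])
q = np.array([0.5, 0.5, 0.0])
r = np.array([1.0, 2.0, 10.0])      # reward for each x (made up)

true_value = p @ r                   # 0.3*1 + 0.3*2 + 0.4*10 = 4.9

x = rng.choice(3, size=100_000, p=q)
est = np.mean(p[x] / q[x] * r[x])    # never sees x=2, so its reward is lost
print(true_value, est)               # est is biased low: ~0.9 vs the true 4.9
```

No amount of extra data fixes this; the estimator simply has no information about the uncovered state, which is why the support condition is stated as an assumption rather than something you can estimate around.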
[00:43:39] X the second thing is a little bit more subtle but comes up a lot in real
[00:43:41] subtle but comes up a lot in real empirical data which is called No hidden
[00:43:43] empirical data which is called No hidden confounding and that means that
[00:43:45] confounding and that means that essentially you have to know all of the
[00:43:47] essentially you have to know all of the features that were used to kind of um
[00:43:50] features that were used to kind of um Define this distribution so this
[00:43:53] Define this distribution so this may not seem as clear in this part but I
[00:43:55] may not seem as clear in this part but I think once we start getting into multi
[00:43:57] think once we start getting into multi steps and the sequences it becomes
[00:43:59] steps and the sequences it becomes really relevant so let me give an
[00:44:00] really relevant so let me give an example
[00:44:34] okay so imagine like a healthcare
[00:44:35] okay so imagine like a healthcare setting so if we go back to the
[00:44:37] setting so if we go back to the electronic medical record setting um we
[00:44:39] electronic medical record setting um we often are interested in what would have
[00:44:41] often are interested in what would have happened to a patient if we did a
[00:44:42] happened to a patient if we did a different action so we want to know what
[00:44:44] different action so we want to know what that counterfactual is one of the
[00:44:46] that counterfactual is one of the challenges there is that we will have
[00:44:48] challenges there is that we will have certain features that are in are in our
[00:44:50] certain features that are in are in our electronic medical record system we will
[00:44:53] electronic medical record system we will see an action like you know someone was
[00:44:54] see an action like you know someone was taken to surgery or some drug was
[00:44:56] taken to surgery or some drug was administered and then we see the outcome
[00:44:59] administered and then we see the outcome in order for importance sampling to work
[00:45:01] in order for importance sampling to work all of the features that were used to
[00:45:03] all of the features that were used to make that decision or pick that action
[00:45:05] make that decision or pick that action have to be
[00:45:07] have to be known and that's called No hidden
[00:45:10] known and that's called No hidden confounding now why is that well it
[00:45:12] confounding now why is that well it might be for example you might see that
[00:45:14] might be for example you might see that there are certain patients that um that
[00:45:17] there are certain patients that um that are sick and then a particular action is
[00:45:19] are sick and then a particular action is taken and maybe they die and you might
[00:45:21] taken and maybe they die and you might see other patients that look like they
[00:45:22] see other patients that look like they have the same features and a different
[00:45:24] have the same features and a different action is taken and they live and in
[00:45:27] action is taken and they live and in that case you might think oh maybe the
[00:45:29] that case you might think oh maybe the the decision was just bad that's
[00:45:31] the decision was just bad that's possible but it's also possible that
[00:45:33] possible but it's also possible that there are just hidden additional
[00:45:34] there are just hidden additional features that you don't have in your
[00:45:36] features that you don't have in your data and that meant that the first
[00:45:38] data and that meant that the first person was much more sick and that's why
[00:45:39] person was much more sick and that's why they got that particular treatment
[00:45:41] they got that particular treatment versus the other
[00:45:42] versus the other person so it might be that like you know
[00:45:45] person so it might be that like you know that there's important reasons that are
[00:45:48] that there's important reasons that are not part of X that are being using used
[00:45:50] not part of X that are being using used to define what the action is in the data
[00:45:53] to define what the action is in the data set excuse me and in those sorts of
[00:45:55] set excuse me and in those sorts of confounding scenarios then if you try to
[00:45:57] confounding scenarios then if you try to use importance sampling you will not get
[00:45:58] use importance sampling you will not get an unbiased
[00:46:00] an unbiased estimator this is really important and
[00:46:02] estimator this is really important and really hard in practice it comes up all
[00:46:04] really hard in practice it comes up all the time and in fact one of the things
[00:46:06] the time and in fact one of the things we were just doing on a paper we just
[00:46:07] we were just doing on a paper we just put online we are trying to think really
[00:46:10] put online we are trying to think really really carefully about whether or not
[00:46:12] really carefully about whether or not there would be additional confounding
[00:46:14] there would be additional confounding beyond the features that we had in
[00:46:16] beyond the features that we had in our data set so in that case we had done
[00:46:18] our data set so in that case we had done an experiment to see whether or not
[00:46:20] an experiment to see whether or not offering students access to GPT-4 um
[00:46:23] offering students access to GPT-4 um would increase or decrease um
[00:46:25] would increase or decrease um participation in the class and exam
[00:46:28] participation in the class and exam scores and only some people used GPT-4 a
[00:46:32] scores and only some people used GPT-4 a lot of people that were given access to
[00:46:33] lot of people that were given access to it did not use it and so an important
[00:46:36] it did not use it and so an important question there then is well is there
[00:46:37] question there then is well is there something intrinsically different about
[00:46:39] something intrinsically different about those students who are using it that
[00:46:42] those students who are using it that also would confound their their test
[00:46:45] also would confound their their test scores and so this issue of no hidden
[00:46:49] scores and so this issue of no hidden confounding comes up a lot particularly
[00:46:50] confounding comes up a lot particularly when sort of actions are optional or
[00:46:52] when sort of actions are optional or being made by
[00:46:54] being made by humans now if you're in sort of MuJoCo or
[00:46:57] humans now if you're in sort of MuJoCo or something this is easy because if you
[00:46:58] something this is easy because if you have control over the simulator you
[00:46:59] have control over the simulator you don't have to worry about it but it's
[00:47:01] don't have to worry about it but it's important to start in practice all right
[00:47:03] important to start in practice all right let's take a second and sort of check
[00:47:04] let's take a second and sort of check your
[00:47:05] your understanding so we haven't really
[00:47:07] understanding so we haven't really talked about Bandits yet um don't worry
[00:47:09] talked about Bandits yet um don't worry about exactly this we're going to be
[00:47:11] about exactly this we're going to be doing policy evaluation so let's say we
[00:47:13] doing policy evaluation so let's say we have a data set for I'll just say
[00:47:16] have a data set for I'll just say samples for
[00:47:17] samples for samples from three actions okay action
[00:48:21] samples from three actions okay action one is a Bernoulli variable where with
[00:48:23] one is a Bernoulli variable where with probability 0.02 you get a really high
[00:47:25] probability 0.02 you get a really high reward else zero
[00:47:27] reward else zero the second one you get probability with
[00:47:29] the second one you get probability with probability 0.55 you get reward of two
[00:48:31] probability 0.55 you get reward of two else zero and the third one with probability
[00:48:34] else zero and the third one with probability 0.5 get a reward of one else
[00:47:36] 0.5 get a reward of one else zero your data is going to be sampled
[00:47:39] zero your data is going to be sampled from a particular Behavior policy so
[00:47:41] from a particular Behavior policy so this
[00:48:43] this is our pi b what we've been calling a
[00:48:45] is our pi b what we've been calling a behavior
[00:48:48] policy where with probability 0.8 it
[00:47:51] policy where with probability point8 it pulls this action else it pulls action
[00:47:54] pulls this action else it pulls action two the policy we want to evaluate Pi 2
[00:47:57] two the policy we want to evaluate Pi 2 pulls action two excuse me else it pulls
[00:47:59] pulls action two excuse me else it pulls action
[00:48:01] action one this question asks you to think
[00:48:03] one this question asks you to think about what are true about the
[00:48:06] about what are true about the performance of those
[00:48:07] performance of those policies whether or not we could use the
[00:48:10] policies whether or not we could use the data from PI 1 to get an unbiased
[00:48:12] data from PI 1 to get an unbiased estimator of Pi 2 and whether or not the
[00:48:15] estimator of Pi 2 and whether or not the rewards being positive or negative might
[00:48:17] rewards being positive or negative might impact
[00:48:25] that for
[00:49:03] The third one is kind of hard and might require looking back at the equations on the previous slides. We'll wait for a second, and then...
[00:49:46] All right, why don't you turn to a neighbor and see what you got.
[00:50:16] [student discussion, largely inaudible]
[00:50:38] I actually swapped π1 and π2.
[00:51:27] Yeah, I did it in my head a bit. So we have two there, so that's that plus whatever weight we put on action one, which would be two... [works through the arithmetic with a student, partly inaudible] Yeah, so it wasn't as high in my quick calculation.
[00:52:38] You never see that action...
[00:52:59] Yes, okay, so they have to be completely overlapping.
[00:53:33] All right, let's go through this. The first one requires a couple of nested expectations, so let's go through those and make sure I get my math right. Okay, so for the first one, π1: there are two levels of stochasticity here. We have a stochastic policy, and we have stochastic rewards. So let's first figure out what the expected reward is for action one. This is equal to 0.02 times a reward of 100, which is 2, plus else you get a reward of zero. So the expected reward for action a1 is two; I'll just write that here as the expected reward for action a1. We can do the same calculation here: the expected reward for a2 is just going to be equal to 0.55 × 2, which is equal to 1.1, and the expected reward for a3 is just equal to 0.5. So generally, policies that put more weight on action one are going to be better.
[00:54:37] Now let's look at what the expected value is of π1. What π1 says is: with probability 0.8 we get the reward of a3, plus with probability 0.2 we get the reward of a2. The reward of a3 is 0.5, so it's 0.8 × 0.5 + 0.2 × 1.1, which is approximately 0.62. I'll double-check the exact math, but it's roughly that. Okay, now let's do it for π2; this is the value for π1, and we'll do the same thing for π2. This says with probability 0.5 it gets the expected reward of a2, plus with probability 0.5 it gets the reward of a1, so that's equal to 0.5 × 1.1 + 0.5 × 2, which is approximately equal to 1.55. I think I was off by two when I was chatting with some people before. [Student, partly inaudible: I got 0.6 for the first one, 0.65 for the second one.] I think the second one is going to be more than that, because the expected reward for a1 is two, and so that term has to contribute two. So I think this ends up being roughly 1.55; I can double-check my math, but I think that's right. Okay, so π2 does have truly higher rewards, so the first statement is true.
[00:56:26] rewards so this is true um the second is we can't use Pi 1 to
[00:56:28] um the second is we can't use Pi 1 to get an unbiased estimate of Pi 2 why is
[00:56:32] get an unbiased estimate of Pi 2 why is that so this is true
[00:56:36] that so this is true also why can't we use Pi 1 data from PI
[00:56:41] also why can't we use Pi 1 data from PI 1 because it never pulls it never does
[00:56:44] 1 because it never pulls it never does action one that's right so it never does
[00:56:46] action one that's right so it never does action one so it's like saying um you
[00:56:48] action one so it's like saying um you have data about all these restaurants
[00:56:49] have data about all these restaurants and then you ask it okay I also have a
[00:56:51] and then you ask it okay I also have a policy that's not going to go to this
[00:56:52] policy that's not going to go to this new restaurant and you have no data from
[00:56:54] new restaurant and you have no data from that so we can't get an unbiased
[00:56:55] that so we can't get an unbiased estimate of the average
[00:56:57] estimate of the average award okay this one's hard um uh this is
[00:57:03] award okay this one's hard um uh this is false
[00:57:07] okay it turns out that um uh you can
[00:57:11] okay it turns out that um uh you can still get an unbiased you stand you can
[00:57:14] still get an unbiased you stand you can still get a lower bound on the
[00:57:15] still get a lower bound on the performance of a policy using another
[00:57:18] performance of a policy using another policy which doesn't have complete
[00:57:20] policy which doesn't have complete overlap if the rewards are strictly po
[00:57:23] overlap if the rewards are strictly po positive
[00:57:24] positive so if the
[00:57:27] so if the rewards are always greater than equal
[00:57:29] rewards are always greater than equal zero you can do
[00:57:34] this why is this so we have a paper on
[00:57:36] this why is this so we have a paper on this from a few years ago now um just
[00:57:38] this from a few years ago now um just for the for why this happens essentially
[00:57:40] for the for why this happens essentially you can think of it is if you're
[00:57:42] you can think of it is if you're Behavior policy doesn't include some of
[00:57:44] Behavior policy doesn't include some of the actions that you want to evaluate
[00:57:47] the actions that you want to evaluate it's like putting zero Mass on those
[00:57:50] it's like putting zero Mass on those okay because if you think back to what
[00:57:52] okay because if you think back to what is happening here it's like you never
[00:57:56] is happening here it's like you never sample them right so you have zero
[00:58:00] sample them right so you have zero probability Mass on some things that you
[00:58:01] probability Mass on some things that you want to evaluate like you want to
[00:58:03] want to evaluate like you want to include a policy that sometimes
[00:58:05] include a policy that sometimes recommends movies and you never do so
[00:58:07] recommends movies and you never do so it's like putting zero Mass on that if
[00:58:09] it's like putting zero Mass on that if all your reward is positive that's
[00:58:11] all your reward is positive that's essentially just lowering your estimated
[00:58:14] essentially just lowering your estimated value okay so it turns out that if all
[00:58:17] value okay so it turns out that if all your rewards are positive you can use a
[00:58:20] your rewards are positive you can use a behavior policy that doesn't have
[00:58:21] behavior policy that doesn't have complete coverage with your target
[00:58:23] complete coverage with your target policy but it'll be a lower bound the
[00:58:26] policy but it'll be a lower bound the reason why that might be useful is
[00:58:28] reason why that might be useful is because if it's still the case that your
[00:58:30] because if it's still the case that your target new evaluation policy is better
[00:58:33] target new evaluation policy is better than your behavior policy even though it
[00:58:35] than your behavior policy even though it might not have full coverage you may
[00:58:37] might not have full coverage you may still want to use it so it's like oh it
[00:58:39] still want to use it so it's like oh it doesn't matter whether those
[00:58:40] doesn't matter whether those recommendations it makes for like those
[00:58:41] recommendations it makes for like those new movies is good or not it's already a
[00:58:43] new movies is good or not it's already a better policy so we can do that okay
[00:58:48] Great. All right, so it turns out that we can also do this for RL policy evaluation; I just showed you a much simpler setting of it. I'll highlight too that importance sampling, like many things in stats and math, goes by many different names. You'll often see it called inverse propensity weighting: if you take econ classes, people often refer to these as IPW, or inverse propensity weighting. When I learned about them, I learned about them as importance sampling. The terminology often also depends on whether you're using these to design ways to gather data or whether you have historical data.
[00:59:27] whether you have historical data okay let's see how we can do this for
[00:59:29] let's see how we can do this for reinforcement learning so in
[00:59:30] reinforcement learning so in reinforcement learning we can do exactly
[00:59:32] reinforcement learning we can do exactly the same thing so I have what I want to
[00:59:35] the same thing so I have what I want to have these are now my
[00:59:37] have these are now my trajectories and as we've seen before we
[00:59:40] trajectories and as we've seen before we can think of the value of a policy is
[00:59:41] can think of the value of a policy is just being an expectation over all the
[00:59:45] just being an expectation over all the trajectories that could be generated by
[00:59:46] trajectories that could be generated by that policy from initial start State
[00:59:49] that policy from initial start State times the reward of those trajectories
[00:59:51] times the reward of those trajectories so this is the
[00:59:54] so this is the reward of a
[00:59:56] reward of a trajectory tow this is the
[01:00:00] trajectory tow this is the probability of a trajectory under the
[01:00:03] probability of a trajectory under the the desired
[01:00:04] the desired policy so what we can do in this case is
[01:00:07] policy so what we can do in this case is the following we can just multiply and
[01:00:09] the following we can just multiply and divide by the same thing like what we
[01:00:10] divide by the same thing like what we saw before so we're going to imagine
[01:00:13] saw before so we're going to imagine that we have data from a different
[01:00:18] policy so I'm going to call this pi
[01:00:24] b so we've now introduce my behavior
[01:00:29] b so we've now introduce my behavior policy
[01:00:30] policy okay so I'm just going to rewrite that
[01:00:38] so I again just have this
[01:00:42] so I again just have this weight this is just reweighing what's
[01:00:44] weight this is just reweighing what's the probability of me getting a
[01:00:46] the probability of me getting a particular trajectory under my behavior
[01:00:48] particular trajectory under my behavior policy versus my target
[01:00:53] policy okay so we have that here
[01:00:57] policy okay so we have that here and let's let me write it out here first
[01:00:59] and let's let me write it out here first okay so now we know let me put this from
[01:01:03] okay so now we know let me put this from here we know from before that if we have
[01:01:06] here we know from before that if we have samples from our Behavior policy we can
[01:01:09] samples from our Behavior policy we can approximate this expectation by a
[01:01:10] approximate this expectation by a sampled expectation and we reweight
[01:01:12] sampled expectation and we reweight these the next thing is to make sure
[01:01:14] these the next thing is to make sure that we can compute what's the
[01:01:15] that we can compute what's the probability of a trajectory under our
[01:01:18] probability of a trajectory under our Target policy versus our evaluation
[01:01:20] Target policy versus our evaluation policy and we've seen things like this
[01:01:22] policy and we've seen things like this before so just remember what we can do
[01:01:24] before so just remember what we can do in this case is that the probability of
[01:01:26] in this case is that the probability of a
[01:01:27] a trajectory given a policy and action is
[01:01:31] trajectory given a policy and action is equal to the product over = 1 to the
[01:01:35] equal to the product over = 1 to the length of the
[01:01:37] length of the trajectory the probability the
[01:01:39] trajectory the probability the transition
[01:01:42] transition probability I'm just going to write it
[01:01:44] probability I'm just going to write it as a deterministic policy for Simplicity
[01:01:47] as a deterministic policy for Simplicity deterministic for
[01:01:50] deterministic for Simplicity but you can extend all of
[01:01:52] Simplicity but you can extend all of this times the probability that you
[01:01:55] this times the probability that you would take that action s
[01:02:01] oops
[01:02:03] oops actually yeah I'll rewrite that you
[01:02:06] actually yeah I'll rewrite that you don't want it I don't want it to be
[01:02:08] don't want it I don't want it to be deterministic that will be
[01:02:10] deterministic that will be misleading okay put
[01:02:23] here okay all right so this is just the
[01:02:28] here okay all right so this is just the probability of us taking the action
[01:02:30] probability of us taking the action given the state under our policy and the
[01:02:31] given the state under our policy and the transition probability for every single
[01:02:33] transition probability for every single time
[01:02:34] time step the nice thing that we can see in
[01:02:36] step the nice thing that we can see in this case we can write that out for both
[01:02:38] this case we can write that out for both the behavior policy and the um the
[01:02:41] the behavior policy and the um the target policy and as we've seen in some
[01:02:43] target policy and as we've seen in some other cases this will cancel so you
[01:02:46] other cases this will cancel so you don't need to know the Dynamics
[01:02:48] don't need to know the Dynamics model so this is beautiful and
[01:02:51] model so this is beautiful and Incredibly helpful under similar
[01:02:53] Incredibly helpful under similar conditions to what we just saw as long
[01:02:54] conditions to what we just saw as long as you have coverage which which means
[01:02:56] as you have coverage which which means that you will visit the same sort of
[01:02:57] that you will visit the same sort of trajectories maybe with differing
[01:02:59] trajectories maybe with differing probabilities all you have to do is
[01:03:01] probabilities all you have to do is reweight them so they look more like um
[01:03:04] reweight them so they look more like um the policy that you want to evaluate and
[01:03:06] the policy that you want to evaluate and we assume that you know this cuz this is
[01:03:08] we assume that you know this cuz this is just the um your policy probability just
[01:03:11] just the um your policy probability just says what action would you take in this
[01:03:12] says what action would you take in this state and so this is known if you're
[01:03:14] state and so this is known if you're doing policy
[01:03:15] doing policy evaluation okay so it's first introduced
[01:03:18] evaluation okay so it's first introduced for RL to my knowledge by doing a preup
[01:03:20] for RL to my knowledge by doing a preup Richard Sutton and um sender Singh in
[01:03:22] Richard Sutton and um sender Singh in 2000 and there's been a lot of follow-up
[01:03:25] 2000 and there's been a lot of follow-up work in Leverage of this it's super
[01:03:28] work in Leverage of this it's super helpful we don't need the markof
[01:03:29] helpful we don't need the markof Assumption or
[01:03:31] Assumption or anything okay requires very mle
[01:03:33] anything okay requires very mle assumptions it's unbiased um and it
[01:03:36] assumptions it's unbiased um and it corrects for distribution mismatch so
[01:03:38] corrects for distribution mismatch so extremely
[01:03:39] extremely helpful I won't do this now but you
[01:03:41] helpful I won't do this now but you might want to look through this later
[01:03:42] might want to look through this later just to think about given everything you
[01:03:44] just to think about given everything you know about um Monte Carlo methods Etc
[01:03:47] know about um Monte Carlo methods Etc like what might be some of the
[01:03:48] like what might be some of the limitations of doing
[01:03:51] I'll just briefly say there's been a whole bunch of extensions. One is called per-decision importance sampling. Similar to policy gradients, we use the fact that decisions made later can't affect earlier rewards, so you can reduce the variance by being a little more strategic about where you put your weights; we saw similar ideas in policy gradients. This is called per-decision importance sampling, and it helps to get better properties, particularly for long sequences.
[01:04:27] In general the variance is pretty high, as for most Monte Carlo methods. One thing to know is that there are concentration inequalities, like the Hoeffding inequality, that you can use if you want to start getting confidence intervals over these values; they generally scale with the range of the variable, and this can start to be pretty terrible for long horizons with importance sampling. I'll post afterwards what the solutions are for both of these check-your-understandings, but it's pretty informative to think about exactly how bad this can become.
[01:05:01] Okay, to deal with this there are a lot of different extensions. One is that, if you do have Markov structure, you can think about state distributions instead of trajectories, and that can be very helpful; there's been a bunch of work in that direction. One line of work that we and others have done takes ideas from statistics on doubly robust estimation and uses them to reduce the variance in these methods, as well as trying to blend between methods that make a Markov assumption and methods that don't.
[01:05:32] Right, I want to finish now by talking a bit about how we can use these ideas and others to think about offline policy learning. I think there are a couple of important ideas we went through so far today. One is that you can just build a simulator from historical data and use that to learn, but it may be biased, and that bias may be substantial when you're trying to use it to pick policies. We can do model-free methods, but we're going to want to be careful about those, and we're going to see more of that later. And you can use importance sampling to get an unbiased estimate, but it might be high variance. Now we're going to think about those sorts of ideas in the context of actually trying to pick a policy and do optimization.
[01:06:13] So I'm going to go back to this issue of coverage, because I think it's important to emphasize. Let's imagine that you have antibiotics, mechanical ventilation, and a vasopressor, all things that are often used in an intensive care unit, and you might have different probabilities of these interventions. Let's say you want to evaluate a policy that frequently does mechanical ventilation. As we've been talking about, your data has to support the policy you want to evaluate. So if this is your behavior policy, that works, because every single action you want to try has a non-zero probability of appearing in the data. If you have this other behavior policy, that doesn't work; that's the same as the example we saw before. If you never use a vasopressor in your behavior data, you cannot evaluate how good that would be in the future.
[01:07:02] Now, when I draw it like this, or in the example that I gave in the check your understanding, it's pretty obvious, because there's a finite number of actions and it's pretty clear that if we never took the vasopressor action, we can't evaluate it. But in real data sets it often gets really hard to understand what it means to have sufficient coverage. In general this is going to be hard, because we're going to want to say: well, if the probability is zero, I definitely can't do it, but if it's some tiny value, is that sufficient? If one in a million times I use a vasopressor, is that going to be okay? Does the action have to appear in my actual data set, or does there just have to have been a chance of me doing it? All these issues of exactly how much data support you need come up.
[01:07:44] how much data support you need come up all right so up to around 2020 um
[01:07:49] up all right so up to around 2020 um most of the methods for join off policy
[01:07:51] most of the methods for join off policy evaluation kind of model based or model
[01:07:53] evaluation kind of model based or model free assumed overlap so so um if you're
[01:07:56] free assumed overlap so so um if you're doing off policy estimation it means for
[01:07:58] doing off policy estimation it means for your policy of interest but for off
[01:08:00] your policy of interest but for off policy optimization it often assumed all
[01:08:02] policy optimization it often assumed all policies so every single policy you
[01:08:05] policies so every single policy you could imagine in your domain had to have
[01:08:07] could imagine in your domain had to have coverage with your behavior policy now
[01:08:10] coverage with your behavior policy now if your behavior policy is random that's
[01:08:12] if your behavior policy is random that's fine but if your behavior policy is say
[01:08:15] fine but if your behavior policy is say like how Physicians operate or how
[01:08:17] like how Physicians operate or how teachers operate or some sort of you
[01:08:18] teachers operate or some sort of you know policy that's not completely random
[01:08:21] know policy that's not completely random that wouldn't always be be um
[01:08:24] that wouldn't always be be um satisfied and in general many many real
[01:08:26] satisfied and in general many many real data sets don't involve complete random
[01:08:29] data sets don't involve complete random exploration and this means if you assume
[01:08:31] exploration and this means if you assume this and use these methods and it's not
[01:08:33] this and use these methods and it's not true then you might end up sort of going
[01:08:36] true then you might end up sort of going into parts of the domain or ending up
[01:08:38] into parts of the domain or ending up taking policies that go into parts of
[01:08:40] taking policies that go into parts of the domain where you have very little
[01:08:44] coverage so I'm going to introduce an
[01:08:46] coverage so I'm going to introduce an idea and it turns out this idea um there
[01:08:48] idea and it turns out this idea um there was a number of groups that all started
[01:08:50] was a number of groups that all started thinking about this at the same time and
[01:08:51] thinking about this at the same time and I'll cite a few others of them in a
[01:08:53] I'll cite a few others of them in a minute um we s we call that's doing the
[01:08:56] minute um we s we call that's doing the best with what you got so the idea was
[01:08:58] best with what you got so the idea was how can we leverage data sets where we
[01:09:00] how can we leverage data sets where we only have partial coverage like we still
[01:09:02] only have partial coverage like we still want to do as well as we can but within
[01:09:04] want to do as well as we can but within the support of the
[01:09:06] the support of the data and this is similar to kind of the
[01:09:09] data and this is similar to kind of the K constraint or po clipping that we've
[01:09:10] K constraint or po clipping that we've seen before but this is all going to be
[01:09:12] seen before but this is all going to be entirely in the offline case where we
[01:09:13] entirely in the offline case where we don't manage to get any additional data
[01:09:16] don't manage to get any additional data and the key idea that we're going to
[01:09:17] and the key idea that we're going to think about here is just being
[01:09:19] think about here is just being pessimistic so that when we don't think
[01:09:22] pessimistic so that when we don't think we have sufficient coverage or we or we
[01:09:23] we have sufficient coverage or we or we have high uncertainty over what the
[01:09:25] have high uncertainty over what the reward might be in a particular state or
[01:09:27] reward might be in a particular state or action we want to be pessimistic with
[01:09:29] action we want to be pessimistic with respect to that
[01:08:33] I want to highlight that even when our paper came out, there was increasing interest in offline RL, but what we noted is that there were still quite a few challenges, and I just want to illustrate that with a really simple example. This is known as the chain MDP, and we might talk about it more when we talk about data-efficient exploration. This is not exactly the same as all the chain MDPs; there are a number of them, and they're used to illustrate the hardness of learning good policies. The idea in this setting is that you have an initial start state s0, and then under one policy mu you have some probability of going to S1, S2, etc., and with another probability you transition to S10. It's a really, really small MDP, just a very small number of states. The important thing to note here is that all of these states have deterministic reward except for one. In reality, S10 has an expected reward of 0.8, and you always get 0.8 when you get to that state, while S9 has an expected reward of 0.5, so it's a worse state. But if you go there, because of stochasticity, some of the time you'll get a 1, which means that when you have finite data you might think that state S9 is better than S10. That will just happen with your data.
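To see how finite data can make the stochastic state S9 look better than the deterministic state S10, here is a small simulation sketch. The Bernoulli(0.5) reward at S9 and the sample sizes are illustrative assumptions, not the exact construction from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def misleading_fraction(n_samples, n_trials=10_000):
    """Fraction of small data sets in which S9's empirical mean reward
    beats S10's true reward of 0.8, even though S9 is worse in expectation.

    S10: deterministic reward 0.8.
    S9:  assumed Bernoulli(0.5) reward, so E[r] = 0.5 but individual
         samples can be 1, which can mislead a naive estimate.
    """
    s9_samples = rng.binomial(1, 0.5, size=(n_trials, n_samples))
    return float(np.mean(s9_samples.mean(axis=1) > 0.8))

# With only 3 visits to S9, its empirical mean exceeds 0.8 whenever all
# three rewards are 1, which happens with probability 1/8 = 0.125.
```

So roughly one in eight small data sets would rank S9 above S10, and the confusion only disappears once S9 has been visited many times.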
[01:11:03] So what we could show in this case is that a bunch of the other algorithms for doing conservative offline RL (I won't go through all of them, but I'm happy to talk about them offline) have this weird behavior. This axis is your behavior data set: as you increase the amount of your behavior data, you would hope in general that you get a better and better estimate of a new policy, and actually better and better performance; this is the success rate. And we found this weird behavior where a lot of the other algorithms would start off and learn the optimal policy for this domain, which is to go to S10, but then as you got more data from your behavior data set they would get misled, because sometimes you would have seen S9 and it would have given you a 1, and if they saw that they would say, oh no, I don't want to go to S10, I want to go to S9 instead. So with intermediate amounts of data these other methods would get confused and learn a bad policy, and it was only as you started to get a lot more data that they would end up getting back to realizing what the best policy was. And that was somewhat concerning, right, because you would generally hope that you get some sort of monotonic improvement as you get more and more data from your behavior data set, but here we were seeing that some of the previous methods had this sort of unfortunate behavior. And it turned out it didn't just happen for these particular examples: we could show some other types of examples where we got very similar types of performance challenges for other methods.
[01:12:29] So the key idea is pretty simple, which is: just be pessimistic if you haven't seen a state-action pair very much. We defined a filtration function, which is just a simple threshold that says: let me check, for this state-action pair, roughly what my density is, how much I've seen it, and if it's greater than a threshold, this is going to be one. What this threshold is doing is trying to account for the statistical uncertainty you have when you have finite amounts of data. So if you haven't seen things very much, this is going to become zero; if you've seen things a lot, it's going to be one. That's all we're doing. And then we can just combine this with Bellman backups. So just like your DQN or Bellman operator, we can apply it so that when we are looking at your reward plus gamma times your expected discounted sum of rewards, we look at the states you might get into, and if we don't have very much data for those states, then this whole thing becomes zero. And so it's like saying: if I transition to a next state for which I don't have much data, I just pretend its reward is zero and then back up from there, which essentially means I don't want to take actions that transition to states for which I don't have enough data. Just pessimistic. And it's going to be a lower bound: if your rewards are all bounded below by zero, it's just going to be a lower bound on your potential reward. Okay, so since we assume that our rewards are all positive, and you can always just shift them to make that true, this is going to become a pessimistic estimate for all of those values.
[01:14:03] And you can do this for either policy evaluation or policy optimization: you can use this in policy-gradient-type approaches or in Q-learning-type methods. And it turns out this helps a lot. We call this marginalized behavior-supported policy optimization. I'll just highlight the theory, because one of the key things of this paper was the theory that we showed with it. As I said, a lot of the previous methods had to make assumptions over coverage, that your data covered any possible policy that you might want to evaluate, and under that you could ensure that the policy you learn is close to optimal. Ours does not make that guarantee. It only says: let's think about all policies that we could reasonably evaluate, the ones that have enough coverage; we are guaranteed to find the best policy within that class. And, I'll skip through this now due to time, but under some assumptions we can also give these kinds of finite-sample guarantees, similar to what we saw for fitted Q evaluation.
[01:15:11] the fitted Q evaluation okay all right and I'll just highlight
[01:15:13] okay all right and I'll just highlight that those do include the function
[01:15:15] that those do include the function approximation so these aren't for
[01:15:17] approximation so these aren't for tabular okay so this is what's pretty
[01:15:20] tabular okay so this is what's pretty cool to see so this in this case this is
[01:15:22] cool to see so this in this case this is Hopper um this is the behavior policy
[01:15:25] Hopper um this is the behavior policy this is the behavior policy used to
[01:15:26] this is the behavior policy used to gather the data what you can see here is
[01:15:28] gather the data what you can see here is if you use
[01:15:29] if you use ddpg um that actually does worse than
[01:15:33] ddpg um that actually does worse than the behavior policy if you use Behavior
[01:15:35] the behavior policy if you use Behavior cloning it was um a little bit better
[01:15:37] cloning it was um a little bit better about the same used a particular
[01:15:39] about the same used a particular vae we then compared to bcq which I
[01:15:42] vae we then compared to bcq which I mentioned uh briefly before uh Scott
[01:15:44] mentioned uh briefly before uh Scott fujimoto's work and you can see that
[01:15:47] fujimoto's work and you can see that that and our approach in green both do
[01:15:49] that and our approach in green both do substantially better again highlighted
[01:15:51] substantially better again highlighted in some of these cases the data does
[01:15:52] in some of these cases the data does support you learning a much better
[01:15:54] support you learning a much better policy and you should do you should try
[01:15:57] policy and you should do you should try to uncover that by using these methods
[01:15:58] to uncover that by using these methods that sort of explicitly think about your
[01:16:03] uncertainty now um I'll skip this just
[01:16:05] uncertainty now um I'll skip this just due to time there's some interesting
[01:16:07] due to time there's some interesting sort of theoretical reasons why model
[01:16:09] sort of theoretical reasons why model base might be even better at the same
[01:16:12] base might be even better at the same time um there were three papers that all
[01:16:14] time um there were three papers that all came out but nurs ours was one of them
[01:16:16] came out but nurs ours was one of them the same year with all basically very
[01:16:17] the same year with all basically very related ideas ours was a model free
[01:16:20] related ideas ours was a model free based approach and work by some of my
[01:16:22] based approach and work by some of my colleagues uh Chelsea Finn and uh t Yuma
[01:16:25] colleagues uh Chelsea Finn and uh t Yuma and others um learned a model- based
[01:16:28] and others um learned a model- based approach where they penalized model
[01:16:29] approach where they penalized model uncertainty during
[01:16:31] uncertainty during planning and they had some very nice
[01:16:32] planning and they had some very nice results in D4L cases ours was a bit more
[01:16:35] results in D4L cases ours was a bit more theoretical and model-free theirs was a
[01:16:36] theoretical and model-free theirs was a little more algorithmic um and empirical
[01:16:39] little more algorithmic um and empirical and also had some really nice um and was
[01:16:41] and also had some really nice um and was focused on model based
[01:16:44] focused on model based approaches I'll just highlight that um
[01:16:46] approaches I'll just highlight that um another method that came out similarly
[01:16:48] another method that came out similarly around the same time was conservative Q
[01:16:50] around the same time was conservative Q learning and that has also continued to
[01:16:52] learning and that has also continued to be very popular since so that's another
[01:16:55] be very popular since so that's another another way to think about sort of being
[01:16:58] another way to think about sort of being conservative we almost out of time so I
[01:17:00] conservative we almost out of time so I just wanted to do sort of share kind of
[01:17:02] just wanted to do sort of share kind of how do these different approaches
[01:17:04] how do these different approaches compare pessimistic approaches in
[01:17:06] compare pessimistic approaches in general um do better than Alternatives
[01:17:10] general um do better than Alternatives all of these have some form of um
[01:17:12] all of these have some form of um pessimism these are modelbased this is
[01:17:15] pessimism these are modelbased this is the be the sort of behavior constraint Q
[01:17:17] the be the sort of behavior constraint Q learning some nice work bear from um
[01:17:20] learning some nice work bear from um Sergey lavine's group from Berkeley and
[01:17:22] Sergey lavine's group from Berkeley and cql which is also from Berkeley
[01:17:25] cql which is also from Berkeley um the different methods tend to do
[01:17:27] um the different methods tend to do better or worse in different settings
[01:17:30] better or worse in different settings um I think that in general the key thing
[01:17:33] um I think that in general the key thing to understand from this part is that it
[01:17:35] to understand from this part is that it really can be be beneficial to think
[01:17:37] really can be be beneficial to think explicitly about uncertainty and use
[01:17:39] explicitly about uncertainty and use that to sort of penalize and constrain
[01:17:41] that to sort of penalize and constrain your function to be in the parts of the
[01:17:44] your function to be in the parts of the domain where um you have support and
[01:17:47] domain where um you have support and again this is pretty similar should
[01:17:49] again this is pretty similar should definitely make you think back to sort
[01:17:50] definitely make you think back to sort of
[01:17:53] Po um instead of having
[01:17:58] constrained
[01:18:01] constrained updates so many of these different
[01:18:03] updates so many of these different settings we're really trying to think
[01:18:04] settings we're really trying to think explicitly about um sort of coverage and
[01:18:08] explicitly about um sort of coverage and how far we can use the existing data we
[01:18:09] how far we can use the existing data we have but particularly here where we
[01:18:12] have but particularly here where we don't assume you don't get any
[01:18:13] don't assume you don't get any additional data you're just going to
[01:18:14] additional data you're just going to deploy a policy at the end we want to
[01:18:15] deploy a policy at the end we want to think about uh you know exactly how much
[01:18:18] think about uh you know exactly how much support we
[01:18:19] support we have okay all right I will skip the last
[01:18:23] have okay all right I will skip the last part because we're going to be out of
[01:18:24] part because we're going to be out of time if you're interested um just want
[01:18:26] time if you're interested um just want to highlight that you can extend these
[01:18:29] to highlight that you can extend these ideas to think about there being
[01:18:31] ideas to think about there being constraints so we had a science paper a
[01:18:33] constraints so we had a science paper a few years ago thinking about what if you
[01:18:35] few years ago thinking about what if you want to make sure your performance is
[01:18:36] want to make sure your performance is improving compared to
[01:18:37] improving compared to baselines and in particular we used um
[01:18:41] baselines and in particular we used um like a diabetes insulin management
[01:18:43] like a diabetes insulin management simulator it's a really cool simulator
[01:18:45] simulator it's a really cool simulator is improved by the FDA to replace early
[01:18:47] is improved by the FDA to replace early stage animal trials and you can learn
[01:18:49] stage animal trials and you can learn new ways to do insulin um delivery and
[01:18:52] new ways to do insulin um delivery and what we wanted to illustrate in this
[01:18:54] what we wanted to illustrate in this case is that by thinking explicitly
[01:18:56] case is that by thinking explicitly about your uncertainty over the
[01:18:58] about your uncertainty over the performance of new decision policies you
[01:19:00] performance of new decision policies you could quickly learn a policy that you
[01:19:02] could quickly learn a policy that you were confident would be better than the
[01:19:04] were confident would be better than the existing policy so just highlight that
[01:19:07] existing policy so just highlight that to say that there are lots of cases
[01:19:08] to say that there are lots of cases where you'd like to do this offline
[01:19:09] where you'd like to do this offline policy learning but do so in a way where
[01:19:11] policy learning but do so in a way where you have safety constraints or
[01:19:13] you have safety constraints or constraints over the
[01:19:14] constraints over the performance all right let me just
[01:19:14] All right, let me just summarize this part. In terms of things that you should know or be able to do: you should be able to define and apply importance sampling for policy evaluation and understand some of the limitations of these prior works. You should understand why offline RL might be able to outperform imitation learning, you should know this idea of pessimism under uncertainty, and you should be able to name some application areas where you might want to be doing offline RL or offline policy evaluation, particularly the kind of high-risk settings where that can be important. What we'll be doing next is starting to talk about how, if we can gather our data, we should gather it in order to really efficiently learn policies. I'll see you on Wednesday.
Lecture 011
Stanford CS234 Reinforcement Learning I Exploration 1 I 2024 I Lecture 11
Source: https://www.youtube.com/watch?v=sqYii3nd78w
---
Transcript
[00:00:06] Hi everybody, welcome back. We're going to start to talk about fast, or data-efficient, reinforcement learning. Before we do that, we're going to start with a refresher knowledge check.
[00:00:35] One of the things there's fairly good evidence about in terms of learning is that spaced repetition is helpful, so I'll try to periodically bring up ideas that came up earlier in the quarter when we do these refresh-your-understanding checks.
[00:01:53] All right, turn to a neighbor and see if you got the same answer.
[00:03:16] All right, so we'll go ahead and get started. For those of you that just came in, feel free to vote. The first one asks whether importance sampling leverages the Markov assumption; it does not need it, so that statement is false. Importance sampling can work with non-Markov systems, which is one of its benefits: it makes very few assumptions about the data-generating process. So, similar to how we could use Monte Carlo methods to estimate the value of a policy through rollouts, it's also very general.
[00:03:42] estimate the value of a policy through rollouts this also is very general um
[00:03:46] rollouts this also is very general um let's go through the next one as
[00:03:48] let's go through the next one as well so let not use the mark for this
[00:03:52] well so let not use the mark for this one um the first one is
[00:03:56] one um the first one is true so we can think of using the
[00:03:58] true so we can think of using the advantage function of one policy samples
[00:04:00] advantage function of one policy samples from the other um the second is that we
[00:04:03] from the other um the second is that we can sort of importance weight between
[00:04:05] can sort of importance weight between the two policies and get the samples
[00:04:07] the two policies and get the samples from policy one so it's not really an
[00:04:10] from policy one so it's not really an exact bound but it turns out we can
[00:04:12] exact bound but it turns out we can bound how off that is the reason it's
[00:04:14] bound how off that is the reason it's not exact is because we're using samples
[00:04:16] not exact is because we're using samples of States from one policy whereas in
[00:04:19] of States from one policy whereas in reality the other the other policy might
[00:04:21] reality the other the other policy might visit different types of
[00:04:23] visit different types of States um and po uses these types of
[00:04:28] States um and po uses these types of ideas um
[00:04:30] ideas um and the approximation error is bounded
[00:04:32] and the approximation error is bounded by the average over the states visited
[00:04:34] by the average over the states visited by one policy between the two policies
[00:04:37] by one policy between the two policies so this is trying to say EXA sort of how
[00:04:39] so this is trying to say EXA sort of how bad is this approximation when we use
[00:04:41] bad is this approximation when we use just samples from one
[00:04:44] just samples from one policy okay and so that was one of the
[00:04:46] policy okay and so that was one of the really nice insights of that um prior
[00:04:48] really nice insights of that um prior work is that to show you actually could
[00:04:50] work is that to show you actually could bound what is the error in the like the
[00:04:52] bound what is the error in the like the approximation error that we induce by
[00:04:54] approximation error that we induce by pretending that we'd get to the same
[00:04:56] pretending that we'd get to the same States under policy 2 compared to policy
[00:04:58] States under policy 2 compared to policy one
[00:05:01] Awesome. So last time we talked a bit about learning from prior data, and really in the last few lectures we've been talking about how to learn from human feedback, or from prior demonstrations of people, or from historical data we have. Now we're going to switch, and we're going to think more about: what if we can actually gather that data? Of course, that's where we started at the beginning; quite early on we thought about how to evaluate policies if we could gather data, but we didn't think a lot about how that data was gathered. We talked about epsilon-greedy, and we'll talk more about epsilon-greedy today, but we didn't think super strategically about the influence of the way we were gathering data. So for the next few lectures we're going to talk about that a lot, and that's really a critical part, particularly for online reinforcement learning: how do we actually gather the data we need in order to learn to make good decisions, and can we do this? Are there better or worse ways to do this?
[00:05:55] this so one of the things I want to
[00:05:57] So one of the things I want to emphasize when we start thinking about this part of the course is that a lot of reinforcement learning, particularly if you have simulated environments, focuses on computational efficiency. Think about any place where you have a simulator, so if we want to do something like Atari or Go. In these sorts of cases computational time is essentially the same as data, because you can either use that additional computational time to sample from your simulator or spend more time computing a value function or policy. So to some extent simulators blend the difference between computational efficiency and data efficiency, because it's all just computation: you have a simulator, and it can either give you data, or you can use it to do Bellman backups or whatever else you want, but you can just count how much total resources you're using, essentially, in terms of computation. There are a lot of other domains where computation is really separate from samples, from actual data.
[00:07:00] actual data so this is data and these are a lot of the
[00:07:02] data and these are a lot of the application areas that I tend to think
[00:07:04] application areas that I tend to think about and a number of other people think
[00:07:05] about and a number of other people think about as well so if you think about
[00:07:06] about as well so if you think about something like um uh using mobile
[00:07:10] something like um uh using mobile phones for health
[00:07:15] interventions or if you think about
[00:07:17] interventions or if you think about consumer
[00:07:19] consumer marketing like which ad to show to
[00:07:22] marketing like which ad to show to people or you think about educational
[00:07:28] technology or you think think about
[00:07:32] technology or you think think about climate oh do like s like
[00:07:40] environmental in all of these cases
[00:07:42] environmental in all of these cases there's sort of a real world that's
[00:07:43] there's sort of a real world that's happening out there there's like real
[00:07:45] happening out there there's like real students or there's real patients or you
[00:07:47] students or there's real patients or you know there's like where you're trying to
[00:07:49] know there's like where you're trying to decide um uh say policies to encourage
[00:07:52] decide um uh say policies to encourage like wildlife conservation or others and
[00:07:55] like wildlife conservation or others and so you have a computers you can use to
[00:07:57] so you have a computers you can use to compute that policy and then you have
[00:07:59] compute that policy and then you have real World data and the real world data
[00:08:01] real World data and the real world data I'll call samples or You Know sample
[00:08:03] I'll call samples or You Know sample efficiency here and you care often a lot
[00:08:05] efficiency here and you care often a lot about that real world data and how sort
[00:08:07] about that real world data and how sort of squeezing the most you can out of it
[00:08:09] of squeezing the most you can out of it so in particular you might imagine that
[00:08:11] so in particular you might imagine that like if you have say data from um you
[00:08:13] like if you have say data from um you know 500,000 patients or something like
[00:08:15] know 500,000 patients or something like that that's quite a lot bless you but
[00:08:17] that that's quite a lot bless you but it's not nearly as large as what you
[00:08:19] it's not nearly as large as what you would normally have in the case of say
[00:08:20] would normally have in the case of say Atari where you can just run the
[00:08:22] Atari where you can just run the simulator forever um or in the case of
[00:08:24] simulator forever um or in the case of things like Alpha go where again you
[00:08:26] things like Alpha go where again you could just play um the board game go
[00:08:28] could just play um the board game go against each you know again simulated
[00:08:30] against each you know again simulated agents forever so a lot of the things
[00:08:33] agents forever so a lot of the things we're going to be talking about in the
[00:08:34] we're going to be talking about in the next few lectures just going to assume
[00:08:36] next few lectures just going to assume that we care about this because we can't
[00:08:37] that we care about this because we can't get infinite data so we have to be we're
[00:08:39] get infinite data so we have to be we're thinking about cases where like these
[00:08:41] thinking about cases where like these are coming from patients or they're
[00:08:42] are coming from patients or they're coming from students and so we want to
[00:08:44] coming from students and so we want to be much more careful with the data we're
[00:08:46] be much more careful with the data we're Gathering and think about how we could s
[00:08:48] Gathering and think about how we could s maximize the information we get out of
[00:08:50] maximize the information we get out of those to try to make good
[00:08:52] those to try to make good decisions so when we start to do that
[00:08:55] So when we start to do that, there are a number of different things we want to consider in terms of how good the different algorithms we're going to consider are. One might be whether the algorithm converges at all: we've seen before, from the deadly triad, that we're not always guaranteed to converge, so we've seen settings where this is not guaranteed, or it hasn't been proven yet. You're not even necessarily guaranteed to converge to anything: it might chatter, it might oscillate, it might not go to anything stable. A second question you might ask is whether you're guaranteed to converge to the optimal policy. And a third thing that might be really important is how quickly, which in this case means how much data. We're going to see different types of measures to evaluate different reinforcement learning algorithms.
[00:09:49] So let me just give you an illustration of why these things might look quite different. Imagine that you have a plot where this axis is time and this axis is reward. Okay, so you could have really different algorithms: you could have an algorithm that looks like this, that might be one algorithm, or you could have an algorithm that looks like this, really smooth. You could have algorithms that in general, most of the time, do great but periodically make terrible mistakes, versus another algorithm which never does awesome but is always pretty good. Those are really different types of behavior. So if you think about that in terms of, say, an AI clinician, you could have an AI clinician that on average helps you get, let's say, 80% of your desired outcomes, like it helps you manage your blood pressure with 80% fidelity; or it could be that for 80% of the population it helps them completely manage their blood pressure, and for 20% of them it fails. Those are really different types of performance guarantees, and we'll think about trading off between those and what sorts of algorithms guarantee us to have different sorts of performance.
[00:11:08] So we'll start to introduce different types of settings and ways to evaluate the quality of algorithms, and we're going to start with bandits. We've talked very briefly about bandits in the context of ChatGPT and preference learning; we'll talk a lot more about them now, and then we'll move back into the Markov decision process case. A lot of the ideas from bandits will turn out to translate quite easily over to the RL setting.
[00:11:36] Okay, all right, so let's dive in. What is a bandit? A bandit is a really, really simple RL problem. They've been studied since, I think, at least around the 1920s; there's a very long history of research on multi-armed bandits, and they've been used for all sorts of application areas. So let's describe what it is. The idea in this case is that there are no states, just a finite set of arms, and arms are the same as what we've been calling actions before. As a concrete example, you might think of there being 20 different ads you could show customers. We're going to assume that there's a probability distribution over rewards for each arm: maybe, on average, this ad gives you a 90% click-through rate and this other ad gives you a 20% click-through rate, but that's not known, not observed. What will happen is that at each time step you get to select one of the actions, and then the environment will sample a reward from that stochastic variable. So if the click-through rate is 90% for that particular arm, most of the time you'll get a one, and sometimes people won't click on it. The goal is to maximize your cumulative reward, so that over all time steps you get the most, say, clicks. This is a very simple setting, but it's been used extensively in a lot of areas. You could think about this for something like a clinical trial: how might I randomize the next person over what treatment to get, treatment or control, for example? And for ads, and for many, many different types of application areas.
[00:13:11] So I'm going to have some running examples for this part of the course, and we're going to have a sort of silly one that's going to be illustrative. Let's imagine that we're trying to treat patients with broken toes. This has nothing to do with real medical practice, so this is not medical advice. Imagine you have three different options: you could do surgery, you could buddy-tape the broken toe to another toe, or you could do nothing. And your outcome measure is a binary variable: whether or not that toe is healed after six weeks. So that's our setting: we've got broken toes, we want to figure out the best strategy for healing them, and we're not going to do a clinical trial. Instead, we're just going to say: well, sometimes people come in with broken toes, and I'm going to try to figure out over time which thing is best. Okay, all right. So in this case we're going to model it as a multi-armed bandit with three arms, the arms being the three treatments, and we're going to model each arm as a Bernoulli variable with an unknown parameter theta. So let's just do a quick check-your-understanding about the framework of bandits.
[00:15:15] Okay, great, I think most people are converging on this already. Yes: pulling an arm or taking an action is just the action we're actually doing. The second one: this is a better fit to the problem than an MDP, because we're only going to make one decision per patient, and we're also going to assume that whether one person's toe heals after we do this is independent of what we do when the next person shows up. These are totally independent processes, so we don't have any sort of sequential dependence: even though we're making a sequence of decisions, at each time point there's a new person and we're just going to decide what to do for them. And yes, this is right: if your theta_i is between zero and one, meaning your outcomes are not deterministic, sometimes the toe will heal and sometimes it won't.
[00:16:01] Okay, so one thing that we could do to solve this would be to use... yes, question? The question is to confirm that there is no time dependency of the probability distribution, that it has to be the same at every single step. Great question: we're going to assume for now that everything's stationary, meaning that the reward probability distribution is the same at every time step. There are lots of really interesting questions around non-stationarity; our lab doesn't work on that, but there's lots of other really interesting work on this, like with change-point detection. For now we're going to assume it's stationary, and that would include the fact that we don't suddenly get a new distribution of people for whom different things work. Good question.
[00:16:38] question all right so one thing you could imagine doing is just to be greedy
[00:16:41] could imagine doing is just to be greedy so what we're going to do in this case
[00:16:42] so what we're going to do in this case we're going to use Q today not to denote
[00:16:44] we're going to use Q today not to denote like a state action or discounted sum of
[00:16:46] like a state action or discounted sum of future Awards or you can think of it
[00:16:48] future Awards or you can think of it like that except for there's no State
[00:16:50] like that except for there's no State there's a single state um and um it's
[00:16:53] there's a single state um and um it's only over actions and it's only the
[00:16:54] only over actions and it's only the immediate reward so what Q here would
[00:16:57] immediate reward so what Q here would denote is what is just the expected
[00:16:59] denote is what is just the expected reward of
[00:17:00] reward of R and we can just estimate that by
[00:17:04] R and we can just estimate that by counting we can just look up every other
[00:17:05] counting we can just look up every other time you know we we did surgery what
[00:17:07] time you know we we did surgery what were the outcomes for that individual
[00:17:09] were the outcomes for that individual and we can average and what the greedy
[00:17:11] and we can average and what the greedy algorithm does is they just selects the
[00:17:13] algorithm does is they just selects the action with the highest
[00:17:14] action with the highest value and um takes that action observes
[00:17:17] value and um takes that action observes the outcome and
[00:17:20] repeats so let's think about what
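The estimate-and-argmax loop just described might look like this in Python. This is a sketch, not course code; the lowest-index tie-breaking and the try-each-arm-once initialization are my assumptions about details the lecture glosses over:

```python
def greedy(pull, n_arms, n_steps):
    """Greedy bandit algorithm: estimate Q(a) as the empirical mean
    reward of arm a, then always pull the arm with the highest
    current estimate (ties broken by lowest index)."""
    counts = [0] * n_arms    # number of times each arm was pulled
    totals = [0.0] * n_arms  # total reward observed per arm
    history = []
    for t in range(n_steps):
        if t < n_arms:
            arm = t  # initialization: try every arm once
        else:
            estimates = [totals[a] / counts[a] for a in range(n_arms)]
            arm = estimates.index(max(estimates))  # exploit only, no exploration
        reward = pull(arm)
        counts[arm] += 1
        totals[arm] += reward
        history.append((arm, reward))
    return history

# Toy demo with a deterministic environment: arm 1 always pays 1.
history = greedy(lambda a: 1 if a == 1 else 0, 3, 6)
print([arm for arm, _ in history])  # [0, 1, 2, 1, 1, 1]
```

Note that after the initial pass the loop never deliberately explores; that omission is exactly what the next example exposes.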
[00:17:22] So let's think about what happens when we do that. Imagine that this really is the true set of parameters: surgery, in our fake example, is actually the most effective, buddy taping is second, and doing nothing is not very effective. So imagine you start off, and this is pretty common with a lot of bandit algorithms: if you have a small finite set of actions, often you'll just start off by sampling everything once. Now, when you start to get into really large action spaces, like all of the ads you could recommend to people, we'll have to do something smarter, but in this case you can just sample all the actions once. Let's see what you would observe. Imagine that the first observation you get is zero for arm one, one for arm two, and zero for arm three. So which arm (this is not meant to be tricky) would you select next under the greedy algorithm?
[00:18:24] Which of them has the highest estimate? Right, exactly. You would deterministically take whichever one looks best, so the probability of taking A2 would be equal to one. So would that be good or bad? And in particular, would you ever select the optimal action? No: you actually couldn't, so greedy will never find it, because you have a really low estimate of the true value of the other two arms, and your average for A2 can never drop down to zero, because you've got at least one one. Even if you got zeros forever, which you're unlikely to get for A2, you're never going to sample A1 again. So what we would say in this case is that this means you will not converge to the optimal action, and this algorithm is not very good; we'll formalize what we mean by "not very good" in a second. This just illustrates why you should not just be greedy: you can lock onto a suboptimal action forever. It highlights why you need to do some form of exploration, because you can in fact make an infinite number of bad decisions.
[00:19:41] decisions so how do we quantify what it means to make an infinite number of good
[00:19:43] means to make an infinite number of good or bad decisions um we're going to use
[00:19:44] or bad decisions um we're going to use the word regret uh and we mean regret in
[00:19:47] the word regret uh and we mean regret in the case of sequential decision-
[00:19:50] the case of sequential decision- making okay so the idea in this case is
[00:19:54] making okay so the idea in this case is that we're going to think formally about
[00:19:56] that we're going to think formally about what is the difference between the
[00:19:58] what is the difference between the decisions that our algorithm makes and
[00:19:59] decisions that our algorithm makes and the optimal decisions um and then we're
[00:20:02] the optimal decisions um and then we're going to score the algorithm based on
[00:20:03] going to score the algorithm based on what the Gap is so in
[00:20:07] what the Gap is so in particular the optimal value just like
[00:20:10] particular the optimal value just like what we've seen in the past is the
[00:20:12] what we've seen in the past is the maximum overall the Q value so which
[00:20:14] maximum overall the Q value so which whichever arm has the best highest
[00:20:16] whichever arm has the best highest expected reward and the regret is the
[00:20:18] expected reward and the regret is the opportunity
[00:20:20] opportunity loss you could also think of this as the
[00:20:22] loss you could also think of this as the difference of the advantage the
[00:20:24] difference of the advantage the advantage of the optimal action compared
[00:20:25] advantage of the optimal action compared to the action that's
[00:20:27] to the action that's taken and so you're regret just like we
[00:20:29] taken and so you're regret just like we often use it colloquially is the gap
[00:20:31] often use it colloquially is the gap between what the agent could have
[00:20:34] between what the agent could have achieved and what it actually got um
[00:20:36] achieved and what it actually got um we're going to focus here of looking at
[00:20:38] we're going to focus here of looking at these in expectation of course due to
[00:20:40] these in expectation of course due to stochasticity there could be times where
[00:20:42] stochasticity there could be times where the particular reward you get for a
[00:20:44] the particular reward you get for a suboptimal action might be higher than
[00:20:46] suboptimal action might be higher than the action the reward you'd get for the
[00:20:48] the action the reward you'd get for the optimal action because of stochasticity
[00:20:50] optimal action because of stochasticity but we're just going to focus here on
[00:20:52] but we're just going to focus here on expectations so we're always comparing
[00:20:53] expectations so we're always comparing the expected reward of the optimal arm
[00:20:55] the expected reward of the optimal arm to the expected reward of the suboptimal
[00:20:57] to the expected reward of the suboptimal arm
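Written out in symbols (a standard formulation; the notation on the slides may differ slightly, with Q(a) the expected reward of arm a and a_t the arm pulled at step t):

```latex
V^{*} \;=\; Q(a^{*}) \;=\; \max_{a \in \mathcal{A}} Q(a),
\qquad
\text{regret at step } t \;=\; V^{*} - Q(a_t).
```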
[00:20:59] All right, so that's regret. How do we compute it? We're going to compare this over all time steps, and we're going to maximize cumulative reward, which is equivalent to minimizing total regret, because remember, the optimal value is unknown but it's fixed. So we really want to maximize our total reward, and we can either think of that as maximizing the reward you got over all time steps, or as minimizing the total regret. Normally in bandits we talk about minimizing total regret instead of maximizing total reward.
[00:21:35] maximizing total reward all right let's see how we can think about how big the
[00:21:37] see how we can think about how big the regret will be so um let's let NTA be
[00:21:41] regret will be so um let's let NTA be the number of times action a has been
[00:21:44] the number of times action a has been selected at time step
[00:21:49] selected at time step T so that means that if your agent has
[00:21:52] T so that means that if your agent has made T decisions you count up and see
[00:21:55] made T decisions you count up and see how many times did I take action A1 how
[00:21:57] how many times did I take action A1 how many times I action A2 how many times I
[00:21:59] many times I action A2 how many times I take action
[00:22:00] take action A3 the gap for a particular arm is
[00:22:03] A3 the gap for a particular arm is essentially you know its
[00:22:08] advantage of a star over a so it's just
[00:22:12] advantage of a star over a so it's just the difference between what is the
[00:22:14] the difference between what is the expected reward the optimal action would
[00:22:15] expected reward the optimal action would have gotten minus the expected reward
[00:22:18] have gotten minus the expected reward you get under this alternative action
[00:22:20] you get under this alternative action and we often call this the Gap I think
[00:22:22] and we often call this the Gap I think the the literature developed somewhat
[00:22:24] the the literature developed somewhat independently and so I think that's why
[00:22:26] independently and so I think that's why you know people don't commonly call it
[00:22:27] you know people don't commonly call it the advantage in the case of bits they
[00:22:29] the advantage in the case of bits they typically call it the Gap and the Gap
[00:22:31] typically call it the Gap and the Gap will turn out to be pretty important
[00:22:33] will turn out to be pretty important because as you might start to think
[00:22:35] because as you might start to think about intuitively depending on the size
[00:22:37] about intuitively depending on the size of the Gap it's going to be easier or
[00:22:39] of the Gap it's going to be easier or harder to learn which of two actions is
[00:22:41] harder to learn which of two actions is better so if the gaps are really large
[00:22:44] better so if the gaps are really large between like you know action one and
[00:22:45] between like you know action one and action two which means they have really
[00:22:46] action two which means they have really different expected rewards you're going
[00:22:49] different expected rewards you're going to need less samples to figure that out
[00:22:50] to need less samples to figure that out if the gaps are really really small
[00:22:52] if the gaps are really really small generally need a lot more data okay so G
[00:22:55] generally need a lot more data okay so G is going to just be a function of the
[00:22:56] is going to just be a function of the gaps and the counts so we can just think
[00:22:58] gaps and the counts so we can just think of the number of times that you took
[00:23:01] of the number of times that you took each action and the difference between
[00:23:03] each action and the difference between the you know and the and this Gap the
[00:23:04] the you know and the and this Gap the difference between the optimal action
[00:23:06] difference between the optimal action you should have taken and the reward you
[00:23:08] you should have taken and the reward you actually got and so the um our expected
[00:23:11] actually got and so the um our expected regret here is just going to be the sum
[00:23:13] regret here is just going to be the sum of times you take each action times the
[00:23:18] of times you take each action times the Gap and so what that means intuitively
[00:23:21] Gap and so what that means intuitively is that we want to don't we do not want
[00:23:23] is that we want to don't we do not want to take actions which have large gaps
[00:23:25] to take actions which have large gaps very much and it's more okay if we take
[00:23:28] very much and it's more okay if we take more of actions that are close to the
[00:23:29] more of actions that are close to the optimal
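To make the decomposition concrete, here is a minimal sketch (the arm means and counts below are invented for illustration, not from the lecture):

```python
def total_regret(means, counts):
    """Expected total regret: sum over arms of (pull count) * (gap).

    means[a] is the true expected reward of arm a; counts[a] is N_t(a).
    """
    v_star = max(means)                 # optimal expected reward
    gaps = [v_star - q for q in means]  # gap of a* over each arm a
    return sum(n * gap for n, gap in zip(counts, gaps))

# Hypothetical 3-armed bandit: a1 is optimal, pulled over 100 total steps.
means = [0.9, 0.7, 0.5]
counts = [50, 30, 20]
print(total_regret(means, counts))  # 30*0.2 + 20*0.4 = 14 (up to float rounding)
```

Note that only the suboptimal pulls contribute: the 50 pulls of the optimal arm add nothing, which is exactly the intuition that taking near-optimal actions often is fine.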
[00:23:31] For a lot of algorithms, what we try to do is bound this quantity: we try to say something in advance. In general this is something we can't compute directly, because it requires access to whatever the optimal action is and its value, and we don't know either of those things. But what we can do is design algorithms where we can prove something about how the regret grows.
[00:23:53] All right, let's first instantiate what I mean. Again, we can't do this in the real world, but we can do it for a toy example. Let's think about what the regret would look like in this case. This would be a series of steps if you were running your greedy algorithm: these are the actions, and this is time, one, two, three, four, five. We first take each of our actions once; in each of those cases the true optimal action was a1, and our observed rewards and our regrets were as shown. For the first pull, a1 really is the optimal action, so we had zero regret there; for the second and third pulls, our regret was the amount shown on the slide.

[00:24:45] This just shows you what the sizes would be, and this quantity here is actually the gap: the gap between the optimal arm and the arm that you're taking. So it shows you how regret can grow, and as you might expect, if you make bad decisions forever you're going to get linear regret. For example, here in the greedy case, if we now take a3 forever, our regret is going to be the total number of time steps T times the gap for a3, because that's how much we're losing on every single decision, and then we sum them all up.
[00:25:24] Okay, all right. So in general regret can be linear in the number of decisions, and part of the main thing we're going to be trying to do is avoid that. Ideally you would have constant regret or zero regret. By constant regret I mean that you make only a finite number of bad decisions: if you can figure out what the optimal arm is and then take it forever, you'll have constant regret, because it's just going to be, say, I make 10 decisions, then I learn the optimal arm, and then I take the optimal action forever. That's generally pretty hard to do. In the worst case you'll be linear: you'll make a mistake on every single decision forever. Typically what we're hoping to find is sublinear regret: it still might grow with the number of time steps, the number of decisions you're making, but it's not going to be linear. And we'll see a lot more about that.
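To see "greedy can be linear" in action, here is a small simulation sketch (Bernoulli arms with invented means; this is illustrative, not code from the course): with no exploration, one unlucky first pull of the optimal arm can make greedy exploit a suboptimal arm forever, so expected regret grows by the gap on every step.

```python
import random

def run_greedy(means, steps, seed):
    """Pure greedy on Bernoulli arms: pull each arm once, then always exploit.

    Returns the cumulative *expected* regret (sum of gaps of pulled arms).
    """
    random.seed(seed)
    n = [0] * len(means)    # pull counts
    q = [0.0] * len(means)  # empirical mean reward estimates
    v_star = max(means)
    regret = 0.0
    for t in range(steps):
        a = t if t < len(means) else q.index(max(q))  # init, then greedy
        r = 1.0 if random.random() < means[a] else 0.0
        n[a] += 1
        q[a] += (r - q[a]) / n[a]   # incremental mean update
        regret += v_star - means[a]
    return regret
```

On seeds where the optimal arm's first pull happens to return 0 while a worse arm returns 1, `run_greedy` never revisits the optimal arm, and the regret after 2T steps is roughly double the regret after T steps, i.e. linear.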
[00:26:15] Okay, all right. So what we're going to think about next is something we've seen before: the epsilon-greedy algorithm. Let's think about what sort of regret epsilon-greedy will have. We've seen that greedy can be linear; now let's see if there are some better things we can do. Just to refresh our memories: with probability 1 minus epsilon we're going to be greedy, selecting whichever action is the argmax of our estimates, and otherwise we're going to select a random action. That means that an epsilon fraction of the time we're going to be making some suboptimal decision, because, unless all of your arms are optimal and your gaps are zero, in which case it doesn't matter what arm you're picking, if you select things at random you're always going to be making some bad decision at each time point.
[00:27:03] Okay, so what does this look like? Imagine again that we sample all three arms to start, and that our epsilon is 0.1; I'm just going to work out what it will look like. In this case, with 90% probability we're going to be greedy, and in that case we will take actions a1 and a2 each with 45% probability, assuming that you split ties. And then with 10% probability we will take any of the actions, so that adds about 3.3% each to a1, a2, and a3. That's just to be concrete about what that would look like in this case; I'll skip through this. So the question here is: what will this regret look like? We now want to compute this for epsilon-greedy, to think about whether we will have sublinear regret for epsilon-greedy.
[00:28:12] Okay, all right. So let's assume that we're in a setting where there always exists at least one action such that the gap is nonzero. That means that not all arms are tied; if all arms are tied, again, it doesn't really matter what you do, because everything is the same, and so it doesn't matter what action you take. So this makes it a non-trivial decision-making problem. Let's think about whether epsilon equals 0.1 can have linear regret, and whether epsilon equals zero can have linear regret. This is generally asking: are there settings of epsilon for which you could get linear regret, and maybe settings of epsilon where you couldn't?

[00:28:57] [The in-class poll turned out to be missing from the posted materials; the instructor posts it, then asks the class to think for a minute and check their answer with a neighbor.]
[00:31:35] All right, I'm going to interrupt you for a second. I think one way that's useful to think about this is in terms of how many times we sample things. All of the arms are going to have a lower bound on the number of times we sample them, which is at least epsilon divided by the number of actions, times T, where T is the total number of decisions we make. And I think that can be a helpful way to see it: there's a big T here times a constant, which means you're going to have at least linear regret. So if epsilon is greater than zero, you will have linear regret; and if epsilon is equal to zero, you're greedy, and we just saw that that can have linear regret. So in either of these two cases, unfortunately, both are true.
[00:32:29] Anybody have any questions about that? Now, it turns out there are certainly better and worse ways of setting epsilon, but if you just set epsilon in a static way, it can be pretty bad. As you might remember from a while ago, we sometimes talked about decaying epsilon over time, and that can matter a lot too, but static epsilon is not great.

[00:32:52] All right, so let's look at what this can look like. If you think about how regret grows over time steps, these are very common plots when you look at bandits or some of the other approaches we'll see. If we consider total regret, you'd like regret to be zero. If you're greedy it can be linear; if you're epsilon-greedy it's normally a little bit better, but it's still linear; if you decay epsilon it can get a lot closer; and it is going to be possible to be sublinear for good choices of algorithms.
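One common decay schedule (this particular c/t form is a standard textbook choice, not necessarily the exact schedule from the course) can be sketched as:

```python
def decayed_epsilon(t, c=1.0, min_eps=0.0):
    """Exploration rate at step t (1-indexed) under a c/t schedule.

    Capped at 1.0 and floored at min_eps; exploration shrinks as t grows.
    """
    return max(min_eps, min(1.0, c / t))

# Exploration shrinks over time: t=1 -> 1.0 (capped), t=10 -> 0.5, t=100 -> 0.05
schedule = [decayed_epsilon(t, c=5.0) for t in (1, 10, 100)]
```

Because the per-step exploration probability shrinks, the regret contributed by random pulls grows like the sum of the epsilon values, which is logarithmic rather than linear for a c/t schedule.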
[00:33:24] One of the challenges here is that there can be some pretty good choices of epsilon, but they often depend on problem-dependent properties that we don't know in advance. So we need to have an algorithm which, before knowing anything about the problem, in terms of the gaps or anything like that, can be guaranteed to have sublinear regret.

[00:33:45] So first of all, let's think about what types of regret bounds we might get, and whether there is reason for hope. A problem-independent bound talks about how the regret grows as a function of T for any possible problem you might be given. What this might say is: I might give you an algorithm which is guaranteed to be sublinear in T no matter what bandit problem you put me in. It's just an algorithm that will work well for any potential domain you put it in, and it'll make a bound on its performance. Instance-dependent, or problem-dependent, bounds bound things as a function of the gap. Now, one of the really elegant things about a problem-dependent bound is that it doesn't mean the algorithm has to know the gaps; it just means that if it turns out the problem is easy, like there are really large gaps, you will have a much better regret. Some of my and my lab's work, and a number of other people's too, is often very interested in this, and I think at a high level what it means is that you have an algorithm that's adaptive to the problem: your algorithm is guaranteed to do really well if the problem is easier to learn in, and if it's harder, well, then you can't do well anyway; it'll do as well as it can.
[00:34:59] We'll talk about both types of bounds. [Student question: in general, is the gap usually small, given that we usually consider rewards between zero and one?] Great question; it totally depends on the domain. If you're looking at something like Bernoulli rewards, then it's naturally between zero and one; other domains might be very different, and you can always normalize. Whether the domain has really big gaps really depends. If you think about something like click-through rates for ads, click-through rates are really, really hard to optimize; it's often like 0.01 versus 0.011, because nobody likes ads. In those cases the gaps we're looking at can often be really tiny, so you'll generally need a lot of data, and having smart, data-efficient algorithms will matter a lot. There might be other cases where there are really big gaps; if the problem has really big gaps it's really easy, and so it tends not to matter too much what you do there, because you can quickly estimate them. Great question.
[00:36:02] Okay, all right. So here's a reason for hope. There's a nice lower bound by Lai and Robbins, from 1985, which characterizes the minimum regret you're going to incur as a function of the problem. It means that any algorithm is going to suffer at least this much in terms of regret. It says your regret is going to grow at least like log T, the number of time steps, the number of decisions you've made; and for any arm that is suboptimal, you're going to suffer a term with the gap in the numerator and the KL divergence, between the reward distribution of that arm and that of the optimal arm, in the denominator. But this should be promising, because it's sublinear: it's logarithmic, not linear, which means that according to this lower bound it is not impossible to have sublinear regret. And this would be considered a problem-dependent, or instance-dependent, bound, because it holds based on the unknown gaps.
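In symbols, a standard statement of the Lai–Robbins result (the exact form on the slide may differ), writing R_a for the reward distribution of arm a:

```latex
\liminf_{T \to \infty} \frac{\text{Regret}(T)}{\ln T}
\;\ge\;
\sum_{a \,:\, \Delta_a > 0} \frac{\Delta_a}{\mathrm{KL}\!\left(R_a \,\middle\|\, R_{a^{*}}\right)} .
```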
[00:37:07] gaps okay so now we're going to see one of my
[00:37:09] okay so now we're going to see one of my favorite ideas in the course um which is
[00:37:11] favorite ideas in the course um which is optimism under uncertainty which uh
[00:37:14] optimism under uncertainty which uh gives us I think it's a lovely principle
[00:37:17] gives us I think it's a lovely principle because it shows why it's provably
[00:37:18] because it shows why it's provably optimal to be optimistic about things
[00:37:21] optimal to be optimistic about things which is kind of beautiful um and it's
[00:37:23] which is kind of beautiful um and it's going to be one of the first things
[00:37:24] going to be one of the first things we're going to see that's going to allow
[00:37:25] we're going to see that's going to allow us to have sublinear regret okay so why
[00:37:28] us to have sublinear regret okay so why is optimism good and what do we mean by
[00:37:29] is optimism good and what do we mean by optimism in this case okay what we mean
[00:37:33] optimism in this case okay what we mean is we're going to choose actions or arms
[00:37:36] is we're going to choose actions or arms typ of there um that might have a high
[00:37:40] typ of there um that might have a high value well what happens when we choose
[00:37:43] value well what happens when we choose things that are
[00:37:44] things that are good so one thing that can happen is we
[00:37:47] good so one thing that can happen is we actually get high
[00:37:51] reward so that's good because that's our
[00:37:53] reward so that's good because that's our goal is we want to get high reward we
[00:37:55] goal is we want to get high reward we want to maximize reward SL minimize cost
[00:37:57] want to maximize reward SL minimize cost what's the other thing that can happen
[00:37:58] what's the other thing that can happen if we pick something that might be
[00:38:00] if we pick something that might be good might have high
[00:38:09] reward lower reward lower reward exactly
[00:38:12] reward lower reward lower reward exactly what okay so yeah those are the things
[00:38:14] what okay so yeah those are the things you can e get higher reward you can get
[00:38:15] you can e get higher reward you can get lower reward what happens if there's
[00:38:17] lower reward what happens if there's lower reward I mean of course we're that
[00:38:20] lower reward I mean of course we're that but like aside from that what happens do
[00:38:22] but like aside from that what happens do you think probably to our estimates like
[00:38:24] you think probably to our estimates like those Q estimates if we get low reward
[00:38:26] those Q estimates if we get low reward yeah improve prision exactly yeah
[00:38:30] yeah improve prision exactly yeah exactly remind me your name yeah so what
[00:38:33] exactly remind me your name yeah so what said is exactly right so basically
[00:38:36] said is exactly right so basically either you get high reward or you learn
[00:38:38] either you get high reward or you learn something okay so the other alternative
[00:38:41] something okay so the other alternative is you get low reward and you learn
[00:38:43] is you get low reward and you learn something and you're going to improve
[00:38:44] something and you're going to improve your
[00:38:46] your estimates and from the point of view of
[00:38:48] estimates and from the point of view of a reinforcement learning algorithm or
[00:38:49] a reinforcement learning algorithm or banded algorithm both of these are
[00:38:51] bandit algorithm both of these are
[00:38:53] really valuable right because either you're actually achieving your goal or
[00:38:55] you're actually achieving your goal or you are learning something so that in
[00:38:56] you are learning something so that in the future you won't make make bad
[00:38:57] the future you won't make make bad decisions in the future okay and so that
[00:38:59] decisions in the future okay and so that is why optimism is we're going to see
[00:39:02] is why optimism is we're going to see provably optimal
[00:39:04] provably optimal okay all right now of course that means
[00:39:06] okay all right now of course that means that we have to have an algorithm
[00:39:07] that we have to have an algorithm leverages the information we get when we
[00:39:09] leverages the information we get when we see low rewards okay so we're going to
[00:39:11] see low rewards okay so we're going to have to be formal about what it means to
[00:39:13] have to be formal about what it means to be might you know we're going to we're
[00:39:15] be might you know we're going to we're going to formalize this as quantifying
[00:39:18] going to formalize this as quantifying our
[00:39:20] our uncertainty so we're going to need to be
[00:39:22] uncertainty so we're going to need to be precise over our confidence intervals or
[00:39:23] precise over our confidence intervals or uncertainty bounds and then use that to
[00:39:25] uncertainty bounds and then use that to make
[00:39:26] make decisions okay
[00:39:28] decisions okay so in particular what we're going to do
[00:39:30] so in particular what we're going to do is we are going to estimate an upper
[00:39:31] is we are going to estimate an upper confidence Bound for each action value
[00:39:34] confidence Bound for each action value such that that confidence bound uh that
[00:39:36] such that that confidence bound uh that upper confidence bounds holds with high
[00:39:38] upper confidence bounds holds with high probability so we're going to make sure
[00:39:41] probability so we're going to make sure we're going to be frequentist today
[00:39:42] we're going to be frequentist today we're not going to be basian and don't
[00:39:43] we're not going to be basian and don't worry if you haven't done a lot on E on
[00:39:45] worry if you haven't done a lot on E on either of those but we're going to focus
[00:39:46] either of those but we're going to focus today on just high probability bounds so
[00:39:48] today on just high probability bounds so we're going to need a UT of a where that
[00:39:50] we're going to need a UT of a where that holds with high
[00:39:52] holds with high probability um and we're going to want
[00:39:54] probability um and we're going to want this to be dependent on how many times
[00:39:56] this to be dependent on how many times we've SL to the arm there are lots of
[00:39:59] we've SL to the arm there are lots of ways to quantify uncertainty we're going
[00:40:01] ways to quantify uncertainty we're going to focus today on a frequentist view and
[00:40:02] to focus today on a frequentist view and just thinking about counts and then the
[00:40:05] just thinking about counts and then the way we're going to behave the way that
[00:40:07] way we're going to behave the way that our agent is going to take actions is
[00:40:08] our agent is going to take actions is just going to pick whichever action has
[00:40:09] just going to pick whichever action has the highest upper confidence
[00:40:11] the highest upper confidence bound and there's a whole Suite of
[00:40:13] bound and there's a whole Suite of algorithms that are called UCB
[00:40:15] algorithms that are called UCB algorithms there are many algorithms
[00:40:17] algorithms there are many algorithms that are variants of this
[00:40:21] notion there's also ones that are called
[00:40:24] notion there's also ones that are called optimism in the face of uncertainty ofu
[00:40:27] optimism in the face of uncertainty ofu okay so it's a really simple idea and
[00:40:30] okay so it's a really simple idea and now the question is going to be how well
[00:40:32] now the question is going to be how well does this perform and how do we quantify
[00:40:34] does this perform and how do we quantify the
[00:40:36] uncertainty so let's go through Hoeffding's
[00:40:38] uncertainty so let's go through hopings inequality we're going to use it in
[00:40:39] inequality we're going to use it in homework three but I'm curious who has
[00:40:41] homework three but I'm curious who has seen it in previous
[00:40:43] seen it in previous classes okay maybe a couple people but
[00:40:46] classes okay maybe a couple people but most people I wouldn't expect you to
[00:40:47] most people I wouldn't expect you to okay so Hoeffding's inequality is a really
[00:40:50] okay so Hoeffding's inequality is a really useful inequality um the idea of it is
[00:40:53] useful inequality um the idea of it is we're just going to think about how
[00:40:54] we're just going to think about how different can our observed average from
[00:40:57] different can our observed average from the true true
[00:40:59] the true true mean so let's say we have n samples that
[00:41:04] mean so let's say we have n samples that are somewhere between Z and
[00:41:06] are somewhere between Z and one and this is our true expectation
[00:41:09] one and this is our true expectation this is their true mean which we don't
[00:41:10] this is their true mean which we don't know what it is and this is our sample
[00:41:12] know what it is and this is our sample mean just over the end
[00:41:14] mean just over the end samples what hopings inequality says is
[00:41:17] samples what hopings inequality says is that the difference between your
[00:41:18] that the difference between your empirical
[00:41:19] empirical estimate and the true
[00:41:22] estimate and the true estimate um if it if they're off by U
[00:41:26] estimate um if it if they're off by U then the probability of that happening
[00:41:28] then the probability of that happening is going down
[00:41:30] is going down exponentially which essentially means
[00:41:32] exponentially which essentially means that um uh as you have more data the
[00:41:34] that um uh as you have more data the chance that your empirical estimate is
[00:41:36] chance that your empirical estimate is really different than your true mean is
[00:41:38] really different than your true mean is going down exponentially
[00:41:40] going down exponentially fast like you can't have your empirical
[00:41:42] fast like you can't have your empirical average be you know 30 and the real
[00:41:45] average be you know 30 and the real thing is 2,000 if you have a lot of data
[00:41:47] thing is 2,000 if you have a lot of data you're going to converge at you know on
[00:41:49] you're going to converge at you know on the true mean which is of course what
[00:41:52] the true mean which is of course what you would hope but um that this says a
[00:41:54] you would hope but um that this says a formal thing about what the rate is so
[00:41:56] formal thing about what the rate is so let's just look for second and think a
[00:41:58] let's just look for second and think a little bit about what this can imply
[00:42:00] little bit about what this can imply okay so let's look at this part let's
[00:42:02] okay so let's look at this part let's say I'm going to do it for the absolute
[00:42:03] say I'm going to do it for the absolute value probability of e of
[00:42:05] value probability of e of x minus xn so this is again just our EMP
[00:42:10] x minus xn so this is again just our EMP our empirical mean the probability this
[00:42:12] our empirical mean the probability this is greater than me so that this gap
[00:42:14] is greater than me so that this gap between our empirical average and the
[00:42:16] between our empirical average and the true one is good and so why just to back
[00:42:18] true one is good and so why just to back up why are we doing all this we're going
[00:42:19] up why are we doing all this we're going to want to figure out a way to get an
[00:42:21] to want to figure out a way to get an upper bound on what the real mean is of
[00:42:24] upper bound on what the real mean is of this and so what this equation is going
[00:42:26] this and so what this equation is going to allow us to do is to try to figure
[00:42:28] to allow us to do is to try to figure out how big do we need to set u to be in
[00:42:31] out how big do we need to set u to be in order for us to get an upper bound on
[00:42:33] order for us to get an upper bound on what the true expected reward might be
[00:42:36] what the true expected reward might be for a particular arm okay all right so
[00:42:38] for a particular arm okay all right so let's see how we can do that all right
[00:42:40] let's see how we can do that all right so we're going to say this is less than
[00:42:42] so we're going to say this is less than I've gotten an absolute value here so
[00:42:44] I've gotten an absolute value here so we're going to use this
[00:42:48] version and we're going to set this to
[00:42:50] version and we're going to set this to Delta so this is going to be the
[00:42:51] Delta so this is going to be the confidence with which we want this um
[00:42:53] confidence with which we want this um this Pro uh confidence interval to hold
[00:42:56] this Pro uh confidence interval to hold so this is going to be
[00:42:58] so this is going to be want the CI to hold with this
[00:43:02] want the CI to hold with this probability with 1us Delta probility so
[00:43:05] probability with 1us Delta probility so we're going to try to construct an upper
[00:43:07] we're going to try to construct an upper competence bound that holds with least
[00:43:09] competence bound that holds with least probability 1us Delta okay so let's just
[00:43:11] probability 1us Delta okay so let's just do this now we're going to focus on this
[00:43:13] do this now we're going to focus on this hand this side so we're just going to do
[00:43:16] hand this side so we're just going to do some
[00:43:18] some algebra we set e^(-2nu^2) equal to Delta over 2 that means
[00:43:22] equal to Delta over 2 that means u^2 is equal to 1/(2n) log(2/Delta) which
[00:43:27] u^2 is equal to 1/(2n) log(2/Delta) which means u is equal to the square
[00:43:34] root okay so this gives us a range and
[00:43:38] root okay so this gives us a range and it says if we want the probability that
[00:43:40] it says if we want the probability that our empirical estimate differs from the
[00:43:43] our empirical estimate differs from the true mean by no more than
[00:43:47] true mean by no more than U um then it is sufficient to set u
[00:43:50] U um then it is sufficient to set u equal to
[00:43:52] equal to this okay so that means that we can say
[00:43:55] this okay so that means that we can say that X
[00:43:58] that X nus mu is less than or equal to expected
[00:44:01] nus mu is less than or equal to expected value of x Which is less than or equal
[00:44:04] value of x Which is less than or equal to xn plus
[00:44:06] to xn plus u with probability greater than equal to
[00:44:09] u with probability greater than equal to 1 Delta so that just created our upper
[00:44:12] 1 Delta so that just created our upper confidence bound so they said with high
[00:44:15] confidence bound so they said with high probability I can take my empirical
[00:44:17] probability I can take my empirical estimate I add in my mu my mu here note
[00:44:20] estimate I add in my mu my mu here note just depends on the number of samples
[00:44:22] just depends on the number of samples that I have and that gives me my upper
[00:44:24] that I have and that gives me my upper upper confidence bound
[00:44:27] upper confidence bound and so we can use this we can use this
[00:44:29] and so we can use this we can use this given our data it just requires us to
[00:44:31] given our data it just requires us to count how many times we've sampled
[00:44:32] count how many times we've sampled things compute the average and then add
[00:44:34] things compute the average and then add on this additional bonus we often call
[00:44:36] on this additional bonus we often call these like bonus terms in these
[00:44:39] these like bonus terms in these cases so this is going to create the
[00:44:41] cases so this is going to create the ucb1 algorithm which is at every time
[00:44:44] ucb1 algorithm which is at every time step we're just going to compute this is
[00:44:45] step we're just going to compute this is again remember the qad is this the
[00:44:47] again remember the qad is this the empirical average
[00:44:57] and then we add on this bonus
[00:45:00] and then we add on this bonus term
[00:45:02] term okay and this is again just the number
[00:45:05] okay and this is again just the number of
[00:45:07] of samples of a
[00:45:10] samples of a after T time
[00:45:17] steps and for those of you familiar with
[00:45:19] STS and for those of you familiar with things like Union bounds and stuff we'll
[00:45:21] things like Union bounds and stuff we'll come to that shortly so this is we
[00:45:24] come to that shortly so this is we haven't really fully made sure that all
[00:45:25] haven't really fully made sure that all these competence intervals are going to
[00:45:26] these competence intervals are going to hold overall time steps um so we'll be a
[00:45:30] hold overall time steps um so we'll be a little bit more careful about what Delta
[00:45:31] little bit more careful about what Delta needs to be soon yeah it's called ucb1
[00:45:35] needs to be soon yeah it's called ucb1 like why is it one so there's a lot of
[00:45:38] like why is it one so there's a lot of different variants um of the UCB
[00:45:40] different variants um of the UCB algorithm I think this is one of the
[00:45:42] algorithm I think this is one of the first ones it was I think
[00:45:44] first ones it was I think our UE
[00:45:47] our UE 2002 I think it's the one they named
[00:45:49] 2002 I think it's the one they named first in their
[00:45:52] paper but this notion of kind of
[00:45:54] paper but this notion of kind of optimism under uncertainty is certainly
[00:45:56] optimism under uncertainty is certainly around before the 2000s um but I think
[00:45:59] around before the 2000s um but I think this is the paper where they first did
[00:46:00] this is the paper where they first did some of these nice
[00:46:01] some of these nice proofs
[00:46:03] proofs okay all
[00:46:05] okay all right okay so let's think about what
[00:46:07] right okay so let's think about what that how different that algorithm would
[00:46:09] that how different that algorithm would look like in our types of
[00:46:11] look like in our types of settings okay so we're going to use
[00:46:14] settings okay so we're going to use optimism under
[00:46:15] optimism under uncertainty and what we're going to do
[00:46:18] uncertainty and what we're going to do in this case is we're first going to
[00:46:21] in this case is we're first going to sample each arm once so same as
[00:46:25] before and this is what we're going to
[00:46:27] before and this is what we're going to get and now what we do is we're going to
[00:46:29] get and now what we do is we're going to compute those upper competence
[00:46:32] compute those upper competence bounds okay so what we want to do is
[00:46:34] bounds okay so what we want to do is compute this upper competence bounds for
[00:46:36] compute this upper competence bounds for each of the arms so UCB of
[00:46:40] each of the arms so UCB of A1 2 A3 okay and so this would be 1 + <
[00:46:46] A1 2 A3 okay and so this would be 1 + < TK of 2 log or Delta over 1 same for
[00:46:53] TK of 2 log or Delta over 1 same for this one and then 0 + < TK 2 log l 1 /
[00:47:00] Delta okay so in this case you would
[00:47:03] Delta okay so in this case you would pick A1 or A2 with equal probability
[00:47:05] pick A1 or A2 with equal probability because the upper confidence bound is
[00:47:10] identical okay so we select the argmax
[00:47:13] identical okay so we select the argmax let's say that we
[00:47:19] pick okay and now we're going to again
[00:47:21] pick okay and now we're going to again compute the upper confidence
[00:47:23] compute the upper confidence bound so in this case what would happen
[00:47:26] bound so in this case what would happen is you would still have you would have
[00:47:28] is you would still have you would have the following you would have UCB of A1
[00:47:32] the following you would have UCB of A1 is equal to 1 + < TK 2 log 1 / Delta 2
[00:47:39] is equal to 1 + < TK 2 log 1 / Delta 2 UCB A2 is = to 1 + < TK 2 log 1 / Delta
[00:47:46] UCB A2 is = to 1 + < TK 2 log 1 / Delta over 1 and
[00:47:48] over 1 and UCB A3 is equal to 0+ < TK 2 log 1/
[00:47:54] UCB A3 is equal to 0+ < TK 2 log 1/ Delta 1 so what you can see here um is
[00:48:00] Delta 1 so what you can see here um is that we've now reduced our upper
[00:48:02] that we've now reduced our upper competence bound because we've learned
[00:48:03] competence bound because we've learned something and this case we happen to
[00:48:05] something and this case we happen to have atin gotten high reward but either
[00:48:07] have atin gotten high reward but either way we learned something we could shrink
[00:48:08] way we learned something we could shrink our competence intervals because we have
[00:48:09] our competence intervals because we have additional
[00:48:10] additional accounts just make sure understanding
[00:48:13] accounts just make sure understanding correctly the Delta is something that we
[00:48:15] correctly the Delta is something that we would select to kind of figure out or to
[00:48:18] would select to kind of figure out or to choose our confidence B yeah great
[00:48:21] choose our confidence B yeah great question so yes we haven't talked a lot
[00:48:22] question so yes we haven't talked a lot about how we set Delta there're going to
[00:48:24] about how we set Delta there're going to be a couple criteria for it in general
[00:48:26] be a couple criteria for it in general um
[00:48:27] um we're going to need all of these
[00:48:28] we're going to need all of these confidence bounds to hold for all time
[00:48:30] confidence bounds to hold for all time steps for all arms so we're going to
[00:48:31] steps for all arms so we're going to need to do some Union bounding to make
[00:48:33] need to do some Union bounding to make sure all of them simultaneously hold um
[00:48:37] sure all of them simultaneously hold um because we want to have it with high
[00:48:38] because we want to have it with high probability that all of these things are
[00:48:40] probability that all of these things are valid at the same time um we also in the
[00:48:43] valid at the same time um we also in the simple setting we know how many total
[00:48:45] simple setting we know how many total decisions we're making and so we need to
[00:48:46] decisions we're making and so we need to use that information as
[00:48:48] use that information as well and then you can um use those two
[00:48:50] well and then you can um use those two things toe to bound the regret as we'll
[00:48:53] things toe to bound the regret as we'll see so you can see this is why it's a
[00:48:56] see so you can see this is why it's a bit different than greedy because we are
[00:48:58] bit different than greedy because we are still using our empirical averages but
[00:49:00] still using our empirical averages but then these confidence intervals are
[00:49:01] then these confidence intervals are going to change so that over time these
[00:49:06] going to change so that over time these will sort of alternate often depending
[00:49:07] will sort of alternate often depending on which um uh rewards you're getting
[00:49:10] on which um uh rewards you're getting and you may periodically take A3 because
[00:49:12] and you may periodically take A3 because with that little data there is some
[00:49:14] with that little data there is some probability that A3 is just as good as
[00:49:17] probability that A3 is just as good as A1 and A2 particularly after you get
[00:49:19] A1 and A2 particularly after you get additional data so sort of alternate
[00:49:21] additional data so sort of alternate between the arms based on these upper
[00:49:23] between the arms based on these upper confidence bounds
[00:49:26] [Applause]
[00:49:28] [Applause] okay let's go
[00:49:30] okay let's go ahead I'll skip through those here let's
[00:49:32] ahead I'll skip through those here let's go to here okay so this is as were just
[00:49:35] go to here okay so this is as were just asking it's a little bit subtle um if
[00:49:38] asking it's a little bit subtle um if you have a fixed number of time steps
[00:49:40] you have a fixed number of time steps like you know that total you're going to
[00:49:41] like you know that total you're going to make like Big T
[00:49:43] make like Big T decisions you can set T to be roughly
[00:49:43] decisions you can set Delta to be roughly
[00:49:47] you probably want to divide this by
[00:49:50] T times the size of A um this is because you can use a union
[00:49:55] bound so why are we doing this we want these upper confidence bounds to be
[00:49:56] these upper confidence bounds to be valid and we need them to be valid at
[00:49:58] valid and we need them to be valid at every single time step because we are
[00:50:00] every single time step because we are using them to make
[00:50:01] using them to make decisions uh so this is also related to
[00:50:05] decisions uh so this is also related to false Discovery and other things like
[00:50:06] false Discovery and other things like that if you've heard about them in
[00:50:07] that if you've heard about them in machine learning so what we're going to
[00:50:09] machine learning so what we're going to use here is we're going to think about
[00:50:10] use here is we're going to think about all of these as being events that these
[00:50:12] all of these as being events that these confidence bounds are hold and what we
[00:50:14] confidence bounds are hold and what we mean by that is that they really do
[00:50:16] mean by that is that they really do contain the true um the True Value the
[00:50:18] contain the true um the True Value the true unknown value with high probability
[00:50:21] true unknown value with high probability so what we're going to say is the
[00:50:22] so what we're going to say is the probability that all of these events
[00:50:23] probability that all of these events hold which means that all of our
[00:50:25] hold which means that all of our confidence intervals are valid for all
[00:50:27] confidence intervals are valid for all of the arms for all of the time steps
[00:50:29] of the arms for all of the time steps we're just going to use a union bound
[00:50:30] we're just going to use a union bound which says we're just going to sum over
[00:50:32] which says we're just going to sum over the probability of each of them over all
[00:50:34] the probability of each of them over all of those events so that would be roughly
[00:50:36] of those events so that would be roughly the number of arms times Big T and so
[00:50:39] the number of arms times Big T and so that's why you can then just divide your
[00:50:41] that's why you can then just divide your confidence interval your Delta sorry you
[00:50:45] confidence interval your Delta sorry you can just divide your Delta into Delta
[00:50:47] can just divide your Delta into Delta divide T * the size of your a and that
[00:50:50] divide T * the size of your a and that generally is
[00:50:51] generally is sufficient and just to think about what
[00:50:53] sufficient and just to think about what that will do in terms of your bounds so
[00:50:55] that will do in terms of your bounds so remember we had a log 1/ Delta term so
[00:50:59] remember we had a log 1/ Delta term so that means you would get something like
[00:51:00] that means you would get something like this log ta a /
[00:51:06] Delta so generally the union bounding
[00:51:09] Delta so generally the union bounding sort of blows up your log term um
[00:51:11] sort of blows up your log term um there's various approaches including law
[00:51:13] there's various approaches including law of iterated logarithms and others to try
[00:51:15] of iterated logarithms and others to try to get this term to be smaller so you
[00:51:18] to get this term to be smaller so you can do tighter things than
[00:51:24] this okay so let's think about I I
[00:51:27] this okay so let's think about I I promised you that we're going to be able
[00:51:28] promised you that we're going to be able to use this type of idea to get
[00:51:30] to use this type of idea to get sublinear regret so let's um go through
[00:51:33] sublinear regret so let's um go through a proof sketch to think about how this
[00:51:36] a proof sketch to think about how this actually enables us to have to get much
[00:51:38] actually enables us to have to get much better performance than what we've seen
[00:51:40] better performance than what we've seen before all right so what this statement
[00:51:44] before all right so what this statement says and I'll just put a pointer in so
[00:51:46] says and I'll just put a pointer in so it's in the um it's in the references
[00:51:49] it's in the um it's in the references for under the website but there's a
[00:51:51] for under the website but there's a great book on B I think it's just called
[00:51:53] great book on B I think it's just called banded algorithms
[00:51:58] by Tor
[00:52:01] ladimore and Chaba
[00:52:05] sasari which I think maybe came out in
[00:52:07] sasari which I think maybe came out in 2019 or or 2000 I'm trying to remember
[00:52:10] 2019 or or 2000 I'm trying to remember um but they have a great book so it came
[00:52:11] um but they have a great book so it came out of a series of blog posts they were
[00:52:13] out of a series of blog posts they were doing on multiarm Bandits and then they
[00:52:15] doing on multiarm Bandits and then they turned it into a book um and so there's
[00:52:17] turned it into a book um and so there's a really nice one and if you go there
[00:52:18] a really nice one and if you go there it's I think approximately chapter
[00:52:21] it's I think approximately chapter 7 they're going to do a much more
[00:52:23] 7 they're going to do a much more rigorous version of this proof compared
[00:52:24] rigorous version of this proof compared to what I'm doing today um what I'm
[00:52:26] to what I'm doing today um what I'm going to try to do today is just to give
[00:52:28] going to try to do today is just to give you a flavor of um types of bounds that
[00:52:30] you a flavor of um types of bounds that you might want to uh prove in these
[00:52:32] you might want to uh prove in these sorts of cases and how we end up making
[00:52:35] sorts of cases and how we end up making getting a sublinear
[00:52:36] getting a sublinear regret so what this result says is the
[00:52:39] regret so what this result says is the following if you think back what we said
[00:52:41] following if you think back what we said before is we could bound um the expected
[00:52:44] before is we could bound um the expected regret by how many times we make we
[00:52:48] regret by how many times we make we choose an arm and how much gap or loss
[00:52:51] choose an arm and how much gap or loss we have whenever we choose it and so one
[00:52:54] we have whenever we choose it and so one thing that we could do is then try to
[00:52:56] thing that we could do is then try to just think about well we you know we
[00:52:58] just think about well we you know we don't know what the gaps are but the
[00:52:59] don't know what the gaps are but the gaps we can just write down as the
[00:53:00] gaps we can just write down as the difference between the the expected
[00:53:02] difference between the the expected reward of that arm versus the true
[00:53:03] reward of that arm versus the true reward of that arm that's not something
[00:53:05] reward of that arm that's not something we can influence the thing that we can
[00:53:07] we can influence the thing that we can influence is how many times we're
[00:53:08] influence is how many times we're selecting bad arms so what this says is
[00:53:11] selecting bad arms so what this says is that if an arm is suboptimal the number
[00:53:14] that if an arm is suboptimal the number of times that we pull it number of times
[00:53:16] of times that we pull it number of times we take that action in Upper confidence
[00:53:18] we take that action in Upper confidence bounds scales as a constant C Prime not
[00:53:21] bounds scales as a constant C Prime not going to tell you what that is um often
[00:53:23] going to tell you what that is um often in the algorithms they don't tell you
[00:53:24] in the algorithms they don't tell you what that is either I mean it'll be
[00:53:25] what that is either I mean it'll be somewhere in the print um the point is
[00:53:28] somewhere in the print um the point is that constant can't depend on parts of
[00:53:31] that constant can't depend on parts of the the domain so it can't depend on the
[00:53:33] the the domain so it can't depend on the number of arms or the gaps or things
[00:53:34] number of arms or the gaps or things like that it could be like 37 for
[00:53:37] like that it could be like 37 for example so a constant time log of 1/
[00:53:41] example so a constant time log of 1/ Delta Delta 2 + < ^ 2 3 +
[00:53:45] Delta Delta 2 + < ^ 2 3 + 1
[00:53:47] 1 okay so why is this interesting before
[00:53:49] okay so why is this interesting before we get into how do we prove this this is
[00:53:51] we get into how do we prove this this is interesting um because it says if the
[00:53:54] interesting um because it says if the Gap is large we're going to take it many
[00:53:56] Gap is large we're going to take it many less
[00:53:57] less times so if the Gap is really small then
[00:54:00] times so if the Gap is really small then it means that we're going to um we might
[00:54:02] it means that we're going to um we might sample that action a lot more and if the
[00:54:05] sample that action a lot more and if the Gap is large we're going to take it
[00:54:08] Gap is large we're going to take it less and then we can combine that with
[00:54:10] less and then we can combine that with this
[00:54:12] this equation and what happens in that case
[00:54:15] equation and what happens in that case is so just I'll go through that part
[00:54:17] is so just I'll go through that part before we actually s think about so what
[00:54:18] before we actually s think about so what we're going to focus on doing a proof
[00:54:20] we're going to focus on doing a proof sketch of for today is to focus on this
[00:54:23] sketch of for today is to focus on this part but let's just think if we could
[00:54:25] part but let's just think if we could prove that why that would show the
[00:54:26] prove that why that would show the second well what we would get in this
[00:54:28] second well what we would get in this case is we would say we get this term
[00:54:31] case is we would say we get this term plugged into
[00:54:33] plugged into here and the main thing that would
[00:54:35] here and the main thing that would happen
[00:54:37] happen there is this would become Delta because
[00:54:40] there is this would become Delta because we multiplied it by a Delta on top and
[00:54:43] we multiplied it by a Delta on top and then here if you assume that everything
[00:54:46] then here if you assume that everything is bounded between 0 and one then the
[00:54:49] is bounded between 0 and one then the Deltas are at most one too so you can
[00:54:52] Deltas are at most one too so you can get this is just the number of actions *
[00:54:55] get this is just the number of actions * 1 plus s three this
[00:54:58] 1 plus s three this term so this just shows how your you
[00:55:00] term so this just shows how your you know what your total regret would be in
[00:55:02] know what your total regret would be in this case your total expected
[00:55:04] this case your total expected regret as I said there's quite a bit
[00:55:06] regret as I said there's quite a bit more subtleties to the formal proof but
[00:55:08] more subtleties to the formal proof but this just gives sort of a rough idea so
[00:55:10] this just gives sort of a rough idea so we have any questions of that before we
[00:55:12] we have any questions of that before we dig into how we show the first part
[00:55:14] dig into how we show the first part which is the total number of times we're
[00:55:16] which is the total number of times we're going to take arms we're going to pull a
[00:55:17] going to take arms we're going to pull a particular arm scales with one over the
[00:55:21] particular arm scales with one over the size of the Gap squared
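Before walking through the proof, here is a minimal UCB sketch in Python (my own illustration, not the course's starter code); the exploration bonus sqrt(2 log t / N_t(a)) plays the role of the sqrt(C log(1/δ) / N_t(a)) term in the theorem, with a standard textbook choice of constant:

```python
import math
import random

def ucb(means, T, seed=0):
    """Pull arms for T rounds with UCB; return how often each arm was chosen.

    means: true Bernoulli reward probabilities per arm (unknown to the agent).
    """
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K    # N_t(a): number of pulls per arm
    totals = [0.0] * K  # summed observed rewards per arm
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1   # initialization: pull every arm once
        else:
            # pick the arm with the highest upper confidence bound:
            # empirical mean plus a Hoeffding-style exploration bonus
            a = max(range(K),
                    key=lambda i: totals[i] / counts[i]
                    + math.sqrt(2 * math.log(t) / counts[i]))
        reward = 1.0 if rng.random() < means[a] else 0.0
        counts[a] += 1
        totals[a] += reward
    return counts
```

On a two-arm problem with a large gap, the suboptimal arm's pull count stays small relative to T, which is the N_T(a) behavior the theorem is describing.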
[00:55:30] all right let's go through it so this is
[00:55:31] all right let's go through it so this is going to heavily rely on um the Hoeffding
[00:55:36] going to heavily rely on um the hting inequality and upper confidence bounds
[00:55:38] inequality and upper confidence bounds so remember
[00:55:40] so remember um what we saw before is let's imagine
[00:55:43] um what we saw before is let's imagine that we've got this so we're going to
[00:55:45] that we've got this so we're going to say this was our our upper confidence
[00:55:47] say this was our our upper confidence bound so we have this upper confidence
[00:55:49] bound so we have this upper confidence bound and again I'm going to
[00:55:52] bound and again I'm going to be loose with
[00:55:56] be loose with the Deltas okay you'd have to be a
[00:55:59] the Deltas okay you'd have to be a little bit more formal of it in general
[00:56:01] little bit more formal of it in general but let's look at this so this is going
[00:56:03] but let's look at this so this is going to be the True
[00:56:07] to be the True Value and this is our empirical estimate
[00:56:13] okay so what Hoeffding's inequality had told
[00:56:15] okay so what Hoeffding's inequality had told us is to say the difference between the
[00:56:18] us is to say the difference between the true expected value for an arm and your
[00:56:19] true expected value for an arm and your empirical
[00:56:21] empirical average
[00:56:23] average is greater than this quantity upper
[00:56:26] is greater than this quantity upper confidence bound with probability no
[00:56:28] confidence bound with probability no more than one uh Delta over T okay so
[00:56:33] more than one uh Delta over T okay so now let's think about the following
[00:56:35] now let's think about the following let's think about the
[00:56:38] let's think about the times we pull a which is not equal to a
[00:56:44] times we pull a which is not equal to a star and Delta a is not equal to zero
[00:56:49] star and Delta a is not equal to zero these are kind of the only things we
[00:56:50] these are kind of the only things we care about in terms of regret if we're
[00:56:52] care about in terms of regret if we're pulling a star we have zero regret if we
[00:56:55] pulling a star we have zero regret if we are pulling an arm that that is has
[00:56:57] are pulling an arm that that is has Delta a equals z that also means that it
[00:57:00] Delta a equals z that also means that it has zero regret because it means it's
[00:57:01] has zero regret because it means it's tied with an optimal arm so the only
[00:57:03] tied with an optimal arm so the only things that we care about bounding here
[00:57:05] things that we care about bounding here is to think about for that NT of a how
[00:57:08] is to think about for that NT of a how many times are we pulling arms that are
[00:57:09] many times are we pulling arms that are not optimal
[00:57:11] not optimal okay all right so what we're going to do
[00:57:14] okay all right so what we're going to do is kind of observed a couple so
[00:57:17] is kind of observed a couple so if the confidence interval
[00:57:20] if the confidence interval holds so we can think this if this holds
[00:57:23] holds so we can think this if this holds then we have the following we have the Q
[00:57:26] then we have the following we have the Q a
[00:57:27] a minus C log 1 / Delta /
[00:57:33] minus C log 1 / Delta / NT so here I'll say if one
[00:57:41] holds okay which is less than or equal
[00:57:45] holds okay which is less than or equal to QT hat of a Which is less than or
[00:57:48] to QT hat of a Which is less than or equal to Q of a
[00:57:51] equal to Q of a plus < TK C log 1 / Delta / n T of a
[00:57:57] plus < TK C log 1 / Delta / n T of a this just says if your confidence
[00:57:59] this just says if your confidence intervals holds what it means for it to
[00:58:00] intervals holds what it means for it to hold is that that confidence interval is
[00:58:02] hold is that that confidence interval is wide enough that it contains your true
[00:58:04] wide enough that it contains your true value and the upper confidence part is
[00:58:06] value and the upper confidence part is higher than that and the lower
[00:58:07] higher than that and the lower confidence bound is lower than your true
[00:58:09] confidence bound is lower than your true value so this is just holds if our
[00:58:11] value so this is just holds if our confidence intervals hold okay
[00:58:15] confidence intervals hold okay now if we pull a instead of a star so
[00:58:21] now if we pull a instead of a star so under UCB algorithm
[00:58:27] we have the
[00:58:28] we have the following we know
[00:58:30] following we know that the upper confidence bound of a was
[00:58:35] that the upper confidence bound of a was greater than the
[00:58:41] upper because that's why we picked its
[00:58:43] upper because that's why we picked its alternative action so in this case if we
[00:58:47] alternative action so in this case if we pull this arm a that means that its
[00:58:49] pull this arm a that means that its upper competence bound was different
[00:58:50] upper competence bound was different than the op competence bound of the
[00:58:52] than the op competence bound of the optimal action and it was more preferred
[00:58:54] optimal action and it was more preferred so that's the only time we ever take the
[00:58:56] so that's the only time we ever take the wrong action is if its upper confidence
[00:58:58] wrong action is if its upper confidence bound is higher than the other
[00:59:00] bound is higher than the other actions so let's write down what that
[00:59:02] actions so let's write down what that means in terms of its upper bounds so
[00:59:05] means in terms of its upper bounds so the definition of upper bounds here is
[00:59:08] the definition of upper bounds here is that QT of a plus s < TK C log 1 / Delta
[00:59:08] that Q-hat_t(a) plus sqrt(C log(1/δ) / N_t(a))
[00:59:14] is greater than Q-hat_t(a star)
[00:59:21] plus sqrt(C log(1/δ) / N_t(a star))
[00:59:30] because that's just the definition
[00:59:34] of our two upper confidence bound so it says okay I'm only going to take this
[00:59:35] says okay I'm only going to take this other non-optimal action because its
[00:59:37] other non-optimal action because its upper confidence bound was actually
[00:59:39] upper confidence bound was actually higher than the upper confidence bound
[00:59:40] higher than the upper confidence bound with the optimal action okay and then we
[00:59:44] with the optimal action okay and then we notice so let's just label
[00:59:47] notice so let's just label them so we're going to call this two
[00:59:50] them so we're going to call this two call this three so now we're going to
[00:59:53] call this three so now we're going to subtitute
[00:59:56] in from two
[01:00:00] in from two okay all right so we
[01:00:02] okay all right so we know that this is greater than QT of a
[01:00:08] know that this is greater than QT of a star from equation two because we know
[01:00:12] star from equation two because we know that the upper confidence bound on the
[01:00:14] that the upper confidence bound on the optimal action also holds so its upper
[01:00:17] optimal action also holds so its upper confidence bound has to be higher than
[01:00:19] confidence bound has to be higher than its true
[01:00:20] its true value
[01:00:23] value okay all right so now what do I have
[01:00:28] I have
[01:00:31] that
[01:00:34] that q and let me write one more thing here
[01:00:36] q and let me write one more thing here so
[01:00:38] so similarly just check that I get that
[01:00:40] similarly just check that I get that right right one two
[01:00:47] three good I just want to make sure I
[01:00:50] three good I just want to make sure I got that one right
[01:01:02] yes okay so that means
[01:01:05] yes okay so that means that
[01:01:08] QT oh hold
[01:01:11] QT oh hold on all
[01:01:14] on all right so this is going to mean that Q of
[01:01:19] right so this is going to mean that Q of a plus I'm confusing myself but we'll
[01:01:22] a plus I'm confusing myself but we'll figure it out in a
[01:00:24] figure it out in a second: 2 sqrt(C log(1/δ) / N_t(a))
[01:00:29] is greater than
[01:00:33] is greater than Q of a star, oops I should have written this
[01:01:42] here okay so let me just make sure I did
[01:01:45] here okay so let me just make sure I did that correctly because I want that to
[01:01:47] that correctly because I want that to end up going in this case let just make
[01:01:50] end up going in this case let just make sure that I did
[01:01:52] sure that I did that in the right way
[01:01:57] feel like I'm off by a
[01:02:00] feel like I'm off by a constant all right I'll double check the
[01:02:03] constant all right I'll double check the constants afterwards I'll just write a
[01:02:04] constants afterwards I'll just write a note um so I'll check the constants okay
[01:02:09] note um so I'll check the constants okay but the the main formula is going to be
[01:02:11] but the the main formula is going to be fine even if you drop the two here that
[01:02:12] fine even if you drop the two here that would um here so what we're going to
[01:02:16] would um here so what we're going to have in this
[01:02:19] case something's bothering I'll see if I
[01:02:21] case something's bothering I'll see if I can figure out in a second um so
[01:02:25] can figure out in a second um so what we want to argue in this case is
[01:02:28] what we want to argue in this case is that
[01:02:30] that the Q of a that we have plus two of the
[01:02:35] the Q of a that we have plus two of the competence intervals is going to be
[01:02:37] competence intervals is going to be greater than Q of a star and I'm
[01:02:39] greater than Q of a star and I'm confusing myself slightly now and I'll
[01:02:41] confusing myself slightly now and I'll check into it later but what this would
[01:02:43] check into it later but what this would mean in this case is let's assume this
[01:02:45] mean in this case is let's assume this holds for a sec I'll make sure I get the
[01:02:46] holds for a sec I'll make sure I get the explanation for next week or I'll just
[01:02:47] explanation for next week or I'll just put on Ed um what we'd have in this case
[01:02:50] put on Ed um what we'd have in this case is we're going to have that 2 < TK C log
[01:02:55] is we're going to have that 2 < TK C log 1 Delta over NT of
[01:02:59] 1 Delta over NT of a is greater than Q of a star minus Q of
[01:03:04] a is greater than Q of a star minus Q of a which is equal to Delta
[01:03:08] a which is equal to Delta a let's go to the next
[01:03:12] a let's go to the next slide
[01:03:13] slide okay so if we have this in this case
[01:03:17] okay so if we have this in this case what we can then argue is that in this
[01:03:21] what we can then argue is that in this situation what we have here is that
[01:03:26] situation what we have here is that we can rearrange this to the other side
[01:03:29] we can rearrange this to the other side so let me just do the algebra for that
[01:03:31] so let me just do the algebra for that part so what we're going to have is
[01:03:33] part so what we're going to have is we're going to say that 4 * C log 1 /
[01:03:38] we're going to say that 4 * C log 1 / Delta / NT of a is greater than equal to
[01:03:44] Delta / NT of a is greater than equal to Delta a
[01:03:45] Delta a 2 which means that if we rearrange this
[01:03:48] 2 which means that if we rearrange this here we have NT of a is less than or
[01:03:51] here we have NT of a is less than or equal to 4 C log 1 / Delta
[01:03:58] squ okay and that
[01:04:01] squ okay and that looks really like
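Putting the pieces together, the chain the proof sketch uses is: event (1) for both arms, plus the fact that UCB preferred arm a over a*, gives

```latex
Q(a) + 2\sqrt{\tfrac{C\log(1/\delta)}{N_t(a)}}
\;\ge\; \hat{Q}_t(a) + \sqrt{\tfrac{C\log(1/\delta)}{N_t(a)}}
\;\ge\; \hat{Q}_t(a^*) + \sqrt{\tfrac{C\log(1/\delta)}{N_t(a^*)}}
\;\ge\; Q(a^*)
\;\Longrightarrow\;
2\sqrt{\tfrac{C\log(1/\delta)}{N_t(a)}} \;\ge\; Q(a^*) - Q(a) = \Delta_a
\;\Longrightarrow\;
N_t(a) \;\le\; \frac{4\,C\log(1/\delta)}{\Delta_a^2}
```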
[01:04:06] this so what is this saying intuitively
[01:04:09] this so what is this saying intuitively intuitively this is saying if your
[01:04:11] intuitively this is saying if your confidence bounds hold and you use them
[01:04:14] confidence bounds hold and you use them to make decisions then if those
[01:04:16] to make decisions then if those confidence bounds are holding then the
[01:04:18] confidence bounds are holding then the only time that you make a decision that
[01:04:20] only time that you make a decision that is wrong is where these confidence
[01:04:22] is wrong is where these confidence bounds is large enough that it
[01:04:24] bounds is large enough that it overwhelms the gap
[01:04:29] and the number of times that that can
[01:04:31] and the number of times that that can occur is finite because the Gap is non
[01:04:34] occur is finite because the Gap is non zero and since we know from hopings
[01:04:36] zero and since we know from hopings inequality that the size of the
[01:04:39] inequality that the size of the confidence intervals are going down over
[01:04:40] confidence intervals are going down over time eventually they will get smaller
[01:04:42] time eventually they will get smaller than the
[01:04:43] than the Gap so you're going to sort of take
[01:04:46] Gap so you're going to sort of take these suboptimal actions less and less
[01:04:48] these suboptimal actions less and less often according to sort of how quickly
[01:04:50] often according to sort of how quickly your competence intervals are
[01:04:52] your competence intervals are Contracting relative to the Gap in these
[01:04:54] Contracting relative to the Gap in these cases
[01:04:59] anybody have any questions about
[01:05:05] that
[01:05:08] that okay all right so what this means is
[01:05:11] okay all right so what this means is then when we look at this we end up
[01:05:13] then when we look at this we end up getting that it achieves logarithmic ASM
[01:05:15] getting that it achieves logarithmic ASM totic regret as a function of log
[01:05:18] totic regret as a function of log T so because we had the log T
[01:05:22] T so because we had the log T here inside of the number of times we're
[01:05:24] here inside of the number of times we're taking these sub optimal
[01:05:27] taking these sub optimal actions and what you can see in these
[01:05:30] actions and what you can see in these cases is that over time so this is a a
[01:05:32] cases is that over time so this is a a previous um result where we look at sort
[01:05:34] previous um result where we look at sort of the amount of data that we have and
[01:05:35] of the amount of data that we have and sort of what is the best performance
[01:05:37] sort of what is the best performance that we have over time if you tune
[01:05:39] that we have over time if you tune Epsilon grey it can definitely get
[01:05:41] Epsilon grey it can definitely get better but also UCB ones definitely have
[01:05:44] better but also UCB ones definitely have this nice logarithmic shape if you have
[01:05:46] this nice logarithmic shape if you have the right um you know if you set the
[01:05:48] the right um you know if you set the constants
[01:05:50] constants correctly now empirically often it will
[01:05:52] correctly now empirically often it will end up being that the constants matter a
[01:05:54] end up being that the constants matter a lot and so if you set the constants
[01:05:56] lot and so if you set the constants wrong or if you set the constants off
[01:05:58] wrong or if you set the constants off into the theoretically prescribed value
[01:06:00] into the theoretically prescribed value it'll often explore for a long time so
[01:06:02] it'll often explore for a long time so you can often be more aggressive than
[01:06:03] you can often be more aggressive than that in terms of the resulting
[01:06:11] bounds so an alternative we could have
[01:06:13] bounce so an alternative we could have done to UCB is to always select the arm
[01:06:15] done to UCB is to always select the arm with the highest lower
[01:06:18] with the highest lower bound this can yield um linear regret
[01:06:30] so I think that's um a useful thing to
[01:06:32] so I think that's um a useful thing to think
[01:06:33] think about um this is optional but you
[01:06:36] about um this is optional but you couldn't do the check your understanding
[01:06:37] couldn't do the check your understanding to think about why can this lead l lead
[01:06:40] to think about why can this lead l lead to um linear negative
[01:06:42] to um linear negative regret it's helpful to think about the
[01:06:45] regret it's helpful to think about the upper confidence bound case and why that
[01:06:47] upper confidence bound case and why that one works and why this
[01:06:54] wouldn't for
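One concrete construction, sketched in Python (my own illustration; `lcb_bandit` and its reward values are hypothetical, and arm 1's single unlucky first draw of 0.0 is hard-coded for reproducibility rather than sampled): arm 1 is truly better, but once its one initial sample comes back low, selecting by highest lower bound never revisits it, so regret grows linearly.

```python
import math

def lcb_bandit(T):
    """Pessimistic selection: always pull the arm with the highest lower bound.

    Arm 0 deterministically pays 0.6. Arm 1 would pay 0.9 on average, but its
    single initialization sample is hard-coded to an unlucky 0.0.
    Returns the pull counts after T total rounds.
    """
    counts = [1, 1]      # each arm pulled once to initialize
    totals = [0.6, 0.0]  # arm 1's first draw was unlucky
    for t in range(3, T + 1):
        # lower confidence bound: empirical mean minus a Hoeffding-style bonus
        lcb = [totals[i] / counts[i]
               - math.sqrt(2 * math.log(t) / counts[i]) for i in range(2)]
        a = 0 if lcb[0] >= lcb[1] else 1
        counts[a] += 1
        totals[a] += 0.6 if a == 0 else 0.9
    return counts
```

Arm 1 is never pulled again: arm 0's lower bound always sits at least 0.6 above arm 1's, because arm 1's never-shrinking bonus is the widest one. An optimistic rule would instead keep arm 1's upper bound high until it is re-sampled, which is the "either you're correct or you learn something" point from lecture.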
[01:07:26] so in particular I guess imagine this
[01:07:28] so in particular I guess imagine this was on an exam what I would be looking
[01:07:30] was on an exam what I would be looking for in this case is for you to construct
[01:07:32] for in this case is for you to construct a multi-arm bandit case for which
[01:07:34] a multi-arm bandit case for which selecting um based on this criteria
[01:07:37] selecting um based on this criteria would give you linear regret so if you
[01:07:39] would give you linear regret so if you think back to the example I showed you
[01:07:41] think back to the example I showed you for greedy where we considered a
[01:07:42] for greedy where we considered a particular sequence of armps such that
[01:07:45] particular sequence of armps such that you would never recover and you'd get
[01:07:46] you would never recover and you'd get linear regret in that case think about
[01:07:48] linear regret in that case think about this sort of setting
[01:07:50] this sort of setting to where based on some confidence
[01:07:53] to where based on some confidence intervals if you select whichever one
[01:07:56] intervals if you select whichever one looks like it's better in terms of its
[01:07:57] looks like it's better in terms of its lower bound that you would never recover
[01:07:59] lower bound that you would never recover and select the optimal
[01:08:08] action so I had a question
[01:08:10] action so I had a question about um the slides before where we were
[01:08:14] about um the slides before where we were kind of assuming that the that condition
[01:08:18] kind of assuming that the that condition was met um then I'm assuming like the
[01:08:22] was met um then I'm assuming like the other parts came from where the
[01:08:24] other parts came from where the condition isn't met that's right yeah so
[01:08:26] condition isn't met that's right yeah so in those cases if you set the Delta
[01:08:28] in those cases if you set the Delta correctly you can say um uh so with high
[01:08:31] correctly you can say um uh so with high probability you're going to want this to
[01:08:33] probability you're going to want this to hold for all time steps and then there's
[01:08:34] hold for all time steps and then there's going to be the small amount of
[01:08:35] going to be the small amount of probability that it doesn't hold and
[01:08:37] probability that it doesn't hold and then you can argue in that case that um
[01:08:40] then you can argue in that case that um the the regret is going to be bounded
[01:08:41] the the regret is going to be bounded from those time points so you split the
[01:08:43] from those time points so you split the it's a good question you split sort of
[01:08:45] it's a good question you split sort of the expectation into the high
[01:08:46] the expectation into the high probability event and the low
[01:08:47] probability event and the low probability
[01:08:52] event so why don't we why don't you talk
[01:08:55] event so why don't we why don't you talk to neighbor and see if you got the same
[01:09:02] thing at least one person already has
[01:09:04] thing at least one person already has the right
[01:09:22] answer oh good
[01:10:51] okay I'm going to interrupt you for a
[01:10:52] okay I'm going to interrupt you for a sec for interrupt you now where would I
[01:10:55] sec for interrupt you now where would I have to put the mean and the upper Bound
[01:10:59] have to put the mean and the upper Bound for
[01:11:00] for A2 so that being pessimistic
[01:11:08] fails so according to the
[01:11:11] fails so according to the algorithm here if we select the arm with
[01:11:12] algorithm here if we select the arm with the highest lower bound we would select
[01:11:14] the highest lower bound we would select A1 because A2 has a lower lower bound
[01:11:17] A1 because A2 has a lower lower bound but where would I have to put its upper
[01:11:18] but where would I have to put its upper Bound in its mean for it truly to be for
[01:11:21] Bound in its mean for it truly to be for us to have linear regret
[01:11:31] so here I put reward on the Y
[01:11:34] so here I put reward on the Y AIS at least one person said the right
[01:11:37] AIS at least one person said the right thing in in there so I know one of you
[01:11:39] thing in in there so I know one of you guys know
[01:11:46] this
[01:11:48] this yeah should be really high yeah that's
[01:11:50] yeah should be really high yeah that's right so for
[01:11:52] right so for example could have this okay so it's you
[01:11:56] example could have this okay so it's you could be really uncertain about it its
[01:11:58] could be really uncertain about it its lower bound is lower once you pick a one
[01:12:02] lower bound is lower once you pick a one the the lower bound here e an
[01:12:05] the the lower bound here e an expectation is only going to get closer
[01:12:08] expectation is only going to get closer like the lower bound this is these are
[01:12:09] like the lower bound this is these are valid confidence intervals this lower
[01:12:10] valid confidence intervals this lower bound really is smaller than the mean of
[01:12:13] bound really is smaller than the mean of A1 which means on average when weever we
[01:12:16] A1 which means on average when weever we sample A1 this is really just going to
[01:12:18] sample A1 this is really just going to shrink okay which means we'll never pull
[01:12:22] shrink okay which means we'll never pull A2 A2 is upper confidence bound is
[01:12:24] A2 A2 is upper confidence bound is higher than A1 so under UCB we would
[01:12:26] higher than A1 so under UCB we would learn this but if you're pessimistic in
[01:12:30] learn this but if you're pessimistic in some if you think about for um upper
[01:12:32] some if you think about for um upper confidence bounds if you're optimistic
[01:12:35] confidence bounds if you're optimistic either you're correct or you learn
[01:12:36] either you're correct or you learn something the problem with being
[01:12:38] something the problem with being pessimistic is that you may not learn
[01:12:41] pessimistic is that you may not learn anything because you're not updating
[01:12:43] anything because you're not updating your other other
[01:12:44] your other other bounds okay I realized why I was being
[01:12:46] bounds okay I realized why I was being confused so let me go back and just
[01:12:48] confused so let me go back and just correct that here um okay so how did I
[01:12:50] correct that here um okay so how did I get these so let me just clarify it was
[01:12:52] get these so let me just clarify it was this step I was confusing myself okay so
[01:12:54] this step I was confusing myself okay so we have this this particular equation
[01:12:56] we have this this particular equation right that the empirical average plus
[01:12:58] right that the empirical average plus its upper confidence bound was bigger
[01:13:00] its upper confidence bound was bigger than the optimal arms empirical average
[01:13:02] than the optimal arms empirical average plus its uper bound what I did from
[01:13:04] plus its uper bound what I did from equation two is I
[01:13:07] equation two is I reminded ourselves that
[01:13:11] reminded ourselves that the empirical average is always less
[01:13:13] the empirical average is always less than equal to the True Value Plus the
[01:13:15] than equal to the True Value Plus the upper confidence bound so we substitute
[01:13:18] upper confidence bound so we substitute that in for QT to get the QA plus 2 *
[01:13:22] that in for QT to get the QA plus 2 * the bound okay so that's why this this
[01:13:25] the bound okay so that's why this this works out so you just substitute this
[01:13:28] works out so you just substitute this with this upper bound into here so then
[01:13:31] with this upper bound into here so then it gets another it gets Q of a plus this
[01:13:33] it gets another it gets Q of a plus this upper bound plus this upper bound which
[01:13:36] upper bound plus this upper bound which means this bound becomes two so that's
[01:13:38] means this bound becomes two so that's where that came
[01:13:41] from okay so this is the first algorithm
[01:13:44] from okay so this is the first algorithm we've seen which has provably sublinear
[01:13:46] we've seen which has provably sublinear regret which is really nice it's also
[01:13:47] regret which is really nice it's also really easy to implement certainly when
[01:13:49] really easy to implement certainly when you have counts but all of this stuff
[01:13:51] you have counts but all of this stuff can be extended to much more complicated
[01:13:53] can be extended to much more complicated settings um and so there's a lot of work
[01:13:55] settings um and so there's a lot of work of thinking about for function
[01:13:56] of thinking about for function approximation and RL and we'll see all
[01:13:58] approximation and RL and we'll see all of those of um ways to sort of think
[01:14:01] of those of um ways to sort of think about formalizing this optimism under
[01:14:03] about formalizing this optimism under uncertainty principle in order to make
[01:14:05] uncertainty principle in order to make decisions when we don't know um uh what
[01:14:08] decisions when we don't know um uh what the outcomes will be in order to reduce
[01:14:10] the outcomes will be in order to reduce our regret over
[01:14:12] our regret over time so what we're going to see next
[01:14:15] time so what we're going to see next time is we're going to see more fast
[01:14:16] time is we're going to see more fast learning but we're also going to think
[01:14:18] learning but we're also going to think about it from a very different
[01:14:19] about it from a very different perspective called um a like basian
[01:14:21] perspective called um a like basian Bandits where we think of it not being
[01:14:23] Bandits where we think of it not being this sort of just fixed upper and lower
[01:14:25] this sort of just fixed upper and lower rectangular confidence intervals but we
[01:14:27] rectangular confidence intervals but we think of having a prior over what the
[01:14:29] think of having a prior over what the distribution is going to be of the
[01:14:30] distribution is going to be of the rewards for each arm and then in that
[01:14:33] rewards for each arm and then in that case we can also introduce algorithms of
[01:14:35] case we can also introduce algorithms of that end up being somewhat similar to
[01:14:37] that end up being somewhat similar to optimism in certain ways um as ways to
[01:14:40] optimism in certain ways um as ways to use those prior informations to figure
[01:14:41] use those prior informations to figure out how to quickly gather data um and
[01:14:43] out how to quickly gather data um and start to make good decisions so we'll
[01:14:45] start to make good decisions so we'll see that next week thanks
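The Bayesian-bandit idea previewed here, keeping a prior over each arm's reward distribution and updating it from observed pulls, is most commonly instantiated as Thompson sampling with Beta priors over Bernoulli arms. A minimal sketch of that idea (the algorithm name, arm probabilities, and horizon are my additions for illustration, not from the lecture):

```python
import random

def thompson_sampling(true_probs, horizon, seed=0):
    """Beta-Bernoulli Thompson sampling: sample a mean from each arm's
    posterior, pull the arm with the largest sample, update its posterior."""
    rng = random.Random(seed)
    k = len(true_probs)
    alpha = [1] * k  # Beta(1, 1) uniform prior: 1 + number of successes
    beta = [1] * k   # 1 + number of failures
    pulls = [0] * k
    for _ in range(horizon):
        # One posterior sample per arm; act greedily w.r.t. the samples.
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(k)]
        a = max(range(k), key=lambda i: samples[i])
        reward = 1 if rng.random() < true_probs[a] else 0
        alpha[a] += reward
        beta[a] += 1 - reward
        pulls[a] += 1
    return pulls

# Hypothetical three-armed Bernoulli bandit; arm 2 is best.
pulls = thompson_sampling([0.3, 0.5, 0.7], horizon=2000)
```

As with the optimism-based methods above, posterior sampling concentrates pulls on the best arm over time while still occasionally exploring arms whose posteriors remain wide.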
Lecture 012
Stanford CS234 Reinforcement Learning I Exploration 2 I 2024 I Lecture 12
Source: https://www.youtube.com/watch?v=gFJNsfg_35E
---
Transcript
[00:00:05] hi everybody welcome back um we're going
[00:00:07] hi everybody welcome back um we're going to start talking uh more about statistically
[00:00:09] to start talking uh more about statistically efficient reinforcement learning today
[00:00:11] efficient reinforcement learning today but before we do that we're going to
[00:00:12] but before we do that we're going to start with a check your
[00:00:15] start with a check your understanding so this asks you to think
[00:00:17] understanding so this asks you to think back about what we were learning from
[00:00:18] back about what we were learning from multi-arm Bandits um I would probably do
[00:00:21] multi-arm Bandits um I would probably do one and six first 'cause they're kind of
[00:00:23] one and six first 'cause they're kind of warm-ups and then the rest of these just
[00:00:26] warm-ups and then the rest of these just to clarify in terms of notation I'm
[00:00:28] to clarify in terms of notation I'm using F of Delta here to be a function
[00:00:31] using F of Delta here to be a function of Delta because I was slightly loose on
[00:00:34] of Delta because I was slightly loose on exactly what the dependence is on Delta
[00:00:36] exactly what the dependence is on Delta in terms of whether you know um it's
[00:00:38] in terms of whether you know um it's like Delta over t or what we're going to
[00:00:40] like Delta over t or what we're going to choose for that function so I just
[00:00:42] choose for that function so I just wanted to be agnostic to that there and
[00:00:43] wanted to be agnostic to that there and put it as a log of a function of
[00:00:58] Delta for
[00:01:49] and as usual feel free to look back at
[00:01:51] and as usual feel free to look back at your notes from last week if you want to
[00:01:52] your notes from last week if you want to refresh your brain on the
[00:01:58] notation for
[00:02:47] all right one more minute to write down
[00:02:49] all right one more minute to write down your initial answers and then I'll ask
[00:02:50] your initial answers and then I'll ask you to turn to a neighbor and
[00:02:58] compare for
[00:03:46] all right why don't you compare answers
[00:03:47] all right why don't you compare answers with someone that's nearby you
[00:04:27] different
[00:04:47] all right great let's come back together
[00:04:48] all right great let's come back together um so I think most people converged on
[00:04:51] um so I think most people converged on the same answer for the first one which
[00:04:53] the same answer for the first one which is yes algorithms that minimize regret
[00:04:55] is yes algorithms that minimize regret do also maximize
[00:04:57] do also maximize reward o hold on pen is not
[00:05:01] reward o hold on pen is not working see if I can grab a different
[00:05:05] working see if I can grab a different one so the first one is true I can get
[00:05:08] one so the first one is true I can get some here okay so the first one is true
[00:05:11] some here okay so the first one is true if you minimize regret you also maximize
[00:05:14] if you minimize regret you also maximize reward um for the second one is that one
[00:05:17] reward um for the second one is that one true somebody want to say why it's
[00:05:21] true somebody want to say why it's true you say it's false saying the
[00:05:24] true you say it's false saying the second one is
[00:05:27] true hold on let see if I can get my
[00:05:29] true hold on let see if I can get my thing to power up my pen isn't working
[00:05:32] thing to power up my pen isn't working so the second one is
[00:05:35] so the second one is uh just double check that I I'll keep it
[00:05:38] uh just double check that I I'll keep it back onto here in terms of the answers I
[00:05:39] back onto here in terms of the answers I move things around a little bit last
[00:05:40] move things around a little bit last minute which is always dangerous um but
[00:05:43] minute which is always dangerous um but I wanted to include a couple additional
[00:05:45] I wanted to include a couple additional ones okay let's see if we can make this
[00:05:47] ones okay let's see if we can make this do the right thing okay so the second
[00:05:50] do the right thing okay so the second one should be true um this is basically
[00:05:53] one should be true um this is basically the UCB algorithm which is this is the
[00:05:55] the UCB algorithm which is this is the empirical estimate of the performance of
[00:05:57] empirical estimate of the performance of each arm so in the case where we just
[00:05:59] each arm so in the case where we just have a finite set of arms which we can
[00:06:01] have a finite set of arms which we can also think of as a finite set of actions
[00:06:03] also think of as a finite set of actions um we just look at what their average
[00:06:05] um we just look at what their average reward was n t of a was how many times
[00:06:09] reward was n t of a was how many times have we pulled action a after T time
[00:06:12] have we pulled action a after T time steps and log of f of Delta was just the
[00:06:15] steps and log of f of Delta was just the term that we had to try to express the
[00:06:17] term that we had to try to express the dependence on Delta Delta was used to
[00:06:20] dependence on Delta Delta was used to look at confidence intervals um that we
[00:06:22] look at confidence intervals um that we were using for the upper confidence
[00:06:24] were using for the upper confidence bound so this is true um the third one
[00:06:28] bound so this is true um the third one is also true
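The rule in question 2, the empirical mean of each arm plus a count-based confidence bonus, can be sketched as code. One common concrete choice for the log f(delta) term is 2 log t, which is the UCB1 bonus; the arm probabilities and horizon below are invented for illustration:

```python
import math
import random

def ucb(true_probs, horizon, seed=0):
    """UCB for Bernoulli arms: pull the arg max of
    empirical mean + sqrt(2 * log t / N_t(a))."""
    rng = random.Random(seed)
    k = len(true_probs)
    counts = [0] * k   # N_t(a): how many times arm a has been pulled
    sums = [0.0] * k   # total reward observed from arm a
    for t in range(1, horizon + 1):
        if t <= k:
            a = t - 1  # pull every arm once to initialize the estimates
        else:
            a = max(range(k), key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2 * math.log(t) / counts[i]))
        reward = 1 if rng.random() < true_probs[a] else 0
        counts[a] += 1
        sums[a] += reward
    return counts

# Hypothetical three-armed Bernoulli bandit; arm 2 is best.
counts = ucb([0.3, 0.5, 0.7], horizon=2000)
```

Consistent with question 3, every arm keeps a positive pull count (the log t in the numerator forces occasional revisits), but the best arm accumulates the bulk of the pulls.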
[00:06:30] is also true true so in general with our confidence
[00:06:33] true so in general with our confidence intervals um you will be selecting all
[00:06:36] intervals um you will be selecting all arms an infinite number of times but it
[00:06:38] arms an infinite number of times but it might be really slow later on
[00:06:41] might be really slow later on because uh let's say you have a really
[00:06:42] because uh let's say you have a really big gap between arms then um that log
[00:06:47] big gap between arms then um that log term and and you'll still you'll have a
[00:06:49] term and and you'll still you'll have a t dependence in there in general that
[00:06:51] t dependence in there in general that will continue to grow a little bit so
[00:06:52] will continue to grow a little bit so you'll sample another arm again which
[00:06:54] you'll sample another arm again which sort of helps with the fact that you
[00:06:55] sort of helps with the fact that you might have been really unlucky and
[00:06:57] might have been really unlucky and gotten a really weird estimate of the
[00:06:59] gotten a really weird estimate of the arm performance so
[00:07:01] arm performance so far okay this one was a little bit
[00:07:03] far okay this one was a little bit subtle um and I realized it could be not
[00:07:05] subtle um and I realized it could be not quite clear here whether I was asking
[00:07:07] quite clear here whether I was asking you to think about the T over Delta part
[00:07:09] you to think about the T over Delta part or the 1 over the square root of N T of a I wanted
[00:07:13] or the 1 over the square root of N T of a I wanted you to focus on the first thing so what
[00:07:16] you to focus on the first thing so what this is saying here is that instead of
[00:07:20] this is saying here is that instead of shrinking our confidence intervals by a
[00:07:22] shrinking our confidence intervals by a rate of one over the square root of N t we're
[00:07:25] rate of one over the square root of N t we're shrinking them at a rate of N t to the minus
[00:07:28] shrinking them at a rate of N t to the minus 1/4th
[00:07:29] 1/4th just rip
[00:07:39] that it's very squeaky okay all right so
[00:07:44] that it's very squeaky okay all right so let me just give so we're shrinking
[00:07:49] it so will that mean that our confidence
[00:07:53] it so will that mean that our confidence intervals are wider or narrower for the
[00:07:56] intervals are wider or narrower for the same number of counts so let's say
[00:08:01] same number of counts so let's say versus NT of a to the minus 1/2 so for
[00:08:06] versus NT of a to the minus 1/2 so for example if NT of a is equal to 100
[00:08:09] example if NT of a is equal to 100 you've pulled this arm 100 times which
[00:08:12] you've pulled this arm 100 times which of these two is going to be
[00:08:15] of these two is going to be bigger one on the left or the one on the
[00:08:20] right the one on the right that's right
[00:08:23] right the one on the right that's right so instead of it being um oh the other
[00:08:25] so instead of it being um oh the other way around so this is going to be so if
[00:08:27] way around so this is going to be so if you have 100 to the minus 1/4 versus 100 to the minus 1/2
[00:08:32] you have 100 to the minus 1/4 versus 100 to the minus 1/2 this is going to be 1/10th this is going
[00:08:34] this is going to be 1/10th this is going to be approximately 1 over
[00:08:36] to be approximately 1 over 3 there's like several different
[00:08:38] 3 there's like several different inverses here I
[00:08:40] inverses here I know what this means basically is that
[00:08:42] know what this means basically is that we're growing we're shrinking our
[00:08:43] we're growing we're shrinking our confidence intervals
[00:08:46] confidence intervals slower so another thing you might see
[00:08:49] slower so another thing you might see often as a bonus term um Deep Mind often
[00:08:52] often as a bonus term um Deep Mind often uses this in particular for some of
[00:08:54] uses this in particular for some of their
[00:08:54] their algorithms is to the minus one so that's
[00:08:57] algorithms is to the minus one so that's a faster rate so then that one would be
[00:09:01] a faster rate so then that one would be 1 over
[00:09:03] 1 over 100 and you can think of these as
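Her arithmetic for a count of 100 can be checked directly; the three exponents trade off how quickly the exploration bonus collapses as the count grows:

```python
# How fast the exploration bonus shrinks with the count n = N_t(a),
# for the three exponents discussed: -1/4 (slower), -1/2 (the
# Hoeffding-style rate), and -1 (faster, so less exploration).
n = 100
slow = n ** -0.25       # ~ 0.316, roughly 1 over 3: widest bonus
hoeffding = n ** -0.5   # ~ 0.1, i.e. 1/10th: the 1/sqrt(n) rate
fast = n ** -1.0        # ~ 0.01, i.e. 1/100: bonus collapses quickly
assert slow > hoeffding > fast
```

So for the same count, the minus-1/4 rate leaves wider confidence intervals (more exploration) than the minus-1/2 rate, and the minus-1 rate leaves the narrowest.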
[00:09:05] 100 and you can think of these as trading off different amounts of sort of
[00:09:07] trading off different amounts of sort of essentially how much exploration you're
[00:09:08] essentially how much exploration you're going to get because this is saying sort
[00:09:11] going to get because this is saying sort of how quickly are you collapsing your
[00:09:14] of how quickly are you collapsing your confidence interval as you have more
[00:09:16] confidence interval as you have more data now as somebody pointed was asking
[00:09:18] data now as somebody pointed was asking me about when I was going around they're
[00:09:20] me about when I was going around they're like well we didn't just randomly pick
[00:09:22] like well we didn't just randomly pick this we pick this because of the
[00:09:24] this we pick this because of the uncertainty bounds that we derived from
[00:09:27] uncertainty bounds that we derived from from Hoeffding so Hoeffding said if you have
[00:09:29] from Hoeffding so Hoeffding said if you have an empirical estimate of your mean how
[00:09:32] an empirical estimate of your mean how far away could that be from the true
[00:09:33] far away could that be from the true mean well you know under pretty mild
[00:09:36] mean well you know under pretty mild conditions about your variable being
[00:09:38] conditions about your variable being bounded we could get this kind of uh one
[00:09:41] bounded we could get this kind of uh one over square root n rate so someone was asking me
[00:09:44] over square root n rate so someone was asking me very reasonably was asking me like well
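The Hoeffding bound she is referencing can be written out; this is the standard statement for i.i.d. bounded rewards, with delta the failure probability from the slides:

```latex
% For i.i.d. X_1,\dots,X_n \in [0,1] with mean \mu and
% empirical mean \hat{\mu}_n = \tfrac{1}{n}\sum_{i=1}^n X_i:
P\left(\mu \ge \hat{\mu}_n + \epsilon\right) \le e^{-2 n \epsilon^2}
% Setting the right-hand side equal to \delta and solving for \epsilon:
\epsilon = \sqrt{\frac{\log(1/\delta)}{2n}}
% so the confidence width shrinks at the 1/\sqrt{n} rate she mentions;
% with n = N_t(a) this gives the UCB bonus term.
```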
[00:09:46] very reasonbly was asking me like well why would you pick say this or something
[00:09:48] why would you pick say this or something else you might pick this because you
[00:09:50] else you might pick this because you just don't want to explore as much so
[00:09:53] just don't want to explore as much so even though this holds for our Theory
[00:09:56] even though this holds for our Theory it's often somewhat conservative in
[00:09:58] it's often somewhat conservative in practice so you might just pick
[00:10:00] practice so you might just pick something like a faster exploration rate
[00:10:02] something like a faster exploration rate because empirically you want to just you
[00:10:05] because empirically you want to just you know explore less and you could think of
[00:10:08] know explore less and you could think of that as being related back to what we
[00:10:09] that as being related back to what we saw with PPO that a lot of their
[00:10:11] saw with PPO that a lot of their theoretical derivations said this is
[00:10:13] theoretical derivations said this is what your step size should be but it was
[00:10:15] what your step size should be but it was way too conservative for most realistic
[00:10:18] way too conservative for most realistic applications so they just changed it and
[00:10:20] applications so they just changed it and they introduced the clipping thing on
[00:10:22] they introduced the clipping thing on the other hand um there might be cases
[00:10:24] the other hand um there might be cases for which you might not be sure you
[00:10:27] for which you might not be sure you could get this sort of rate and in those
[00:10:29] could get this sort of rate and in those those cases or you might have other
[00:10:30] those cases or you might have other reasons to think you might need more
[00:10:32] reasons to think you might need more exploration so for example maybe things
[00:10:34] exploration so for example maybe things are
[00:10:35] are non-stationary and you think my customer
[00:10:37] non-stationary and you think my customer preferences are actually changing over
[00:10:39] preferences are actually changing over time and so I want to sort of Explore
[00:10:41] time and so I want to sort of Explore More over time than I would if I assume
[00:10:44] More over time than I would if I assume that I was
[00:10:45] that I was stationary okay and we'll talk more
[00:10:47] stationary okay and we'll talk more about stationarity in just a
[00:10:49] about stationarity in just a second so given all of that this means
[00:10:53] second so given all of that this means that this expression would actually have
[00:10:54] that this expression would actually have wider confidence intervals and probably
[00:10:56] wider confidence intervals and probably a higher upper confidence bound than our
[00:10:58] a higher upper confidence bound than our original algorithm which means that we
[00:11:00] original algorithm which means that we would still expect that over time it
[00:11:03] would still expect that over time it will learn to pull the optimal arm bless
[00:11:05] will learn to pull the optimal arm bless you more than any other arms but it
[00:11:08] you more than any other arms but it probably won't have as tight regret
[00:11:10] probably won't have as tight regret bounds because we may be exploring too
[00:11:13] bounds because we may be exploring too much now this was an interesting one um
[00:11:17] much now this was an interesting one um will this if we add this particular
[00:11:20] will this if we add this particular bonus term make the algorithm optimistic
[00:11:23] bonus term make the algorithm optimistic with respect to the empirical Rewards
[00:11:39] I think I
[00:11:49] know somebody want to say if it's going
[00:11:51] know somebody want to say if it's going to make if we add on a bonus term will
[00:11:53] to make if we add on a bonus term will it make it optimistic with respect to
[00:11:55] it make it optimistic with respect to the empirical Rewards
[00:11:59] just the empirical rewards so like
[00:12:01] just the empirical rewards so like compared to your empirical
[00:12:06] mean I'm not trying to make it a trick
[00:12:08] mean I'm not trying to make it a trick question
[00:12:10] question just yes yes exactly so if you just add
[00:12:14] just yes yes exactly so if you just add 20 to your empirical estimate it will be
[00:12:16] 20 to your empirical estimate it will be optimistic with respect to your
[00:12:17] optimistic with respect to your empirical estimate but is it guaranteed
[00:12:20] empirical estimate but is it guaranteed to be optimistic with respect to your
[00:12:22] to be optimistic with respect to your true mean
[00:12:33] so imagine if I'd said B was like
[00:12:36] so imagine if I'd said B was like 0.001 it would still make it be
[00:12:38] 0.001 it would still make it be optimistic with respect to your
[00:12:40] optimistic with respect to your empirical rewards but would it
[00:12:41] empirical rewards but would it necessarily be optimistic with respect
[00:12:43] necessarily be optimistic with respect to your true
[00:12:47] mean in general no right like so if you
[00:12:50] mean in general no right like so if you think back to the our um our Bandits
[00:12:53] think back to the our um our Bandits which just had binary rewards let's say
[00:12:55] which just had binary rewards let's say you know you have a coin that actually
[00:12:57] you know you have a coin that actually has um a 0.5 probability of getting a
[00:12:59] has um a 0.5 probability of getting a heads which we'll call a one if you flip
[00:13:02] heads which we'll call a one if you flip it once and you get a Tails your
[00:13:03] it once and you get a Tails your empirical estimate will be zero if you
[00:13:06] empirical estimate will be zero if you add a bonus of 0.1 your empirical
[00:13:08] add a bonus of 0.1 your empirical estimate will be
[00:13:10] estimate will be 0.1 the true value of the mean is still
[00:13:13] 0.1 the true value of the mean is still 0.5 so one of the key ideas from using
[00:13:16] 0.5 so one of the key ideas from using Hoeffding and explicit upper confidence
[00:13:18] Hoeffding and explicit upper confidence bounds is that in general it's not easy
[00:13:20] bounds is that in general it's not easy to figure out a simple bonus term you
[00:13:22] to figure out a simple bonus term you can add in order to make things
[00:13:25] can add in order to make things optimistic and so that's why uh you
[00:13:28] optimistic and so that's why uh you might in General want to be using these
[00:13:30] might in General want to be using these um Hoeffding or other sort of uh explicitly
[00:13:32] um Hoeffding or other sort of uh explicitly derived confidence
[00:13:35] intervals okay great and then the last
[00:13:37] intervals okay great and then the last one is true does anybody have any
[00:13:39] one is true does anybody have any questions about these
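The coin example above can be checked with a few lines of arithmetic: a constant bonus is optimistic with respect to the empirical mean but not necessarily with respect to the true mean, while a Hoeffding-style width is built to cover the true mean with high probability. A sketch (the delta value is my choice for illustration):

```python
import math

true_mean = 0.5        # fair coin: P(heads) = 0.5
empirical_mean = 0.0   # one flip, observed tails
n = 1                  # number of pulls so far

# A small constant bonus exceeds the empirical mean...
constant_bonus = 0.1
assert empirical_mean + constant_bonus > empirical_mean
# ...but can still sit below the true mean, so it is not truly optimistic.
assert empirical_mean + constant_bonus < true_mean

# A Hoeffding-style width sqrt(log(1/delta) / (2n)) scales with the
# confidence level and the count, and here it does cover the true mean.
delta = 0.05
hoeffding_width = math.sqrt(math.log(1 / delta) / (2 * n))
assert empirical_mean + hoeffding_width > true_mean
```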
[00:13:41] questions about these yet explain number three
[00:13:45] yet explain number three again
[00:13:48] again dep where the is at it's inside of
[00:13:52] dep where the is at it's inside of here so if you go back to the slides
[00:13:55] here so if you go back to the slides from last time let me see if I just have
[00:13:56] from last time let me see if I just have them up
[00:14:01] yeah we create these upper confidence
[00:14:09] bounds so in general we Define these
[00:14:12] bounds so in general we Define these upper confidence bounds we talked about
[00:14:14] upper confidence bounds we talked about how we need the the bounds to hold over
[00:14:16] how we need the bounds to hold over all time steps T and so that in fact
[00:14:20] all time steps T and so that in fact this was not a perfect rectangular um expression
[00:14:22] this was not a perfect rectangular um expression that we're going to have some sort of T
[00:14:24] that we're going to have some sort of T dependence inside of the
[00:14:25] dependence inside of the log and in general we'll have like
[00:14:28] log and in general we'll have like something like t or t^2 inside of the
[00:14:30] something like t or t^2 inside of the log and so that will introduce a
[00:14:32] log and so that will introduce a dependence on the time either the time
[00:14:35] dependence on the time either the time step so far or your total time Horizon
[00:14:37] step so far or your total time Horizon inside of your upper confidence
[00:14:41] bound any other questions about this
[00:14:47] part
[00:14:49] part okay right I'll make sure that the
[00:14:52] okay right I'll make sure that the solutions are aligned okay so last time
[00:14:55] solutions are aligned okay so last time we talked about Bayesian sorry we
[00:14:57] we talked about Bayesian sorry we talked about Bandits which were this
[00:15:00] talked about Bandits we which were this single state version of Markov decision
[00:15:02] single state version of Markov decision processes your actions didn't make any
[00:15:04] processes your actions didn't make any difference to the next state cuz you're
[00:15:06] difference to the next state cuz you're always in a single state um we talked
[00:15:08] always in a single state um we talked about how people often use the word arms
[00:15:10] about how people often use the word arms as an equivalent for actions and but
[00:15:12] as an equivalent for actions and but there we were trying to be really
[00:15:13] there we were trying to be really explicit about uncertainty over the
[00:15:15] explicit about uncertainty over the rewards and we talked about
[00:15:18] rewards and we talked about an algorithm upper confidence bounds for
[00:15:20] an algorithm upper confidence bounds for trying to be optimistic with respect to
[00:15:22] trying to be optimistic with respect to that uncertainty what today we're going
[00:15:24] that uncertainty what today we're going to focus on mostly is Bayesian bandits
[00:15:26] to focus on mostly is Bayesian bandits and we'll get there in a few minutes
[00:15:29] and we'll get there in a few minutes before we do that I think it's nice to
[00:15:30] before we do that I think it's nice to think about um I think it's exciting to
[00:15:32] think about um I think it's exciting to think about all the application areas
[00:15:34] think about all the application areas where these come up and I wanted to go
[00:15:36] where these come up and I wanted to go through this example which I think I
[00:15:38] through this example which I think I mentioned briefly in lecture one um just
[00:15:41] mentioned briefly in lecture one um just again to think about all the
[00:15:42] again to think about all the complexities which come up when we want
[00:15:44] complexities which come up when we want to try to use these in practice and
[00:15:46] to try to use these in practice and where Bandit algorithms in particular
[00:15:47] where Bandit algorithms in particular might be used so this is a really
[00:15:49] might be used so this is a really beautiful paper by Hamsa Bastani um uh
[00:15:53] beautiful paper by Hamsa Bastani um uh which was in Nature a few years ago and
[00:15:55] which was in nature a few years ago and they were trying to tackle a really
[00:15:56] they were trying to tackle a really important problem which was at the time
[00:15:59] important problem which was at the time as everything was shutting down with
[00:16:00] as everything was shutting down with covid all these countries had to decide
[00:16:03] covid all these countries had to decide on a quarantine protocol and who to test
[00:16:07] on a quarantine protocol and who to test so before you know a number of countries
[00:16:10] so before you know a number of countries basically almost entirely shut down
[00:16:11] basically almost entirely shut down travel but particularly in the beginning
[00:16:15] travel but particularly in the beginning people were letting uh and even then
[00:16:17] people were letting uh and even then often there might be you know exceptions
[00:16:18] often there might be you know exceptions so as people come into a border crossing
[00:16:21] so as people come into a border crossing um organiz or countries had to decide
[00:16:24] um organiz or countries had to decide who to test now they couldn't
[00:16:26] who to test now they couldn't necessarily test everybody because re
[00:16:28] necessarily test everybody because re resources are finite they're also having
[00:16:31] resources are finite they're also having testing facilities they're using for
[00:16:33] testing facilities they're using for testing all of their own individuals and
[00:16:34] testing all of their own individuals and tests are expensive in addition when
[00:16:37] tests are expensive in addition when someone is tested they were going to ask
[00:16:38] someone is tested they were going to ask them to quarantine um depending on where
[00:16:41] them to quarantine um depending on where you were in the world that might
[00:16:42] you were in the world that might actually have been funded by the
[00:16:44] actually have been funded by the government so they you know you have to
[00:16:45] government so they you know you have to go to a quarantine hotel which also cost
[00:16:47] go to a quarantine hotel which also cost the government money so there are a lot
[00:16:49] the government money so there are a lot of reasons in this case that your
[00:16:50] of reasons in this case that your resources are limited and you don't
[00:16:51] resources are limited and you don't necessarily want to test everyone also
[00:16:53] necessarily want to test everyone also in general it may not be necessary to
[00:16:55] in general it may not be necessary to test everybody if you're trying to
[00:16:58] test everybody if you're trying to minimize the probability of letting in
[00:17:00] minimize the probability of letting in people that have covid um in terms of
[00:17:02] people that have covid um in terms of limiting spread so if there's someone
[00:17:04] limiting spread so if there's someone that's from somewhere where there's no
[00:17:06] that's from somewhere where there's no covid then you don't necessarily need to test
[00:17:07] covid then you don't necessarily need to test them so this is the setting what happens
[00:17:10] them so this is the setting what happens is that when people were coming into
[00:17:12] is that when people were coming into Greece they would submit a form in
[00:17:14] Greece they would submit a form in advance um like you when you go to the
[00:17:17] advance um like you when you go to the airport and your before you go Etc and
[00:17:20] airport and your before you go Etc and then what they would do is they had this
[00:17:21] then what they would do is they had this um approach called Eva which tried to
[00:17:25] um approach called Eva which tried to use the prior testing results to figure
[00:17:27] use the prior testing results to figure out who to actually test when they came
[00:17:30] out who to actually test when they came so what would happen is that then when
[00:17:31] so what would happen is that then when somebody comes like the next day either
[00:17:34] somebody comes like the next day either they would say we're not going to test
[00:17:35] they would say we're not going to test you at all and then you leave the
[00:17:37] you at all and then you leave the premises or for a subset of people based
[00:17:40] premises or for a subset of people based on the form based on where they were
[00:17:41] on the form based on where they were coming and based on prior results they
[00:17:43] coming and based on prior results they had they would decide to test
[00:17:46] had they would decide to test someone then after you got that test you
[00:17:48] someone then after you got that test you would send it to a lab and it normally
[00:17:51] would send it to a lab and it normally um would take 24 to 48 hours I don't
[00:17:53] um would take 24 to 48 hours I don't remember exactly what kind of test they
[00:17:55] remember exactly what kind of test they were using there maybe it was some sort
[00:17:56] were using there maybe it was some sort of Rapid PCR I don't remember um um
[00:17:59] of Rapid PCR I don't remember um um those would go to a central data
[00:18:01] those would go to a central data database and then they would use those
[00:18:03] database and then they would use those results so someone all these people
[00:18:04] results so someone all these people would quarantine you know for 24 to 96
[00:18:07] would quarantine you know for 24 to 96 hours or so during this time period they
[00:18:08] hours or so during this time period they would get the results back if you're
[00:18:10] would get the results back if you're clear you can go and um you know proceed
[00:18:13] clear you can go and um you know proceed otherwise you need to continue to
[00:18:14] otherwise you need to continue to quarantine and then they're going to use
[00:18:16] quarantine and then they're going to use this information to go back to Eva and
[00:18:18] this information to go back to Eva and update their
[00:18:19] update their algorithm so this is really cool because
[00:18:23] algorithm so this is really cool because this is an opportunity to try to be very
[00:18:26] this is an opportunity to try to be very careful about resources but really do so
[00:18:29] careful about resources but really do so in a way that still preserves the safety
[00:18:31] in a way that still preserves the safety of the individuals in the country as
[00:18:33] of the individuals in the country as much as possible and the public health
[00:18:35] much as possible and the public health so I like that in the Bastani paper they
[00:18:37] so I like that in the Bastani paper they describe this as a non-stationary
[00:18:39] describe this as a non-stationary contextual batch Bandit problem with
[00:18:41] contextual batch Bandit problem with delayed feedback and
[00:18:42] delayed feedback and constraints okay so that's quite a
[00:18:44] constraints okay so that's quite a mouthful um but I think it's really nice
[00:18:46] mouthful um but I think it's really nice to think about sort of you know as we go
[00:18:48] to think about sort of you know as we go from the simple setting of just thinking
[00:18:50] from the simple setting of just thinking there are K arms we can think about all
[00:18:52] there are K arms we can think about all the Practical things that we might have
[00:18:54] the Practical things that we might have to deal with in this setting so here in
[00:18:56] to deal with in this setting so here in some ways the K is very small it's only
[00:18:59] some ways the K is very small it's only two either you're going to test someone
[00:19:00] two either you're going to test someone or you're not going to test them so it's
[00:19:02] or you're not going to test them so it's a very small action space which is
[00:19:05] a very small action space which is nice in this case compared to what we've
[00:19:07] nice in this case compared to what we've seen so far but we'll we'll see this
[00:19:10] seen so far but we'll we'll see this this case later we're going to have
[00:19:11] this case later we're going to have context context you can think of as just
[00:19:14] context context you can think of as just being like States so people will have a
[00:19:16] being like States so people will have a feature Vector that describes what
[00:19:18] feature Vector that describes what country they're coming from you know a
[00:19:19] country they're coming from you know a bunch of other details about them and
[00:19:22] bunch of other details about them and that gives you a state that we're going
[00:19:23] that gives you a state that we're going to use to decide whether or not to test
[00:19:25] to use to decide whether or not to test someone okay so that's why it's
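To make the "context as state" idea concrete, here is a minimal sketch of a contextual bandit decision rule. The feature encoding, the linear reward model, and the epsilon-greedy exploration are all illustrative assumptions for this sketch, not the actual system from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two arms, as in the lecture: 0 = don't test, 1 = test.
# Each traveler arrives with a context (feature) vector, e.g. an
# encoding of origin country plus a few other attributes.
N_FEATURES = 4
theta_hat = np.zeros((2, N_FEATURES))  # per-arm parameter estimates (illustrative)

def choose_arm(context, epsilon=0.1):
    """Epsilon-greedy contextual bandit: explore with probability epsilon,
    otherwise pick the arm with the higher estimated reward for this context."""
    if rng.random() < epsilon:
        return int(rng.integers(2))
    scores = theta_hat @ context       # estimated reward per arm
    return int(np.argmax(scores))

context = np.array([1.0, 0.0, 0.3, 1.0])  # made-up traveler features
arm = choose_arm(context)                 # 0 or 1
```

The key point is that the action depends on the incoming feature vector, not just on per-arm averages as in the plain K-armed setting.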
[00:19:27] someone okay so that's why it's contextual it's non-stationary because
[00:19:30] contextual it's non-stationary because covid was constantly evolving um and
[00:19:34] covid was constantly evolving um and often a lot of the information we were
[00:19:36] often a lot of the information we were getting was lagged so if you're in
[00:19:37] getting was lagged so if you're in Greece you might be able to see
[00:19:39] Greece you might be able to see information from Sweden and from China
[00:19:40] information from Sweden and from China and from the
[00:19:41] and from the US but all of that information is often
[00:19:44] US but all of that information is often likely probably at a population level
[00:19:47] likely probably at a population level those people may or may not be the same
[00:19:48] those people may or may not be the same people that are traveling to Greece
[00:19:50] people that are traveling to Greece probably in general are different um and
[00:19:54] probably in general are different um and uh because of the lag it may or may not
[00:19:56] uh because of the lag it may or may not be informative and in fact in their
[00:19:58] be informative and in fact in their paper a lot of that information
[00:19:59] paper a lot of that information was not as informative as this kind of
[00:20:01] was not as informative as this kind of real-time
[00:20:02] real-time information it's batched what I mean by
[00:20:05] information it's batched what I mean by that is that um and we'll see with this
[00:20:08] that is that um and we'll see with this more today you don't get to make a
[00:20:09] more today you don't get to make a decision after um every test or not test
[00:20:13] decision after um every test or not test you don't see the result immediately so
[00:20:15] you don't see the result immediately so what happens here is that say you know
[00:20:17] what happens here is that say you know 200 people flying on a plane you have to
[00:20:19] 200 people flying on a plane you have to decide for every single one of them
[00:20:21] decide for every single one of them whether or not you're going to test them
[00:20:23] whether or not you're going to test them and then you wait you know two days so
[00:20:27] and then you wait you know two days so it's this delayed feedback and you have
[00:20:28] it's this delayed feedback and you have to make a decision for everybody before
[00:20:30] to make a decision for everybody before you get to observe that feedback and so
[00:20:33] you get to observe that feedback and so that makes it quite tricky um and we'll
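A minimal sketch of that batched, delayed-feedback loop: actions for a whole batch are committed with the current estimates, and rewards only arrive after the batch closes. The batch size, arm count, and reward probabilities below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
K = 2                       # arms: 0 = no test, 1 = test
counts = np.zeros(K)        # pulls per arm
q_hat = np.zeros(K)         # estimated mean reward per arm
TRUE_MEANS = [0.2, 0.6]     # invented Bernoulli reward means

def run_batch(batch_size=200, epsilon=0.1):
    """Commit actions for the whole batch using the CURRENT estimates;
    rewards only arrive (and update q_hat) after the batch closes,
    mimicking 'decide for everyone on the plane, then wait two days'."""
    actions = [int(rng.integers(K)) if rng.random() < epsilon
               else int(np.argmax(q_hat)) for _ in range(batch_size)]
    # Delayed feedback: all rewards observed only now, in one go.
    for a in actions:
        r = float(rng.random() < TRUE_MEANS[a])
        counts[a] += 1
        q_hat[a] += (r - q_hat[a]) / counts[a]   # incremental mean update
    return actions

for _ in range(20):   # twenty batches, e.g. twenty planes
    run_batch()
```

Notice that nothing learned from a passenger in a batch can influence decisions for other passengers in the same batch, which is exactly what complicates per-step algorithms like UCB.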
[00:20:35] that makes it quite tricky um and we'll we'll talk more about why that might be
[00:20:37] we'll talk more about why that might be tricky for some of the upper confidence
[00:20:39] tricky for some of the upper confidence bound algorithms we've seen so far I
[00:20:41] bound algorithms we've seen so far I think this batching is really important
[00:20:43] think this batching is really important for many many application areas so if
[00:20:46] for many many application areas so if you think back to our guest lecture and
[00:20:47] you think back to our guest lecture and you think about uh direct preference
[00:20:50] you think about uh direct preference optimization this is another area where
[00:20:52] optimization this is another area where in general you're going to have be able
[00:20:54] in general you're going to have be able to get sort of a batch of data label it
[00:20:56] to get sort of a batch of data label it all and then continue um so and some of
[00:20:58] all and then continue um so and some of the work that my lab is doing and some
[00:21:00] the work that my lab is doing and some other people's work when we're thinking
[00:21:02] other people's work when we're thinking about doing adaptive data collection for
[00:21:04] about doing adaptive data collection for preference optimization we again need it
[00:21:06] preference optimization we again need it to be able to handle this kind of much
[00:21:08] to be able to handle this kind of much more realistic batch setting compared to
[00:21:10] more realistic batch setting compared to getting information after each decision
[00:21:14] getting information after each decision okay we also have con so the delayed
[00:21:17] okay we also have con so the delayed feedback is this 24 to 48 hours and the
[00:21:19] feedback is this 24 to 48 hours and the final thing is
[00:21:20] final thing is constraints so there are lots of
[00:21:22] constraints so there are lots of constraints in this setting um which
[00:21:25] constraints in this setting um which also generally changed the setting from
[00:21:27] also generally changed the setting from a lot of the ones we've thought about so
[00:21:28] a lot of the ones we've thought about so far
[00:21:29] far so one is that you might have resource
[00:21:30] so one is that you might have resource constraints um you might say at most we
[00:21:32] constraints um you might say at most we can handle let's say 100 I forget
[00:21:35] can handle let's say 100 I forget exactly what was in the paper a 100
[00:21:36] exactly what was in the paper a 100 tests a day so you're going have
[00:21:38] tests a day so you're going have constraints on that the second is
[00:21:40] constraints on that the second is politically you might have constraints
[00:21:42] politically you might have constraints too you know it might be tricky for
[00:21:43] too you know it might be tricky for Greece if they decide that they're not
[00:21:45] Greece if they decide that they're not going to let in anyone from Sweden so
[00:21:47] going to let in anyone from Sweden so there might be different quotas and
[00:21:48] there might be different quotas and there might be other reasons to say we
[00:21:51] there might be other reasons to say we have to um think about some broader
[00:21:53] have to um think about some broader types of risks and benefits in these
[00:21:55] types of risks and benefits in these cases so that's also challenging what
[00:21:58] cases so that's also challenging one way you can think about implementing
[00:22:00] one way you can think about implementing this is this could essentially change
[00:22:02] this is this could essentially change your policy class that is reasonable so
[00:22:05] your policy class that is reasonable so instead of your policy class saying you
[00:22:06] instead of your policy class saying you can make any decision for any individual
[00:22:09] can make any decision for any individual you may now have sort of a population
[00:22:11] you may now have sort of a population level constraint as
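One way to picture the resource-constraint point: with a fixed daily testing budget, the decision is over the whole batch at once rather than per person. A toy sketch, with the budget and risk scores invented for illustration:

```python
import numpy as np

def allocate_tests(estimated_risk, budget):
    """Population-level constraint: with only `budget` tests for the
    whole batch, test the travelers with the highest estimated value
    of testing, rather than deciding each person independently."""
    estimated_risk = np.asarray(estimated_risk)
    if budget >= len(estimated_risk):
        return np.ones(len(estimated_risk), dtype=bool)
    chosen = np.argsort(estimated_risk)[-budget:]  # indices of top-`budget` scores
    mask = np.zeros(len(estimated_risk), dtype=bool)
    mask[chosen] = True
    return mask

mask = allocate_tests([0.1, 0.9, 0.4, 0.7], budget=2)
# tests go to the two travelers with the highest estimated risk
```

The budget couples the decisions together: whether one traveler is tested now depends on everyone else in the batch, which is the computational challenge mentioned above.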
[00:22:12] level constraint as well this is something that my lab has
[00:22:15] well this is something that my lab has thought about some with um our partner
[00:22:17] thought about some with um our partner Sharad Goel who's uh at the Harvard Kennedy
[00:22:20] Sharad Goel who's uh at the Harvard Kennedy School and there we thought about cases
[00:22:22] school and there we thought about cases where you might have resource
[00:22:23] where you might have resource constraints and fairness constraints
[00:22:25] constraints and fairness constraints that mean that you can't just make
[00:22:26] that mean that you can't just make decisions for people individually you
[00:22:28] decisions for people individually you you need to think about overall um
[00:22:31] you need to think about overall um trade-offs in terms of your policy
[00:22:32] trade-offs in terms of your policy quality that happen at the population
[00:22:35] quality that happen at the population level the reason that's important is
[00:22:37] level the reason that's important is because it often introduces a lot of
[00:22:38] because it often introduces a lot of challenges computationally when you
[00:22:40] challenges computationally when you can't just think of each individual
[00:22:42] can't just think of each individual separately all right so we won't be able
[00:22:44] separately all right so we won't be able to cover all of the ways that they're um
[00:22:47] to cover all of the ways that they're um that you know they handle this
[00:22:48] that you know they handle this algorithmically but I encourage you to
[00:22:50] algorithmically but I encourage you to read the paper if you're interested in
[00:22:51] read the paper if you're interested in this space and I think it's a really
[00:22:53] this space and I think it's a really beautiful example of using reinforcement
[00:22:55] beautiful example of using reinforcement learning particularly multi-arm Bandits
[00:22:57] learning particularly multi-arm Bandits um to tackle this problem one of the
[00:23:00] um to tackle this problem one of the things that they had to do so this is a
[00:23:01] things that they had to do so this is a real system they really deployed it in
[00:23:02] real system they really deployed it in Greece I think when I talked to Hamsa
[00:23:04] Greece I think when I talked to Hamsa she said it came together in like a
[00:23:06] she said it came together in like a month it was a really amazing effort and
[00:23:09] month it was a really amazing effort and then one of the interesting things they
[00:23:10] then one of the interesting things they also had to do here is to understand how
[00:23:12] also had to do here is to understand how much of an impact it made because they
[00:23:15] much of an impact it made because they weren't going to do a randomized controlled
[00:23:16] weren't going to do a randomized controlled trial in COVID to understand this so
[00:23:19] trial in COVID to understand this so another interesting thing that this
[00:23:21] another interesting thing that this paper looks at is using offline methods
[00:23:24] paper looks at is using offline methods like the batch methods you've seen
[00:23:26] like the batch methods you've seen in the past to try to estimate the
[00:23:28] in the past to try to estimate the counterfactual of how much impact this
[00:23:30] counterfactual of how much impact this had so I think it's a really nice
[00:23:32] had so I think it's a really nice example of a lot of the different ideas
[00:23:34] example of a lot of the different ideas that we've been seeing in this
[00:23:37] that we've been seeing in this class all right so that's one of the
[00:23:40] class all right so that's one of the many many ways that Bandits are useful
[00:23:42] many many ways that Bandits are useful clinical trials is another one AB
[00:23:44] clinical trials is another one AB testing um you know ad placement there's
[00:23:46] testing um you know ad placement there's many many others as well but I think
[00:23:48] many many others as well but I think this is a really nice example in public
[00:23:49] this is a really nice example in public health okay so now let's continue we're
[00:23:51] health okay so now let's continue we're going to talk about specifically some of
[00:23:54] going to talk about specifically some of the algorithms that could be relevant to
[00:23:55] the algorithms that could be relevant to this and in particular Thompson sampling
[00:23:57] this and in particular Thompson sampling which is particularly relevant to this
[00:23:59] which is particularly relevant to this kind of batch
[00:24:01] kind of batch setting all right I'm going to do very
[00:24:03] setting all right I'm going to do very quickly just notation remember regret is
[00:24:06] quickly just notation remember regret is the opportunity loss per one step total
[00:24:08] the opportunity loss per one step total regret is the total opportunity loss
[00:24:10] regret is the total opportunity loss we're using Q to denote the expected
[00:24:12] we're using Q to denote the expected reward for a particular
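Spelling out that recap in symbols (with $a^*$ the optimal arm and $a_t$ the arm pulled at step $t$):

```latex
Q(a) = \mathbb{E}[r \mid a], \qquad
\Delta_t = Q(a^*) - Q(a_t) \quad \text{(per-step opportunity loss)},
\qquad
\mathrm{Regret}(T) = \sum_{t=1}^{T} \big( Q(a^*) - Q(a_t) \big).
```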
[00:24:16] reward for a particular arm so one thing I'm blanking on who
[00:24:18] arm so one thing I'm blanking on who suggested this last time forgive me but
[00:24:21] suggested this last time forgive me but someone uh came up to me and said hey
[00:24:23] someone uh came up to me and said hey couldn't we have used just like a
[00:24:24] couldn't we have used just like a smarter optimistic initialization you
[00:24:27] smarter optimistic initialization you know do we have to actually um have
[00:24:30] know do we have to actually um have these upper confidence
[00:24:31] these upper confidence bounds and I think that's a very
[00:24:33] bounds and I think that's a very reasonable suggestion H and that was a
[00:24:36] reasonable suggestion H and that was a great follow up to some of the stuff
[00:24:37] great follow up to some of the stuff we're going to talk about
[00:24:39] we're going to talk about today so one simple thing you can
[00:24:41] today so one simple thing you can imagine you could do instead of worrying
[00:24:43] imagine you could do instead of worrying about these upper confidence bounds
[00:24:44] about these upper confidence bounds which you have to update all the time is
[00:24:46] which you have to update all the time is you just initialize
[00:24:49] you just initialize your Q hat to some high value and then
[00:24:52] your Q hat to some high value and then you just update that estimate over
[00:24:54] you just update that estimate over time and when you do that you know that
[00:24:58] time and when you do that you know that eventually you're going to converge to
[00:25:00] eventually you're going to converge to the right thing like asymptotically with
[00:25:02] the right thing like asymptotically with the law of large numbers as you know as
[00:25:04] a law of large numbers as you know as you you're not changing that initialized
[00:25:06] you you're not changing that initialized value that initialized value may or may
[00:25:08] value that initialized value may or may not have been right it'll be an upper
[00:25:10] not have been right it'll be an upper bound and it'll just converge to it so
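A minimal sketch of that scheme, treating the optimistic initial value as one pseudo-sample in the running mean so the optimism erodes gradually. The arm means and the initial value of 5 are invented, with true rewards in [0, 1]:

```python
import numpy as np

rng = np.random.default_rng(2)
TRUE_MEANS = [0.3, 0.5, 0.7]        # invented Bernoulli arm means
K = len(TRUE_MEANS)

# Optimistic initialization: start every estimate well above the
# reward range [0, 1] and act purely greedily -- no confidence bounds.
q_hat = np.full(K, 5.0)
counts = np.ones(K)                  # the initial value counts as one sample

for t in range(3000):
    a = int(np.argmax(q_hat))                # greedy on current estimates
    r = float(rng.random() < TRUE_MEANS[a])  # Bernoulli reward
    counts[a] += 1
    q_hat[a] += (r - q_hat[a]) / counts[a]   # running mean slowly erodes
                                             # the initial optimism

# Rarely tried arms keep inflated values, so every arm gets explored
# early; eventually the best arm dominates the pull counts.
```

Note the flip side: the further the initial value sits above the true rewards, the more early pulls are spent just grinding every estimate down before the arms separate.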
[00:25:13] bound and it'll just converge to it so this is an interesting thing you can
[00:25:17] do the challenge with that is that in
[00:25:20] do the challenge with that is that in general you don't know how high to make
[00:25:22] general you don't know how high to make it and so if you make it really high um
[00:25:27] it and so if you make it really high um now often let me just just be clear here
[00:25:28] now often let me just just be clear here what I mean by really high often this
[00:25:30] what I mean by really high often this might be much much larger than the
[00:25:32] might be much much larger than the actual range of possible rewards so
[00:25:34] actual range of possible rewards so maybe your arm rewards can be between 0
[00:25:36] maybe your arm rewards can be between 0 one and you initialize this to 70 so
[00:25:39] one and you initialize this to 70 so sometimes you know the the
[00:25:40] sometimes you know the the initialization might be far higher than
[00:25:42] initialization might be far higher than what is actually practical um it does
[00:25:45] what is actually practical um it does encourage a lot of exploration early on
[00:25:47] encourage a lot of exploration early on which might be really valuable but in
[00:25:50] which might be really valuable but in general unless you get the value exactly
[00:25:51] general unless you get the value exactly right which you generally can't know
[00:25:54] right which you generally can't know because that's why you're trying to
[00:25:55] because that's why you're trying to learn in the first place um then you can
[00:25:58] learn in the first place um then you can still lock on to a suboptimal action and
[00:26:00] still lock on to a suboptimal action and what do I mean by lock on is that you
[00:26:02] what do I mean by lock on is that you converge to a suboptimal action and then
[00:26:04] converge to a suboptimal action and then you never try anything again which means
[00:26:07] you never try anything again which means you'd get linear
[00:26:09] you'd get linear regret the other thing that's bad is
[00:26:12] regret the other thing that's bad is that um if you initialize Q um too high
[00:26:17] that um if you initialize Q um too high then you're also just not going to
[00:26:18] then you're also just not going to benefit from you're going to be making
[00:26:20] benefit from you're going to be making bad decisions for much longer than you
[00:26:22] bad decisions for much longer than you actually need to do okay so even while
[00:26:25] actually need to do okay so even while even though you know in theory this
[00:26:27] even though you know in theory this could be a good thing to do or like
[00:26:29] could be a good thing to do or like sorry in principle you might imagine
[00:26:31] sorry in principle you might imagine this is a good thing to do um in reality
[00:26:34] this is a good thing to do um in reality it's very hard to
[00:26:35] it's very hard to set
[00:26:37] set now it's
[00:26:39] now it's also a interesting question of how you
[00:26:41] also a interesting question of how you might do this with function
[00:26:43] might do this with function approximation so if you think back to I
[00:26:46] approximation so if you think back to I know you didn't Implement deep Q
[00:26:47] know you didn't Implement deep Q learning but if you think back to deep q
[00:26:49] learning but if you think back to deep q q learning where we used a neural
[00:26:51] q learning where we used a neural network to represent the Q
[00:26:54] network to represent the Q function do you guys think it is easy to
[00:26:56] function do you guys think it is easy to initialize that so the values are
[00:27:00] initialize that so the values are optimistic I see at least shaking his
[00:27:02] optimistic I see at least shaking his head why not um it's just the network outputs a
[00:27:06] head why not um it's just the network outputs a specific value yeah it's yeah it's hard
[00:27:09] specific value yeah it's yeah it's hard right like maybe you could train it on
[00:27:10] right like maybe you could train it on fake data but then you'd have to know
[00:27:11] fake data but then you'd have to know how big the Q is yeah in general
[00:27:14] how big the Q is yeah in general this is really it's if it was a table
[00:27:16] this is really it's if it was a table it's at least easy to write down you
[00:27:18] it's at least easy to write down you know like 90 for all of those things and
[00:27:20] know like 90 for all of those things and that's what you initialize in a deep
[00:27:21] that's what you initialize in a deep neural network it's really unclear how
[00:27:24] neural network it's really unclear how you like initialize those parameters so
[00:27:26] you like initialize those parameters so that for all the states that you would
[00:27:27] that for all the states that you would reach you would have a even a good shot
[00:27:29] reach you would have a even a good shot of it being optimistic so I think that's
[00:27:32] of it being optimistic so I think that's another challenge here is um you know
[00:27:34] another challenge here is um you know and that's a challenge for a lot of the
[00:27:35] and that's a challenge for a lot of the optimism algorithms we'll see in general
[00:27:37] optimism algorithms we'll see in general is can we do it with function
[00:27:41] approximation now there's a lot of work
[00:27:42] approximation now there's a lot of work on thinking about sort of how to do
[00:27:45] on thinking about sort of how to do things with function approximation and
[00:27:47] things with function approximation and we'll get into that
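There is no standard recipe here; as a toy illustration only, one crude heuristic sometimes tried is adding a constant offset to a value network's output so every initial prediction is shifted upward. The tiny random "network" below is invented for the sketch, and nothing guarantees optimism survives once training begins:

```python
import numpy as np

rng = np.random.default_rng(3)

# A tiny stand-in for a Q-network: random features plus a linear head.
# This is NOT a claim about how deep Q-learning should be initialized;
# it only shows the constant-offset heuristic mentioned in the lead-in.
W1 = rng.normal(size=(8, 4))
w2 = rng.normal(size=8)

def q_net(state, optimism_offset=0.0):
    h = np.tanh(W1 @ state)          # hidden layer
    return w2 @ h + optimism_offset  # scalar Q estimate

states = [rng.normal(size=4) for _ in range(100)]
plain = [q_net(s) for s in states]
shifted = [q_net(s, optimism_offset=10.0) for s in states]
# The offset lifts every prediction by the same amount, but unlike the
# tabular case there is no guarantee the shifted values stay above the
# true Q values, nor that training preserves the optimism.
```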
[00:27:49] we'll get into that soon okay so if you do carefully choose
[00:27:53] soon okay so if you do carefully choose the initialization value you can get
[00:27:55] the initialization value you can get good
[00:27:56] good performance um under a new way of
[00:27:58] performance um under a new way of measuring what good performance actually
[00:28:00] measuring what good performance actually means okay so let's go back to regret so
[00:28:05] means okay so let's go back to regret so in regret we sort of just try to think
[00:28:08] in regret we sort of just try to think about how do we quantify the performance
[00:28:11] about how do we quantify the performance as we make lots of decisions so T here
[00:28:13] as we make lots of decisions so T here is the number of decisions we make and
[00:28:15] is the number of decisions we make and we're just trying to think in this case
[00:28:17] we're just trying to think in this case about how many decisions we make over
[00:28:20] about how many decisions we make over time let me see if the pen is finally
[00:28:22] time let me see if the pen is finally charged not today okay so um so we could
[00:28:26] charged not today okay so um so we could either be making lots of little mistakes
[00:28:27] either be making lots of little mistakes or frequent large ones and what you
[00:28:30] or frequent large ones and what you might imagine that we want to
[00:28:35] do is to think about a different form of
[00:28:40] do is to think about a different form of loss and
[00:28:43] loss and so or in particular another form of
[00:28:46] so or in particular another form of performance then we're going to that is
[00:28:48] performance then we're going to that is going to be PAC okay so let's draw it
[00:28:50] going to be PAC okay so let's draw it that quick
[00:28:53] that quick okay
[00:28:55] okay so I think I drew this last time but I'll
[00:28:58] so I think I drew this last time but I'll draw it again so I make this time step T and
[00:29:01] it again so I make this time step T and this is Q of a q of the actual arm that
[00:29:05] this is Q of a q of the actual arm that you pulled and this is Q of a
[00:29:10] you pulled and this is Q of a star okay so let's imagine that you have
[00:29:15] star okay so let's imagine that you have an algorithm that is pulling arms like
[00:29:17] an algorithm that is pulling arms like the
[00:29:21] following
[00:29:23] following okay all right which means that then
[00:29:25] okay all right which means that then maybe sometimes it's pulling the right
[00:29:26] maybe sometimes it's pulling the right arm hopefully
[00:29:28] arm hopefully okay so in this case sometimes the
[00:29:31] okay so in this case sometimes the algorithm is doing something that's just
[00:29:32] algorithm is doing something that's just a little bit suboptimal and sometimes it
[00:29:35] a little bit suboptimal and sometimes it is making a really big
[00:29:37] is making a really big mistake so what we can do here is we can
[00:29:40] mistake so what we can do here is we can quantify how big our mistakes are and
[00:29:44] quantify how big our mistakes are and you might have a
[00:29:47] situation where you say Optimal
[00:29:51] situation where you say Optimal Performance is really hard you know it's
[00:29:53] Performance is really hard you know it's really hard to learn um like what
[00:29:55] really hard to learn um like what everyone's perfect ad preferences
[00:29:57] everyone's perfect ad preferences are things like that maybe I'm
[00:29:59] are things like that maybe I'm going to relax my criteria I'm not going
[00:30:01] going to relax my criteria I'm not going to require Optimal Performance but I
[00:30:03] to require Optimal Performance but I want pretty good performance I want
[00:30:05] want pretty good performance I want Epsilon
[00:30:07] Epsilon optimal so
[00:30:09] optimal so what we do in this case is we
[00:30:13] what we do in this case is we count every time we make a bad decision
[00:30:16] count every time we make a bad decision meaning something that is worse than
[00:30:18] meaning something that is worse than Epsilon optimal and otherwise we think
[00:30:20] Epsilon optimal and otherwise we think of all of those as basically being an
[00:30:22] of all of those as basically being an equivalence class of
[00:30:25] optimal so that's going to be what we
[00:30:28] optimal so that's going to be what we think about when we think about PAC
[00:30:31] think about when we think about PAC okay so I'll define what that
[00:30:33] okay so I'll define what that is so a PAC algorithm and raise your
[00:30:36] is so a PAC algorithm and raise your hands if you've seen this in machine
[00:30:37] hands if you've seen this in machine learning if you've taken machine
[00:30:38] learning if you've taken machine learning you might have seen
[00:30:40] learning you might have seen PAC okay yeah so at least one or two
[00:30:42] back okay yeah so at least one or two people have so often a machine learning
[00:30:44] people have so often a machine learning particularly if it's a machine learning
[00:30:45] particularly if it's a machine learning class that includes some Theory um we'll
[00:30:48] class that includes some Theory um we'll they'll talk about PAC and probably
[00:30:50] they'll talk about PAC and probably approximately correct algorithms and
[00:30:51] approximately correct algorithms and that's where this idea comes from so it
[00:30:53] that's where this idea comes from so it was it came from the machine Learning
[00:30:54] was it came from the machine Learning Community um and then reinforcement
[00:30:56] Community um and then reinforcement learning borrowed it so the idea in a PAC
[00:30:59] learning borrowed it so the idea in a PAC algorithm is that on each time step a
[00:31:02] algorithm is that on each time step a PAC algorithm is going to choose an
[00:31:03] PAC algorithm is going to choose an action whose value is Epsilon optimal
[00:31:06] action whose value is Epsilon optimal meaning the value of the action that's
[00:31:08] meaning the value of the action that's taken is at least the value of the
[00:31:10] taken is at least the value of the optimal action minus Epsilon so that
[00:31:13] optimal action minus Epsilon so that means that we're in this
[00:31:15] means that we're in this region with high probability on all but
[00:31:18] region with high probability on all but a polynomial number of time
[00:31:21] a polynomial number of time steps so essentially it's saying that
[00:31:24] steps so essentially it's saying that the majority of the time your algorithm
[00:31:25] the majority of the time your algorithm is making good decisions good here
[00:31:27] is making good decisions good here being defined as Epsilon optimal
[00:31:30] being defined as Epsilon optimal but sometimes we'll make bad decisions
[00:31:31] but sometimes we'll make bad decisions but we're going to say with high
[00:31:33] but we're going to say with high probability the total number of bad
[00:31:34] probability the total number of bad decisions we make is not too many what
[00:31:36] decisions we make is not too many what we mean by not too many here is
[00:31:38] we mean by not too many here is something that's polynomial in your
[00:31:40] something that's polynomial in your problem parameters so that generally
[00:31:42] problem parameters so that generally means like the number of actions you
[00:31:43] means like the number of actions you have Epsilon Delta
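Written out, the condition just described is (with $K$ arms, and $\mathrm{poly}$ denoting some polynomial in the problem parameters):

```latex
\Pr\Big( \#\big\{\, t : Q(a_t) < Q(a^*) - \epsilon \,\big\}
      \;\le\; \mathrm{poly}\big(K, \tfrac{1}{\epsilon}, \tfrac{1}{\delta}\big) \Big)
\;\ge\; 1 - \delta .
```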
[00:31:47] have Epsilon Delta Etc as you might expect if Epsilon is
[00:31:51] Etc as you might expect if Epsilon is smaller generally the number of samples
[00:31:53] smaller generally the number of samples you need will go up normally something
[00:31:55] you need will go up normally something like 1 over Epsilon squared so if you care about
[00:31:58] like 1 over Epsilon squared so if you care about being more optimal you're going to need
[00:32:01] being more optimal you're going to need more data um or in other words your
[00:32:03] more data um or in other words your algorithm might make bad decisions for
[00:32:05] algorithm might make bad decisions for longer if Delta is smaller meaning
[00:32:08] longer if Delta is smaller meaning that you want this to hold with higher
[00:32:10] that you want this to hold with higher probability you'll also need more data
[00:32:13] probability you'll also need more data um and if there are a lot of actions to
[00:32:15] um and if there are a lot of actions to learn about in general you need more
[00:32:17] learn about in general you need more data so it gives us some notion of sort
[00:32:20] data so it gives us some notion of sort of the complexity of the problem to
[00:32:22] of the complexity of the problem to learn in so this is a different type of
[00:32:24] learn in so this is a different type of um uh a lot of algorithms you can get
[00:32:27] um uh a lot of algorithms you can get both PAC guarantees for and regret
[00:32:29] both PAC guarantees for and regret guarantees but it is just a different
[00:32:31] guarantees but it is just a different notion of
[00:32:33] notion of optimality
[00:32:35] optimality okay most of the PAC algorithms for
[00:32:38] okay most of the PAC algorithms for reinforcement learning are based on
[00:32:39] reinforcement learning are based on either optimism like what we've seen
[00:32:41] either optimism like what we've seen from last lecture or Thompson sampling
[00:32:43] from last lecture or Thompson sampling which we're going to see later in this
[00:32:45] which we're going to see later in this lecture and there do exist pack
[00:32:48] lecture and there do exist pack algorithms that just initialize
[00:32:50] algorithms that just initialize everything to a really high value I
[00:32:52] everything to a really high value I don't know of any practical algorithms
[00:32:54] don't know of any practical algorithms that do that like ones that people use
[00:32:55] that do that like ones that people use in practice um but there is theory in
[00:32:58] in practice um but there is theory in papers about that case so it is possible
[00:33:00] papers about that case so it is possible to
[00:33:02] to do all right and we'll we'll see more
[00:33:05] do all right and we'll we'll see more stuff about PAC shortly let me just
[00:33:06] stuff about PAC shortly let me just give an example so remember back to our
[00:33:08] give an example so remember back to our fake trying to learn how to um treat
[00:33:11] fake trying to learn how to um treat broken toes example from last time where
[00:33:13] broken toes example from last time where we had surgery and taping um like buddy
[00:33:16] we had surgery and taping um like buddy taping the toes together and nothing
[00:33:18] taping the toes together and nothing again this is not medical advice imagine
[00:33:20] again this is not medical advice imagine that this is Epsilon to
[00:33:23] that this is Epsilon to 0.05 so in this case before we thought
[00:33:26] 0.05 so in this case before we thought about this is what the optimal sequence
[00:33:28] about this is what the optimal sequence of actions you should take but of course
[00:33:30] of actions you should take but of course you don't know that because you don't
[00:33:31] you don't know that because you don't have
[00:33:32] have data if you had this sequence of actions
[00:33:35] data if you had this sequence of actions say under an optimistic algorithm this
[00:33:37] say under an optimistic algorithm this would be the regret you would get in
[00:33:39] would be the regret you would get in each case but under optim um but under
[00:33:43] each case but under optim um but under the PAC case let me see if I can type
[00:33:46] the PAC case let me see if I can type these here under the PAC case would be
[00:33:55] um okay so the important thing to notice
[00:33:58] um okay so the important thing to notice here is that because the reward of A2 is
[00:34:01] here is that because the reward of A2 is within the Epsilon bound of A1 which is
[00:34:04] within the Epsilon bound of A1 which is the optimal action this action would
[00:34:06] the optimal action this action would also be considered optimal so from the
[00:34:09] also be considered optimal so from the perspective of the PAC algorithm
[00:34:11] perspective of the PAC algorithm definition this would not be denoted as
[00:34:13] definition this would not be denoted as a
[00:34:14] a mistake the only mistakes would be when
[00:34:16] mistake the only mistakes would be when the algorithm takes
[00:34:18] the algorithm takes A3 so when we talk about this PAC
[00:34:21] A3 so when we talk about this PAC definition
[00:34:22] definition here of sort of counting up the number
[00:34:25] here of sort of counting up the number of time steps we don't make a really
[00:34:27] of time steps we don't make a really good decision the only decisions that
[00:34:30] good decision the only decisions that would count for that in this setting is
[00:34:31] would count for that in this setting is the A3
[00:34:33] the A3 decisions in contrast to that when we
[00:34:35] decisions in contrast to that when we talk about regret anything that's
[00:34:37] talk about regret anything that's suboptimal counts so you get penalized
[00:34:39] suboptimal counts so you get penalized for all the A2
[00:34:42] for all the A2 decisions I thought you were allowed to
[00:34:44] decisions I thought you were allowed to make mistakes for a polynomial number of
[00:34:47] make mistakes for a polynomial number of steps yes but I guess I yes you are
[00:34:49] steps yes but I guess I yes you are allowed and it still be PAC that's
[00:34:51] allowed and it still be PAC that's exactly right but I'm just pointing out
[00:34:53] exactly right but I'm just pointing out here that the only actions you're taking
[00:34:55] here that the only actions you're taking that count towards that polynomial is A3
[00:34:57] that count towards that polynomial is A3 here it's not A2 whereas A3 and A2 count
[00:35:01] here it's not A2 whereas A3 and A2 count towards
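The contrast just described, where only the A3 pulls count as PAC mistakes while A2's small gap still accrues regret, can be sketched in a few lines. The mean rewards below are made-up illustrative values (the slide's actual numbers aren't in the transcript), chosen so that A2 falls within Epsilon of the optimal arm A1:

```python
# Hypothetical mean rewards for the three-arm toy example; A2 is within
# epsilon = 0.05 of the optimal arm A1, while A3 is far below it.
means = {"a1": 0.95, "a2": 0.92, "a3": 0.50}
epsilon = 0.05
optimal = max(means.values())

pulls = ["a1", "a2", "a3", "a2", "a1", "a3"]  # some sequence of actions taken

# PAC-style mistakes: only actions more than epsilon below optimal count.
pac_mistakes = sum(1 for a in pulls if means[a] < optimal - epsilon)

# Regret: every gap from optimal counts, including A2's small 0.03 gap.
regret = sum(optimal - means[a] for a in pulls)

print(pac_mistakes)  # 2 -- only the two A3 pulls
print(regret)        # the A2 pulls contribute here too
```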
[00:35:02] towards your yeah does training become easier if
[00:35:05] your yeah does training become easier if I like gradually [inaudible]
[00:35:07] I like gradually [inaudible] good question so normally in these
[00:35:10] to good question so normally in these cases um you fix Epsilon in advance um
[00:35:14] cases um you fix Epsilon in advance um and it defines kind of a um it defines
[00:35:18] and it defines kind of a um it defines the number of samples you're going to
[00:35:19] the number of samples you're going to need for each of the like actions or
[00:35:21] need for each of the like actions or states and actions in the MDP case so
[00:35:24] states and actions in the MDP case so it's kind of like an exploration term
[00:35:25] it's kind of like an exploration term and you keep track of counts um there
[00:35:28] and you keep track of counts um there are algorithms um with me and my former
[00:35:31] are algorithms um with me and my former PhD student part of the work that we did
[00:35:33] PhD student part of the work that we did there was to talk about what if you want
[00:35:34] there was to talk about what if you want to have guarantees for many epsilons at
[00:35:36] to have guarantees for many epsilons at once I'm thinking more along the lines
[00:35:42] of where because we
[00:35:45] of where because we gradually converge to
[00:35:49] gradually converge to the same they can
[00:35:52] the same they can be yeah it's a little bit subtle it's a
[00:35:54] be yeah it's a little bit subtle it's a great question so in general the balance
[00:35:56] great question so in general the balance will depend something like one over
[00:35:57] will depend something like one over Epsilon squared so if your Epsilon is going
[00:36:00] Epsilon squared so if your Epsilon is going to zero that will say that you have to
[00:36:01] to zero that will say that you have to do an infinite amount of
[00:36:03] do an infinite amount of exploration um if you're interested we
[00:36:05] exploration um if you're interested we have one of our papers that thinks about
[00:36:07] have one of our papers that thinks about trying to have simultaneous bounds
[00:36:09] trying to have simultaneous bounds over lots of
[00:36:10] over lots of epsilons but in general the the basic
[00:36:13] epsilons but in general the the basic version of this you commit to an Epsilon
[00:36:14] version of this you commit to an Epsilon in
[00:36:15] in advance great
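The one-over-Epsilon-squared dependence mentioned in the answer can be made concrete with a Hoeffding-style sample-count sketch. This assumes rewards bounded in [0, 1]; the exact constants differ across algorithms, so treat this as illustrative rather than the course's formula:

```python
import math

def samples_needed(epsilon: float, delta: float) -> int:
    # Hoeffding's inequality for rewards in [0, 1]: with n samples of an arm,
    # |empirical mean - true mean| <= epsilon holds with probability >= 1 - delta
    # once n >= ln(2 / delta) / (2 * epsilon^2).
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))

# Halving epsilon roughly quadruples the samples needed per arm, and
# epsilon -> 0 sends the count to infinity, as noted in the answer.
print(samples_needed(0.10, 0.05))  # 185
print(samples_needed(0.05, 0.05))  # 738
```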
[00:36:18] advance great questions all right so going back to
[00:36:21] questions all right so going back to sort of where we are and reminding
[00:36:22] sort of where we are and reminding ourselves in terms of algorithms and
[00:36:24] ourselves in terms of algorithms and this relates to your Epsilon greedy
[00:36:25] this relates to your Epsilon greedy constant e greedy decay greedy and
[00:36:28] constant e greedy decay greedy and optimistic initialization all have the
[00:36:30] optimistic initialization all have the problem in general of having um
[00:36:32] problem in general of having um bad performance it's
[00:36:35] um bad performance it's in theory possible to have sublinear
[00:36:37] in theory possible to have sublinear regret but you often need to have
[00:36:39] regret but you often need to have stronger guarantees um stronger knowledge
[00:36:41] stronger guarantees um stronger knowledge than is known optimistic
[00:36:43] than is known optimistic initialization also can have the PAC
[00:36:45] initialization also can have the PAC guarantees that we just talked
[00:36:50] about um and I guess I'll just say to
[00:36:52] about um and I guess I'll just say to you can convert these results into so
[00:36:56] you can convert these results into so Epsilon greedy is not a PAC algorithm
[00:36:58] Epsilon greedy is not a PAC algorithm um but you can think about different
[00:37:00] um but you can think about different types of other sort of exploration
[00:37:02] types of other sort of exploration strategies and whether or not they're PAC
[00:37:04] strategies and whether or not they're PAC and we'll get back into those
[00:37:07] and we'll get back into those soon okay let's jump into Bayesian
[00:37:09] soon okay let's jump into Bayesian bandits they're a pretty elegant
[00:37:12] bandits they're a pretty elegant idea so so far we've made almost no
[00:37:16] idea so so far we've made almost no assumptions about our reward
[00:37:17] assumptions about our reward distribution so we maybe said they're
[00:37:19] distribution so we maybe said they're bounded you know it could be between
[00:37:20] bounded you know it could be between zero and one and that's basically all we
[00:37:22] zero and one and that's basically all we needed for Hoeffding um we need them to be
[00:37:24] needed for Hoeffding um we need them to be bounded we need them to be um we use the
[00:37:27] bounded we need them to be um we use the assumption where they're independent and
[00:37:28] where they're independent and identically
[00:37:30] identically distributed but we haven't made any
[00:37:32] distributed but we haven't made any other assumptions so we haven't said
[00:37:34] other assumptions so we haven't said it's Gaussian or it's a Bernoulli or something
[00:37:36] it's Gaussian or it's a Bernoulli or something else we might know and when we're being
[00:37:39] else we might know and when we're being Bayesian about this we're actually going
[00:37:40] Bayesian about this we're actually going to leverage knowledge we have about the
[00:37:42] to leverage knowledge we have about the structure of the way the rewards are
[00:37:44] structure of the way the rewards are generated and what I mean by that is
[00:37:46] generated and what I mean by that is normally some particular statistical
[00:37:48] normally some particular statistical model so it's a Gaussian model or it's a
[00:37:50] model so it's a Gaussian model or it's a Bernoulli model things like
[00:37:52] Bernoulli model things like that and the reason that that might be
[00:37:55] that and the reason that that might be helpful is that often if we're doing
[00:37:56] helpful is that often if we're doing these in a domain like public health or
[00:37:58] these in a domain like public health or others people might know lots of
[00:38:00] others people might know lots of information about um you know what the
[00:38:02] information about um you know what the reward structure is bless you and could
[00:38:05] reward structure is bless you and could we leverage that to get better
[00:38:06] we leverage that to get better algorithms and better
[00:38:08] algorithms and better performance okay so before we do this
[00:38:12] performance okay so before we do this it's probably helpful to do just a quick
[00:38:14] it's probably helpful to do just a quick refresher on Bayesian inference some of
[00:38:16] refresher on Bayesian inference some of you guys might have done a lot of this
[00:38:17] you guys might have done a lot of this some of you might have done very little
[00:38:19] some of you might have done very little we'll we'll go through sort of just a
[00:38:20] we'll we'll go through sort of just a quick quick reminder about this okay
[00:38:25] quick quick reminder about this okay because this is going to be used a lot
[00:38:26] because this is going to be used a lot for what we're going to see today okay
[00:38:29] for what we're going to see today okay so the idea is that we're going to start
[00:38:31] so the idea is that we're going to start with a prior over the unknown parameters
[00:38:34] with a prior over the unknown parameters in our particular case that's going to
[00:38:36] in our particular case that's going to be the unknown distribution over the
[00:38:38] be the unknown distribution over the rewards for each arm so it's like if we
[00:38:41] rewards for each arm so it's like if we have a coin flip or like if we think
[00:38:44] have a coin flip or like if we think about the toes example what's the
[00:38:45] about the toes example what's the probability that someone's going to heal
[00:38:47] probability that someone's going to heal if they're given surgery we don't know
[00:38:49] if they're given surgery we don't know what that parameter Theta is and so
[00:38:51] what that parameter Theta is and so we're going to have a prior over what
[00:38:53] we're going to have a prior over what that Theta could
[00:38:55] that Theta could be once we're given some observation
[00:38:58] be once we're given some observation about that parameter for example if we
[00:39:00] about that parameter for example if we observe when you do surgery that
[00:39:02] observe when you do surgery that someone was healed that is going to
[00:39:04] someone was healed that is going to change our uncertainty over the unknown
[00:39:11] parameters so let's do a particular
[00:39:13] parameters so let's do a particular example so if the reward of arm I is a
[00:39:17] example so if the reward of arm I is a probability distribution depends on the
[00:39:18] probability distribution depends on the parameter phi i we have an initial prior over
[00:39:22] parameter phi i we have an initial prior over that parameter pull an arm we observe a
[00:39:24] that parameter pull an arm we observe a reward then we can use Bayes oh
[00:39:26] reward then we can use Bayes oh that should be Bayes um Bayes' rule to update
[00:39:29] that should be Bayes um Bayes' rule to update that
[00:39:31] that okay and I think it's really helpful to
[00:39:33] okay and I think it's really helpful to visualize how the prior changes over time
[00:39:35] visualize how the prior changes over time so we'll see that in an example shortly
[00:39:37] so we'll see that in an example shortly just so you can kind of think about see
[00:39:40] just so you can kind of think about see what that might look like okay so what
[00:39:41] what that might look like okay so what we're going to have
[00:39:45] here is Bayes' rule
[00:40:00] okay all right this is our prior
[00:40:03] okay all right this is our prior probability over the parameter governing
[00:40:06] probability over the parameter governing the reward uh distribution for this
[00:40:09] the reward uh distribution for this arm this is the likelihood of observing
[00:40:11] arm this is the likelihood of observing a particular reward given a specific
[00:40:14] a particular reward given a specific parameter value and this is the
[00:40:16] parameter value and this is the probability of seeing that reward in
[00:40:18] probability of seeing that reward in general okay and when we do that this is
[00:40:21] general okay and when we do that this is Bayes' Rule and then we use it to update
[00:40:22] Bayes' Rule and then we use it to update what our new probability is over the
[00:40:25] what our new probability is over the parameter that generates that reward so
[00:40:27] parameter that generates that reward so in the case of surgery it would be
[00:40:29] in the case of surgery it would be before we had some distribution over How
[00:40:31] before we had some distribution over How likely how successful we think surgery
[00:40:34] likely how successful we think surgery is on average we give surgery to someone
[00:40:37] is on average we give surgery to someone we update it uh we observe that they are
[00:40:39] we update it uh we observe that they are healed and then that changes what we
[00:40:41] healed and then that changes what we think about the underlying
[00:40:43] think about the underlying parameters
[00:40:45] parameters so writing that out
[00:40:49] so writing that out here so this is the prior probability
[00:40:52] here so this is the prior probability this is the probability of reward given
[00:40:53] this is the probability of reward given a particular parameter this is the
[00:40:55] a particular parameter this is the probability of getting the reward in
[00:40:56] probability of getting the reward in general
[00:40:58] general and we can rewrite this by using the
[00:41:00] and we can rewrite this by using the joint distribution of the reward and the
[00:41:02] joint distribution of the reward and the parameter and then marginalize out the
[00:41:07] parameter and then marginalize out the parameter
[00:41:09] parameter okay all
[00:41:12] okay all right
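The update just described can be sketched numerically with a discrete grid over Theta. This is a hypothetical illustration (uniform prior, a single Bernoulli observation), not the slide's own derivation:

```python
# Discrete sketch of Bayes' rule: posterior(theta | r) is proportional to
# likelihood(r | theta) * prior(theta), normalized by the evidence P(r).
thetas = [i / 100 for i in range(101)]     # candidate values of theta
prior = [1.0 / len(thetas)] * len(thetas)  # uniform prior P(theta)

r = 1  # observed reward, e.g. the surgery patient healed

likelihood = [t ** r * (1 - t) ** (1 - r) for t in thetas]  # Bernoulli P(r | theta)
evidence = sum(l * p for l, p in zip(likelihood, prior))    # P(r): theta marginalized out
posterior = [l * p / evidence for l, p in zip(likelihood, prior)]

posterior_mean = sum(t * p for t, p in zip(thetas, posterior))
print(posterior_mean)  # shifts above the prior mean of 0.5 after one success
```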
[00:41:14] right okay so this is beautiful and really oh
[00:41:17] okay so this is beautiful and really oh yeah um can we go back to slide I'm just
[00:41:20] yeah um can we go back to slide I'm just kind of confused on the setup like if I
[00:41:22] kind of confused on the setup like if I imagine that phi is like um the parameter for
[00:41:26] imagine that phi is like um the parameter for like that variable um and I'm using like
[00:41:30] like that variable um and I'm using like background knowledge I have some sort of
[00:41:31] background knowledge I have some sort of prior for what this should be like what does
[00:41:33] prior for what this should be like what does it mean to have like a
[00:41:35] it mean to have like a distribution over that yeah it's a great
[00:41:37] distribution over that yeah it's a great so
[00:41:39] so it's in general it may not be obvious
[00:41:41] it's in general it may not be obvious that we can compute this so for example
[00:41:43] that we can compute this so for example um we're going to see in some cases this
[00:41:45] um we're going to see in some cases this is analytic you can analytically update
[00:41:47] is analytic you can analytically update this which is super elegant what I mean
[00:41:49] this which is super elegant what I mean in that case is like um as a
[00:41:52] in that case is like um as a simple so let's
[00:41:54] simple so let's say we'll see an example shortly but
[00:41:56] say we'll see an example shortly but phi could be um you know the
[00:41:59] phi could be um you know the probability of
[00:42:02] probability of recovery I'll do this for surgery for
[00:42:04] recovery I'll do this for surgery for surgery okay so this would be say like
[00:42:08] surgery okay so this would be say like you know 90% of the time someone's
[00:42:10] you know 90% of the time someone's recovered and let's say um or something
[00:42:13] recovered and let's say um or something like that right and this is this could
[00:42:14] like that right and this is this could be a particular prior okay so I could
[00:42:17] be a particular prior okay so I could say I think my probability that you
[00:42:20] say I think my probability that you recover mostly from the
[00:42:23] recover mostly from the surgery is 0.9 so I I'm pretty confident
[00:42:26] surgery is .9 so I I'm pretty confident that the surgery is going to be highly
[00:42:27] that the surgery is going to be highly effective on average but you know I
[00:42:31] effective on average but you know I think that there's some
[00:42:34] probability that the surgery is not so
[00:42:37] probability that the surgery is not so effective and then I would say well I
[00:42:39] effective and then I would say well I think that you know maybe with 10%
[00:42:41] think that you know maybe with 10% probability the surgery isn't as
[00:42:42] probability the surgery isn't as effective that on average people are
[00:42:45] effective that on average people are going to recover at rate 0.4 with the
[00:42:48] going to recover at rate 0.4 with the surgery and we'll see some specific
[00:42:50] surgery and we'll see some specific examples of this this is not the priors
[00:42:52] examples of this this is not the priors we're going to use but this just
[00:42:53] we're going to use but this just illustrates how you can have
[00:42:55] illustrates how you can have distributions over distributions which
[00:42:57] distributions over distributions which can get confusing pretty quickly but on
[00:42:59] can get confusing pretty quickly but on the other hand is also super elegant and
[00:43:01] the other hand is also super elegant and a place we can put in prior knowledge
[00:43:03] a place we can put in prior knowledge just like clinicians and others may have
[00:43:05] just like clinicians and others may have information that where they can actually
[00:43:07] information that where they can actually directly specify
[00:43:11] these right and so so there's many
[00:43:14] these right and so so there's many questions you might have in this case of
[00:43:16] questions you might have in this case of like where do these priors come from and
[00:43:18] like where do these priors come from and even if we have these priors how do we
[00:43:19] even if we have these priors how do we do this
[00:43:21] do this calculation so in general This is
[00:43:26] calculation so in general This is complicated so you can see here you've
[00:43:28] complicated so you can see here you've got to have like a functional form for
[00:43:30] got to have like a functional form for this this in our case was like flipping
[00:43:32] this this in our case was like flipping a coin and so if your coin has a bias of
[00:43:35] a coin and so if your coin has a bias of 0.9 you know what's the probability
[00:43:37] 0.9 you know what's the probability you'd get reward one it would be 0.9 so
[00:43:40] you'd get reward one it would be 0.9 so you have a probability distribution
[00:43:41] you have a probability distribution here a probability distribution here you
[00:43:43] here probability distribution here you have to marginalize one out over here
[00:43:45] have to marginalize one out over here and when you do all of that you get your
[00:43:46] and when you do all of that you get your new posterior which is After You observe
[00:43:50] new posterior which is After You observe something now what is your new
[00:43:52] something now what is your new distribution so you might imagine that
[00:43:53] distribution so you might imagine that now I update this maybe I see that the
[00:43:55] now I update this maybe I see that the surgery was successful and I'm like
[00:43:57] surgery was successful and I'm like maybe I can update this to be 0.95 and
[00:44:00] maybe I can update this to be 0.95 and 0.05
[00:44:01] 0.05 okay so in general this is going to be
[00:44:04] okay so in general this is going to be computationally tricky to do exactly um
[00:44:07] computationally tricky to do exactly um without additional structure there's
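The two-point prior sketched above (success rate 0.9 with probability 0.9, or 0.4 with probability 0.1) can be updated exactly after one observed success, and the result lands close to the rough 0.95 / 0.05 numbers just mentioned:

```python
# Discrete prior over the surgery arm's unknown success rate theta,
# matching the illustrative numbers from the lecture.
prior = {0.9: 0.9, 0.4: 0.1}

# Observe one successful surgery (r = 1); the Bernoulli likelihood is theta.
evidence = sum(theta * p for theta, p in prior.items())  # P(r = 1) = 0.85
posterior = {theta: theta * p / evidence for theta, p in prior.items()}

print(round(posterior[0.9], 3))  # about 0.953
print(round(posterior[0.4], 3))  # about 0.047
```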
[00:44:09] without additional structure there's lots of ways to approximate it but the
[00:44:11] lots of ways to approximate it but the really cool thing is that in some cases
[00:44:13] really cool thing is that in some cases you can do this
[00:44:15] you can do this analytically so this is this idea of
[00:44:18] analytically so this is this idea of these conjugate priors so this this
[00:44:21] these conjugate priors so this this beautiful idea of the exponential
[00:44:23] beautiful idea of the exponential families and if you have a
[00:44:27] families and if you have a representation of your prior that is
[00:44:29] representation of your prior that is conjugate with this is often called your
[00:44:30] conjugate with this is often called your likelihood
[00:44:32] likelihood function then after you do all of this
[00:44:34] function then after you do all of this updating this new thing is in the same
[00:44:37] updating this new thing is in the same statistical family as what this was in
[00:44:40] statistical family as what this was in before and we'll see some specific
[00:44:42] before and we'll see some specific examples of this in a second so the
[00:44:44] examples of this in a second so the really the high-level really beautiful idea
[00:44:46] really the high-level really beautiful idea in this case is that it's analytic when
[00:44:48] in this case is that it's analytic when you do all of this let's say this was
[00:44:51] you do all of this let's say this was initially like a gaussian this is still
[00:44:53] initially like a gaussian this is still going to be a gaussian if you use
[00:44:54] going to be a gaussian if you use conjugate
[00:44:55] conjugate priors okay
[00:44:58] priors okay so let's see how to do this for Bernoulli so
[00:45:01] so let's see how to do this for Bernoulli so for Bernoulli there is a conjugate prior which
[00:45:03] for Bernoulli there is a conjugate prior which is really cool and the conjugate prior
[00:45:05] is really cool and the conjugate prior is called a beta distribution and it's
[00:45:07] is called a beta distribution and it's going to have a really nice beautiful
[00:45:08] going to have a really nice beautiful interpretation that we'll see in just a
[00:45:10] interpretation that we'll see in just a second so the equation looks terrible um
[00:45:13] second so the equation looks terrible um the equation says the probability of a
[00:45:15] the equation says the probability of a particular Theta remember this is like
[00:45:17] particular Theta remember this is like the bias of your coin given some Alpha
[00:45:20] the bias of your coin given some Alpha and beta these are just two other
[00:45:22] and beta these are just two other parameters is Theta to the alpha - 1 times 1 -
[00:45:25] parameters is Theta to the alpha - 1 times 1 - Theta to the beta - 1 times the gamma
[00:45:28] Theta to the beta - 1 times the gamma function of alpha plus beta divided by
[00:45:30] function of alpha plus beta divided by gamma of alpha gamma of beta okay so this
[00:45:32] gamma of alpha gamma of beta okay so this looks fairly terrible but it is
[00:45:35] looks fairly terrible but it is conjugate which means that after we
[00:45:37] conjugate which means that after we observe something our new posterior is
[00:45:39] observe something our new posterior is also going to be the same but it turns
[00:45:42] also going to be the same but it turns out that it has a really simple
[00:45:44] out that it has a really simple explanation a really simple
[00:45:46] explanation a really simple intuition which is imagine you start
[00:45:48] intuition which is imagine you start with your prior being a beta Alpha
[00:45:51] with your prior being a beta Alpha Beta And Then You observe a reward
[00:45:54] Beta And Then You observe a reward that's either zero or one because your
[00:45:55] that's either zero or one because your variable is just zero one so
[00:45:59] variable is just zero one so then your new beta your posterior
[00:46:02] then your new beta your posterior is just r +
[00:46:06] is just r + Alpha comma 1 - r +
[00:46:09] Alpha comma 1 - r + Beta what does this mean if you
[00:46:11] Beta what does this mean if you observed a one then you increment your
[00:46:14] observed a one then you increment your first parameter it's like you increase
[00:46:16] first parameter it's like you increase the number of successes If You observe a
[00:46:18] the number of successes If You observe a zero you increase this number the second
[00:46:21] zero you increase this number the second number like you increase the number of
[00:46:23] number like you increase the number of failures so you can think of what the
[00:46:25] failures so you can think of what the beta is doing is essentially just
[00:46:26] beta is doing is essentially just keeping keeping track of how many heads
[00:46:29] keeping keeping track of how many heads did you get and how many Tails or how
[00:46:31] did you get and how many Tails or how many ones did you get and how many zeros
[00:46:33] many ones did you get and how many zeros it's just keeping track of those and it
[00:46:35] it's just keeping track of those and it can use those to explicitly update what
[00:46:38] can use those to explicitly update what the probability is of your Theta okay so
[00:46:41] the probability is of your Theta okay so it's really beautiful because you don't
[00:46:43] it's really beautiful because you don't um computationally that's really easy to
[00:46:45] um computationally that's really easy to keep track of just going to add one if I
[00:46:47] keep track of just going to add one if I know what you've seen and what you can
[00:46:49] know what you've seen and what you can think of this here as being is how
[00:46:51] think of this here as being is how confident are you in advance of sort of
[00:46:53] confident are you in advance of sort of how many pseudo counts did you see of
[00:46:56] how many pseudo counts did you see of success versus failure so for example if
[00:46:59] uccess versus failure so for example if I'm really confident that the surgery is
[00:47:01] I'm really confident that the surgery is going to be successful maybe I'm like
[00:47:02] going to be successful maybe I'm like yeah I'm so confident it's as if I've
[00:47:04] yeah I'm so confident it's as if I've seen 100 successful surgeries and only
[00:47:07] seen 100 successful surgeries and only two failures okay but if I'm really
[00:47:10] two failures okay but if I'm really uncertain what I would do is I'd say
[00:47:11] uncertain what I would do is I'd say well I'm going to treat like one
[00:47:13] well I'm going to treat like one successful one failure I really don't
[00:47:14] successful one failure I really don't know and we'll see what this looks like
[00:47:16] know and we'll see what this looks like in just a sec
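The counting interpretation described above can be written directly; a minimal sketch of the Beta-Bernoulli conjugate update:

```python
def update_beta(alpha: float, beta: float, r: int) -> tuple:
    # Posterior is Beta(alpha + r, beta + 1 - r): a reward of 1 increments the
    # pseudo-count of successes, a reward of 0 the pseudo-count of failures.
    return alpha + r, beta + 1 - r

# Start maximally uncertain: one pseudo-success and one pseudo-failure.
alpha, beta = 1.0, 1.0
for r in [1, 1, 0, 1]:  # three observed successes, one failure
    alpha, beta = update_beta(alpha, beta, r)

print(alpha, beta)             # 4.0 2.0 -- prior counts plus observed ones and zeros
print(alpha / (alpha + beta))  # posterior mean of theta
```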
[00:47:20] okay excuse me so um now when we have
[00:47:25] okay excuse me so um now when we have this we're going to this is basically
[00:47:26] this we're going to this is basically giving us a distribution over the
[00:47:27] giving us a distribution over the reward parameters and we can use this to
[00:47:29] reward parameters and we can use this to actually make
[00:47:30] actually make decisions all
[00:47:32] decisions all right so there's a couple different ways
[00:47:35] right so there's a couple different ways to do this and one of the ways to do
[00:47:37] to do this and one of the ways to do this is by getting a confidence interval
[00:47:39] this is by getting a confidence interval similar to what we've seen before but
[00:47:41] similar to what we've seen before but the other thing is called probability
[00:47:42] the other thing is called probability matching or Thompson
[00:47:44] matching or Thompson sampling and let's go through Thompson
[00:47:46] sampling and let's go through Thompson sampling now and see an example
[00:47:50] okay all right
[00:47:54] okay all right so in probability matching we're going
[00:47:57] so in probability matching we're going to assume we're in the Bayesian bandit case
[00:47:59] to assume we're in the Bayesian bandit case and what probability matching does is says okay
[00:48:02] and what probability matching does is says okay the way we might want to explore is by
[00:48:04] the way we might want to explore is by sampling actions according to the
[00:48:05] sampling actions according to the probability that they're optimal given
[00:48:07] probability that they're optimal given everything I've seen so far okay so what
[00:48:10] everything I've seen so far okay so what it says is I given some history which is
[00:48:13] it says is I given some history which is like the past things I've tried and
[00:48:15] like the past things I've tried and whether I've gotten ones or zeros for
[00:48:16] whether I've gotten ones or zeros for them I want to select a new action based
[00:48:20] them I want to select a new action based on the probability that its true mean is
[00:48:23] on the probability that its true mean is higher than the mean of all the other
[00:48:24] higher than the mean of all the other arms
[00:48:28] and I'm not going to tell you yet that
[00:48:29] and I'm not going to tell you yet that that's you know formally a good thing to
[00:48:31] that's you know formally a good thing to do in terms of regret but you might
[00:48:32] do in terms of regret but you might imagine that's a reasonable thing to do
[00:48:34] imagine that's a reasonable thing to do it sort of says like oh well if I think
[00:48:35] it sort of says like oh well if I think that arm is likely to be the optimal one
[00:48:37] that arm is likely to be the optimal one with 60% probability I'll try that with
[00:48:39] with 60% probability I'll try that with 60% probability and then if I think
[00:48:41] 60% probability and then if I think there's another arm that might be
[00:48:43] there's another arm that might be optimal I'll try that with you know 30%
[00:48:46] optimal I'll try that with you know 30% probability now in general it's not
[00:48:48] probability now in general it's not clear how you would compute this it
[00:48:50] clear how you would compute this it seems like kind of an interesting idea
[00:48:51] seems like kind of an interesting idea it's not clear how you can compute it
[00:48:53] it's not clear how you can compute it but it turns out there's a really simple
[00:48:55] but it turns out there's a really simple algorithm to compute this
[00:48:58] algorithm to compute this so this is called Thompson sampling and
[00:49:01] so this is called Thompson sampling and I think it came it was first invented by
[00:49:02] I think it came it was first invented by Thompson maybe in 1919 maybe
[00:49:06] Thompson maybe in 1919 maybe 1920
[00:49:09] 1920 so it was around forever I mean
[00:49:12] 1919 so it was around forever I mean it's been around for like 100 years but
[00:49:15] it's been around for like 100 years but at least from the machine learning
[00:49:16] at least from the machine learning perspective I think it was forgotten
[00:49:17] perspective I think it was forgotten about for like the first I don't know 90
[00:49:20] about for like the first I don't know 90 of those it really came back into
[00:49:21] of those it really came back into prominence about 2010 2011 when some
[00:49:25] prominence about 2010 2011 when some people discovered that it actually had
[00:49:26] people discovered that it actually had some really nice empirical properties so
[00:49:29] some really nice empirical properties so unlike hting which has been used for a
[00:49:30] unlike hting which has been used for a long time how does Topson sampling work
[00:49:33] We're going to have a prior over each arm. Then, on each iteration, we sample a reward distribution from the posterior (we'll see an example of exactly what I mean by that), we compute the action-value function given that sample, we take the arm that is maximal given those Q-values, we observe our reward, and we update our posterior. And then we do this many, many times. Again, this will all seem much more concrete when we do an example. It's going to turn out that this exactly implements probability matching; let's come back to that in a second, and first do a specific example, because I think that'll make it a lot more concrete.
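The loop just described — prior, sample, argmax, observe, update — can be sketched in a few lines for Bernoulli arms with Beta priors. This is a minimal illustration, not code from the course, and the true arm probabilities below are made up:

```python
import random

random.seed(42)

true_probs = [0.6, 0.4, 0.2]            # unknown to the algorithm
posts = [[1, 1] for _ in true_probs]    # Beta(1, 1) prior per arm: [alpha, beta]

for t in range(2000):
    # 1. Sample a plausible reward parameter for each arm from its posterior.
    thetas = [random.betavariate(a, b) for a, b in posts]
    # 2. Pull the arm whose sampled parameter is largest (the argmax step).
    arm = max(range(len(thetas)), key=thetas.__getitem__)
    # 3. Observe a Bernoulli reward and update that arm's posterior counts.
    reward = 1 if random.random() < true_probs[arm] else 0
    posts[arm][0] += reward          # alpha counts successes
    posts[arm][1] += 1 - reward      # beta counts failures

# Posterior mean for each arm after 2000 pulls.
means = [a / (a + b) for a, b in posts]
print(means)
```

The best arm's posterior mean should end up near its true value of 0.6, while the worse arms' means stay pulled toward the prior because the posterior stops selecting them.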
[00:50:13] All right, so let's go back to our broken toe example. We're going to place a prior over each arm's parameter, and I'm going to choose a Beta(1, 1). What does a Beta(1, 1) look like? It looks like the following; I'm sorry, my pen isn't working today, otherwise that would have been helpful, but I'll draw it up here.

[00:50:39] Okay, the horizontal axis is theta, running from zero to one. We know that for a Bernoulli variable the value of theta has to be somewhere between zero and one, because you can either always get one, always get zero, or something in between. The vertical axis is the probability of theta; this is my prior. A Beta(1, 1) is a uniform distribution. What it says is: I have no idea what theta is. It could be 0, it could be 1, it could be 0.5, it could be 0.7, it could be 0.9. It says we are totally agnostic. This is often called an uninformative prior: I have no idea what my probability of recovery is for surgery, etc. But this is what that looks like.
[00:51:34] Okay, so this is our prior, and now what we're going to do is actually sample a Bernoulli parameter from the prior of each of the three arms. What does that mean? It means I'm going to sample something for surgery, sample something for buddy taping, and sample something for do nothing. All of them have this same prior for now, so for the first one it's as if I'm just sampling from a uniform distribution between zero and one; it could be anything in that range. Let's say, for example, that I happen to sample 0.3 for surgery; that's a totally reasonable thing to sample given this uniform distribution between 0 and 1. Then for buddy taping let's say I sample 0.5, again a totally reasonable draw given this distribution, and for do nothing I'm going to sample 0.6. So those are the distributions I have as my prior over the parameters, and this is one particular set of parameters I could sample given that prior.
[00:52:57] Now, what Thompson sampling says I should do is select the action that is maximal given the parameters I've sampled. So under these three parameters, if you want to maximize the probability that someone will recover from their broken toe, should I do surgery, buddy taping, or nothing? Which one has the highest chance? In this case: do nothing, right? In our case it's pretty simple once you see the thetas, because theta is exactly equal to the expected reward. So what this says is that, in this case, you should do nothing.

[00:53:55] All right, now we're going to observe the patient's outcome. Here we're going to assume that doing nothing is actually not so effective, so we observe a zero, and what we do next is update the posterior over doing nothing given that observation.
[00:54:13] Now, the other two arms' priors haven't changed, because we haven't gotten any observations about surgery or buddy taping; the only thing we've observed is doing nothing. So, as I said before: we have our alpha and beta parameters, and our prior in particular was Beta(1, 1). When I pull this arm and get a reward of zero, the first parameter can be thought of as the number of successes and the second as the number of failures, so the posterior becomes Beta(1, 2). And that is going to look different. I'll draw it: this one is the Beta(1, 2), and this one is the Beta(1, 1).
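As a sketch, the conjugate update described here is a one-liner (the helper name is mine, not from the lecture):

```python
def beta_update(alpha, beta, reward):
    """Conjugate update of a Beta(alpha, beta) belief about a Bernoulli
    parameter: a reward of 1 increments the success count alpha, and a
    reward of 0 increments the failure count beta."""
    return alpha + reward, beta + (1 - reward)

# Pulling "do nothing" under the Beta(1, 1) prior and observing a failure:
print(beta_update(1, 1, 0))  # (1, 2)
```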
[00:55:20] Does somebody want to tell me intuitively why it makes sense that this looks like this for doing nothing? Where does this put its weight in terms of parameters? "Since we received a zero..." Yeah: we should think that the parameter value is likely lower, and so we've shifted our probability mass. For the arms we don't know about, we're still totally agnostic about whether they're effective, but for the thing we just tried, do nothing, we got a zero, so it's more likely that the actual theta is lower, because a lower theta will in general generate more zeros. So we've changed our distribution.

[00:56:04] Let's see what that looks like here. This is our new posterior; again, remember this is conjugate, so this is our new Beta, and we haven't changed it for the other two.
[00:56:15] Now here's the next important thing. That's what the Beta(1, 2) looks like, and now we do our next step of Thompson sampling: we have to resample. We now have two distributions that are Beta(1, 1) and one that is Beta(1, 2), and we throw away all of our old sampled parameters from last time; they were just samples. We have an updated distribution for one of the arms and the old ones for the other two. So in this case we might resample and get, say, 0.7 for surgery, 0.5 for buddy taping, and 0.3 for do nothing. It's more likely now that we would sample a lower theta for do nothing, because its Beta puts more weight on the lower part. So under these samples we're going to pull arm one, surgery, because it has the highest sampled probability of success.

[00:57:23] If that pull is a success, surgery's posterior becomes a Beta(2, 1), because, again, we increment the first parameter with the number of successes and the second with the number of failures. As you might expect, it's exactly symmetric to the Beta(1, 2); it looks like this (I'm not being perfectly precise about the intersection point).
[00:57:54] Okay. Now we again throw away all the old parameters we've sampled so far, the 0.7, the 0.5, and the 0.3, and we resample. This time let's imagine we sample something like 0.7, 0.65, and 0.1, and we again observe that the outcome from surgery is successful. This is what a Beta(3, 1) looks like: it stops looking like a straight line and starts having curvature.

[00:58:22] I really like these graphs, because I feel they give you a much better intuitive sense of how the information you get translates into your posterior over what theta likely is. As you see more successes it tends to get weighted toward one side; as you see more failures it goes the other way. And, as you might expect, though we're not seeing it right here, you can get cases where it starts to concentrate in the middle, or somewhere in between; it just depends on what observations you're actually getting.
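One way to see this sharpening numerically is through the standard mean and variance formulas for the Beta distribution; the sequence of posteriors below mirrors the surgery arm's run of successes:

```python
def beta_mean(a, b):
    # E[theta] for theta ~ Beta(a, b)
    return a / (a + b)

def beta_var(a, b):
    # Var[theta] for theta ~ Beta(a, b)
    return a * b / ((a + b) ** 2 * (a + b + 1))

# The surgery arm's posterior after 0, 1, 2, 3 consecutive successes:
for a, b in [(1, 1), (2, 1), (3, 1), (4, 1)]:
    print((a, b), round(beta_mean(a, b), 3), round(beta_var(a, b), 4))
```

Each success pushes the mean toward one and shrinks the variance, which is exactly the curving and concentrating visible in the plots.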
[00:59:00] Okay, so that's how Thompson sampling works. Say we keep going: we get something even more peaked, then another observation, and you can see it just continues to sharpen. Yeah? "Could you use Thompson sampling with random variables that have many more parameters, as opposed to just Bernoullis?" Yes, absolutely. In one of the examples we'll see later today, it's used for advertising with a large number of features; you can extend all of these methods to the function approximation case. Good question.
[00:59:37] Okay. Obviously this is a small example, but what we saw in the particular run I just did is that we quickly started to converge to arm one. And notice that so far we've never actually pulled a2. We had some probability of pulling a2, because if we had sampled a really high value for it we would have pulled it, but we haven't done that yet, and a1, which actually does tend to have a higher probability of good outcomes in this case, is the one getting pulled. So this is quite different from the optimism methods: with optimism we had to pull each arm at least once just to initialize our confidence bounds. That's not the case here. We already have a prior over the values of all the arms, and we can immediately start using it to make decisions.
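That point — that with an informative prior Thompson sampling need not try every arm even once — can be checked with a quick Monte Carlo. The priors here are invented for illustration: one arm starts with a pessimistic Beta(1, 10) (as if we had already seen about ten failures), the other with the uninformative Beta(1, 1):

```python
import random

random.seed(0)

# Probability that the pessimistic-prior arm wins the very first argmax,
# estimated by simulating just the sampling step of Thompson sampling.
trials = 10000
wins = sum(random.betavariate(1, 10) > random.betavariate(1, 1)
           for _ in range(trials))
print(wins / trials)  # analytically E[Beta(1,10)] = 1/11, so about 0.09
```

So before any data arrives, the arm the prior already distrusts is rarely selected, whereas an optimism-based method would have pulled it at least once regardless.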
[01:00:29] All right, so what is Thompson sampling doing when we make these pulls, and what sort of results do we have in this case?

[01:00:42] Actually, let me step back, because I wanted to get to the example and I went through this part a little fast; let me go back to how the probability matching works. What we can see is that what Thompson sampling is actually doing is, at each time point, trying to select actions according to this probability, the posterior probability that each action is optimal. It will often end up being optimistic in the face of uncertainty, because we're doing an argmax with respect to our sampled estimates, but not always. As you might imagine, uncertain actions have a higher probability of being the max: if you're really sure one parameter is 0.5, and there's another parameter you have very little information about, say a Beta(1, 1), then you're more likely to accidentally sample a much higher value for that uncertain parameter.
[01:01:40] So the elegant thing here, which is also really useful for the theory, is the following. Probability matching, that first line, means sampling actions according to the posterior probability that they're optimal. Thompson sampling doesn't do that explicitly: it just samples a reward parameter for each of the different arms and then picks the one that's the argmax. And it's really elegant that this is in fact the same thing as doing probability matching. The key idea is that, as you compute these quantities with respect to the data you have so far, the probability that Thompson sampling picks an arm is exactly equal to the true posterior probability that that arm is optimal given all the data you've seen so far.

[01:02:42] If we have time at the end, maybe I'll go through the proof briefly for the Bayesian regret case, but there's also a really nice explanation of this in Tor Lattimore and Csaba Szepesvári's book; it's a quite mathematical treatment, but it gives you some really nice background. Let's go back to here. Okay.
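This equivalence can be checked numerically. Suppose (a made-up posterior state in the spirit of the broken-toe example) surgery is at Beta(2, 1), buddy taping at Beta(1, 1), and do nothing at Beta(1, 2). For these particular Betas the posterior probability that surgery's parameter is the largest can be worked out in closed form to be exactly 0.6, and the frequency with which sample-then-argmax picks surgery should match it:

```python
import random

random.seed(1)

posteriors = [(2, 1), (1, 1), (1, 2)]   # surgery, buddy taping, do nothing

def thompson_pick():
    # Sample one parameter per arm, return the argmax arm.
    thetas = [random.betavariate(a, b) for a, b in posteriors]
    return max(range(len(thetas)), key=thetas.__getitem__)

n = 50000
freq = [0, 0, 0]
for _ in range(n):
    freq[thompson_pick()] += 1
print([f / n for f in freq])  # first entry should be close to 0.6
```

The closed form comes from integrating surgery's density against the other arms' CDFs: the integral of 2x · x · (2x − x²) over [0, 1] is 3/5.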
[01:03:07] Let's first just talk about how we evaluate performance. With frequentist regret, which is what we saw last time, we assume a particular unknown set of parameters: our arms really do have some fixed parameter values, we just don't know what they are, and our regret is always evaluated with respect to the optimal arm given that fixed set of parameters. Bayesian regret instead assumes there's a prior over the parameters, so when we talk about regret we're actually taking an expectation with respect to that prior. It still looks exactly like the frequentist regret, but now we have this outer expectation over theta.
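In symbols, writing theta for the arm parameters, mu_a(theta) for the mean reward of arm a, and mu*(theta) for the optimal mean reward (my notation, not necessarily the slide's), the two notions of regret are:

```latex
% Frequentist regret: theta is fixed but unknown
\text{Regret}(T) = \mathbb{E}\left[\sum_{t=1}^{T}\bigl(\mu^{*}(\theta) - \mu_{a_t}(\theta)\bigr)\right]

% Bayesian regret: additionally average over the prior p(theta)
\text{BayesRegret}(T) = \mathbb{E}_{\theta \sim p(\theta)}\left[\,\mathbb{E}\left[\sum_{t=1}^{T}\bigl(\mu^{*}(\theta) - \mu_{a_t}(\theta)\bigr)\right]\right]
```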
[01:03:55] One of the key ideas here, in terms of how one might prove things in this case, goes back to how we proved some results about regret. We didn't do the full proof, I just gave some sketches, but one of the key ideas in the proof of frequentist regret for upper confidence bounds is that we construct upper confidence bounds U_t that we expect to be higher than the true value of each arm with high probability, and we use that to bound how many times we pull suboptimal arms. It turns out you can prove Bayesian regret bounds under a pretty similar decomposition: you can think about computing an upper confidence bound and the likelihood that it will hold. We might come back to that later today.
[01:04:43] But first I want to get into extending these ideas to richer settings. Before we do that, I just want to highlight that if you try to get standard bounds, like the ones we saw last time, for standard Thompson sampling, and by that I mean the type of Thompson sampling I just showed you, then, to my last check, they don't actually match the best bounds for upper-confidence-bound and other frequentist algorithms. However, empirically Thompson sampling can often be a really effective algorithm. And I'll just highlight that in general you can't compare directly between Bayesian regret bounds and frequentist ones, because one of them is taken with respect to this prior over the parameters.
[01:05:23] parameters okay so let's look at that for a particular domain and why um
[01:05:27] for a particular domain and why um Thompson sample might be particularly
[01:05:28] Thompson sample might be particularly helpful for a lot of real world cases so
[01:05:31] helpful for a lot of real world cases so this is a really nice paper by um Olivia
[01:05:33] this is a really nice paper by um Olivia Chappelle and leh Hong Lee which sort of
[01:05:36] Chappelle and leh Hong Lee which sort of re um initiated a huge amount of
[01:05:38] re um initiated a huge amount of interest in Thompson sampl a little over
[01:05:40] interest in Thompson sampl a little over a decade ago so I think they were both
[01:05:43] a decade ago so I think they were both at Yahoo at the time if I remember right
[01:05:45] at Yahoo at the time if I remember right um they were thinking about a contextual
[01:05:47] um they were thinking about a contextual Bandit case so they were thinking about
[01:05:49] Bandit case so they were thinking about making you know news article
[01:05:50] making you know news article recommendations Etc and so there you
[01:05:53] recommendations Etc and so there you would have a context like you'd have a
[01:05:55] would have a context like you'd have a bunch of features about an individual
[01:05:56] bunch of features about an individual and also often you would have a bunch of
[01:05:58] and also often you would have a bunch of features about the arms so like
[01:06:00] features about the arms so like explaining like maybe news articles and
[01:06:01] explaining like maybe news articles and all the features or you know ads and
[01:06:03] all the features or you know ads and stuff like
[01:06:04] stuff like that but we're still going to assume
[01:06:07] that but we're still going to assume the context is sampled i.i.d. at each step
[01:06:09] the context is sampled i.i.d. at each step so you know if I give a particular ad at
[01:06:13] so you know if I give a particular ad at this time point it doesn't impact what's
[01:06:14] this time point it doesn't impact what's going to happen next okay so there's it's
[01:06:17] going to happen next okay so there's it's still a bandit there's no like
[01:06:18] still a bandit there's no like sequential dependencies there arms are
[01:06:21] sequential dependencies there arms are articles reward is uh binary either you
[01:06:24] articles reward is uh binary either you click on it or you don't
[01:06:27] click on it or you don't in this case you can model it using
[01:06:29] in this case you can model it using like logistic regression because you
[01:06:31] like logistic regression because you have this binary output so what are we
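A minimal runnable sketch of the setup being described, with two simplifications that are my own: the context features are ignored (plain Beta-Bernoulli Thompson sampling per arm, rather than the logistic regression used in the paper), and the click-through rates below are made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
true_ctr = np.array([0.04, 0.05, 0.03])  # hypothetical click-through rates

# Beta(1, 1) prior per arm: alpha counts clicks + 1, beta counts non-clicks + 1.
alpha = np.ones(3)
beta = np.ones(3)

for t in range(5000):
    theta = rng.beta(alpha, beta)        # one posterior sample per arm
    arm = int(np.argmax(theta))          # play the arm whose sample is best
    click = rng.random() < true_ctr[arm]
    alpha[arm] += click                  # conjugate Beta posterior update
    beta[arm] += 1 - click

print(alpha + beta - 2)                  # pull counts per arm
```

With more pulls, the pull counts concentrate on the arm with the highest true rate.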
[01:06:33] have this binary output so what are we seeing here so this is CTR which means
[01:06:35] seeing here so this is CTR which means it's a clickthrough rate it's normalized
[01:06:36] it's a clickthrough rate it's normalized because they're not going to tell us
[01:06:37] because they're not going to tell us exactly what they get on their real
[01:06:39] exactly what they get on their real world data um the important thing to
[01:06:41] world data um the important thing to look at here is the x-axis which is
[01:06:43] look at here is the x-axis which is delay so in many cases just like what we
[01:06:46] delay so in many cases just like what we saw for the public health setting
[01:06:48] saw for the public health setting there'll be some form of delay even for
[01:06:49] there'll be some form of delay even for online you know customer cases so Amazon
[01:06:52] online you know customer cases so Amazon will show you something and they don't
[01:06:53] will show you something and they don't find out for a little bit whether or
[01:06:54] find out for a little bit whether or not you're clicking on it or whether you
[01:06:56] not you're clicking on it or whether you bought the thing and so what they're
[01:06:58] bought the thing and so what they're comparing their algorithms with here is
[01:07:01] comparing their algorithms with here is the following so TS is Thompson
[01:07:03] the following so TS is Thompson sampling OTS is optimistic Thompson
[01:07:06] sampling OTS is optimistic Thompson sampling you can try to add in a little
[01:07:08] sampling you can try to add in a little bit of optimism in these UCB is upper
[01:07:10] bit of optimism in these UCB is upper confidence bound EG I think is Epsilon
[01:07:13] confidence bound EG I think is Epsilon greedy and exploit is you just do
[01:07:16] greedy and exploit is you just do whatever the mean looks like so far
[01:07:19] whatever the mean looks like so far these are all hyperparameters um as
[01:07:22] these are all hyperparameters um as often is the case the hyperparameters
[01:07:23] often is the case the hyperparameters matter so it's useful to look at
[01:07:25] matter so it's useful to look at these I think the really interesting
[01:07:27] these I think the really interesting thing to look at in this case is to look
[01:07:29] thing to look at in this case is to look across time so this is um
[01:07:34] across time so this is um sort of the shortest delay and this is
[01:07:36] sort of the shortest delay and this is the longest delay and you can see for
[01:07:38] the longest delay and you can see for the blue algorithm it varies very little
[01:07:42] the blue algorithm it varies very little in terms of its performance even if
[01:07:43] in terms of its performance even if things are delayed a lot but if you look
[01:07:47] things are delayed a lot but if you look at say
[01:07:50] at say UCB its performance tends to drop a lot
[01:07:53] UCB its performance tends to drop a lot as you have longer delay
[01:07:56] as you have longer delay okay and so that is one of the reasons
[01:07:59] okay and so that is one of the reasons why you might want to do Thompson
[01:08:00] why you might want to do Thompson sampling in these cases okay so let's
[01:08:04] sampling in these cases okay so let's think more about that okay and do a
[01:08:06] think more about that okay and do a check our understanding so let's think
[01:08:08] check our understanding so let's think about an online news website with lots
[01:08:10] about an online news website with lots of people logging in every second often
[01:08:12] of people logging in every second often someone will come online before you've
[01:08:13] someone will come online before you've seen the outcome of the previous person
[01:08:15] seen the outcome of the previous person it asks you to select all of the things
[01:08:17] it asks you to select all of the things that you think are true as we think
[01:08:19] that you think are true as we think about Thompson sampling versus upper
[01:08:20] about Thompson sampling versus upper confidence
[01:08:24] bounds
[01:09:40] all right why don't you compare your answer
[01:09:41] all right why don't you compare your answer to someone
[01:09:54] nearby
[01:10:49] okay so let's come back together um so
[01:10:53] okay so let's come back together um so as we were just discussing point out
[01:10:55] as we were just discussing point out that like Thompson sampling could cause
[01:10:57] that like Thompson sampling could cause much worse performance this one is true
[01:11:00] much worse performance this one is true um than optimism if the prior is very
[01:11:02] um than optimism if the prior is very misleading so this is
[01:11:22] true okay because for
[01:11:25] example maybe surgery is really
[01:11:27] example maybe surgery is really effective and someone starts off and
[01:11:28] effective and someone starts off and thinks surgery isn't effective at all
[01:11:30] thinks surgery isn't effective at all and so you put a lot of probability mass
[01:11:32] and so you put a lot of probability mass you could have a really sharp prior on it
[01:11:34] you could have a really sharp prior on it over here then it could take a long time
[01:11:37] over here then it could take a long time essentially for your data to overwhelm
[01:11:38] essentially for your data to overwhelm your prior so this one can be a problem
[01:11:42] your prior so this one can be a problem um the first one is also
[01:11:43] um the first one is also true so if you think back to the
[01:11:47] true so if you think back to the algorithms that we saw last time for
[01:11:49] algorithms that we saw last time for optimism there is no Randomness in there
[01:11:52] optimism there is no Randomness in there unless you have a tie so if your upper
[01:11:56] unless you have a tie so if your upper confidence Bound for arm one is higher
[01:11:57] confidence Bound for arm one is higher than the upper confidence Bound for arm
[01:11:58] than the upper confidence Bound for arm two and arm three you're just going to
[01:12:00] two and arm three you're just going to take arm one and that's fine but if you
[01:12:04] take arm one and that's fine but if you have a delay that means you can't update
[01:12:05] have a delay that means you can't update those upper confidence bounds so if like
[01:12:07] those upper confidence bounds so if like the next customer comes you're like oh
[01:12:09] the next customer comes you're like oh or like the next patient comes you're
[01:12:10] or like the next patient comes you're like I still think surgery is best I
[01:12:12] like I still think surgery is best I still think surgery is best and you're
[01:12:13] still think surgery is best and you're not going to try anything
[01:12:15] not going to try anything different whereas Thompson sampling just
[01:12:17] different whereas Thompson sampling just has this you know prior or
[01:12:19] has this you know prior or posterior and so if I have someone come
[01:12:23] posterior and so if I have someone come I can just sample from all of my um my
[01:12:25] I can just sample from all of my um my priors and then if another person comes
[01:12:28] priors and then if another person comes I'll again sample from my priors and so
[01:12:30] I'll again sample from my priors and so because of this sort of distribution
[01:12:32] because of this sort of distribution over parameters unless it's collapsed to
[01:12:35] over parameters unless it's collapsed to a Delta function in which case you know
[01:12:37] a Delta function in which case you know what the right thing is to do anyway
[01:12:38] what the right thing is to do anyway you'll get natural exploration so that's
[01:12:41] you'll get natural exploration so that's one of the really big benefits of
[01:12:42] one of the really big benefits of Thompson sampling is that even if you
[01:12:44] Thompson sampling is that even if you don't get new data you naturally sort of
[01:12:46] don't get new data you naturally sort of will try out different things um and
[01:12:48] will try out different things um and that can be really helpful it is true
[01:12:51] that can be really helpful it is true that optimism algorithms generally are
[01:12:52] that optimism algorithms generally are better than Thompson sampling in terms
[01:12:54] better than Thompson sampling in terms of their regret bounds that may or may
[01:12:55] of their regret bounds that may or may not translate to empirical benefits um
[01:12:58] not translate to empirical benefits um but they don't actually necessarily have
[01:13:00] but they don't actually necessarily have stronger bounds for this setting this is
[01:13:03] stronger bounds for this setting this is false and that's because all the bounds
[01:13:05] false and that's because all the bounds we've been talking about so far don't
[01:13:08] we've been talking about so far don't think about that batch setting they're
[01:13:10] think about that batch setting they're being derived for the case where you get
[01:13:11] being derived for the case where you get information you update your confidence
[01:13:13] information you update your confidence bounds you increment you know you
[01:13:16] bounds you increment you know you continue so this sort of highlights some of
[01:13:19] continue so this sort of highlights some of the particular benefits and the
[01:13:20] the particular benefits and the potential weaknesses of Thompson
[01:13:22] potential weaknesses of Thompson sampling if your prior is reasonable and
[01:13:24] sampling if your prior is reasonable and you've got this sort of delay or
[01:13:26] you've got this sort of delay or batch setting it can be very helpful um
[01:13:29] batch setting it can be very helpful um if your prior is really bad that it can
[01:13:31] if your prior is really bad that it can take a long time to sort of get past
[01:13:37] that so before we end today um I think
[01:13:41] that so before we end today um I think an interesting question to consider
[01:13:43] an interesting question to consider is whether or not Thompson sampling is
[01:13:45] is whether or not Thompson sampling is optimal now we can get nice regret
[01:13:48] optimal now we can get nice regret bounds for this case I know I didn't
[01:13:49] bounds for this case I know I didn't have a chance to go through that
[01:13:50] have a chance to go through that particular proof today but um but it's
[01:13:53] particular proof today but um but it's not optimal in general so it would be
[01:13:56] not optimal in general so it would be really cool if we could get something
[01:13:58] really cool if we could get something that was basically perfect you might
[01:14:00] that was basically perfect you might imagine that if you have a prior and you
[01:14:02] imagine that if you have a prior and you have a known Horizon you could compute a
[01:14:04] have a known Horizon you could compute a decision policy that would maximize your
[01:14:06] decision policy that would maximize your expected rewards given that prior and
[01:14:08] expected rewards given that prior and the Horizon so I haven't at least in
[01:14:11] the Horizon so I haven't at least in this class taught you all the tools you
[01:14:12] this class taught you all the tools you need to do that but at a high level you
[01:14:14] need to do that but at a high level you could think of it as kind of a Markov
[01:14:16] could think of it as kind of a Markov decision process over parameters which
[01:14:18] decision process over parameters which is kind of wild so if any of you guys
[01:14:20] is kind of wild so if any of you guys have taken Mykel Kochenderfer's class
[01:14:22] have taken Mykel Kochenderfer's class actually who's taken Mykel's class
[01:14:24] actually who's taken Mykel's class anybody here okay so you can think of
[01:14:26] anybody here okay so you can think of like a pom DP your state is your
[01:14:28] like a pom DP your state is your parameters your actions are pulling
[01:14:30] parameters your actions are pulling things and then your belief state is
[01:14:32] things and then your belief state is your new probability of your parameters
[01:14:35] your new probability of your parameters so it's really elegant in theory you can
[01:14:38] so it's really elegant in theory you can compute something that will exactly
[01:14:39] compute something that will exactly maximize your expected reward by doing
[01:14:42] maximize your expected reward by doing POMDP planning the problem also for
[01:14:44] POMDP planning the problem also for those of you who've taken my class is that
[01:14:46] those of you who've taken my class is that often POMDP planning is really
[01:14:47] often POMDP planning is really intractable so it's often not clear that
[01:14:49] intractable so it's often not clear that we could do this in a computationally
[01:14:51] we could do this in a computationally reasonable
[01:14:52] reasonable way in general one of the
[01:14:55] way in general one of the challenges here is that um if you wanted
[01:14:57] challenges here is that um if you wanted to do this it would have a decision
[01:14:59] to do this it would have a decision policy that's a function of the history
[01:15:01] policy that's a function of the history which means all the prior actions you've
[01:15:03] which means all the prior actions you've taken and all of the rewards you've
[01:15:05] taken and all of the rewards you've observed and that's going to increase
[01:15:06] observed and that's going to increase exponentially with the number of
[01:15:08] exponentially with the number of decisions youve
[01:15:12] made so there's this idea of an index
[01:15:15] made so there's this idea of an index policy and an index policy says we don't
[01:15:17] policy and an index policy says we don't want have to think about this
[01:15:18] want have to think about this exponential sort of history or state and
[01:15:21] exponential sort of history or state and a decision an index policy is one a
[01:15:23] a decision an index policy is one a decision policy that computes a real
[01:15:25] decision policy that computes a real valued index for each arm and it plays
[01:15:27] valued index for each arm and it plays the arm with the highest index using
[01:15:29] the arm with the highest index using statistics only from that arm and the
[01:15:31] statistics only from that arm and the horizon so that means I don't have to
[01:15:33] Horizon so that means I don't have to pay attention to the sort of
[01:15:34] pay attention to the sort of combinatorial exponential thing um I can
[01:15:37] combinatorial exponential thing um I can just say for this particular arm maybe
[01:15:40] just say for this particular arm maybe you know what were my rewards that I've
[01:15:42] you know what were my rewards that I've observed so far um and then I can use
[01:15:44] observed so far um and then I can use that information to make decisions so
[01:15:46] that information to make decisions so for example a greedy algorithm which
[01:15:48] for example a greedy algorithm which just relies on your empirical average of
[01:15:51] just relies on your empirical average of the performance for each arm is an index
[01:15:53] the performance for each arm is an index policy so is an upper confidence bound
[01:15:55] policy so is an upper confidence bound algorithm because it just relies on the
[01:15:57] algorithm because it just relies on the upper confidence Bound for the rewards
[01:15:58] upper confidence Bound for the rewards you've seen for each arm so there are a
[01:16:00] you've seen for each arm so there are a lot of index
[01:16:02] lot of index policies surprisingly there is an index
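The two examples just mentioned can both be written as index computations. This sketch uses invented counts, and the UCB bonus is the standard sqrt(2 log t / n) form from the earlier lectures:

```python
import numpy as np

# An index policy computes a real-valued index per arm from that arm's own
# statistics only, then plays the argmax.

def greedy_index(succ, pulls):
    """Empirical mean reward per arm (the greedy index)."""
    return succ / np.maximum(pulls, 1)

def ucb_index(succ, pulls, t):
    """Empirical mean plus an exploration bonus; unpulled arms get infinity."""
    mean = succ / np.maximum(pulls, 1)
    bonus = np.sqrt(2 * np.log(max(t, 1)) / np.maximum(pulls, 1))
    return np.where(pulls == 0, np.inf, mean + bonus)

succ = np.array([3.0, 5.0, 0.0])   # invented success counts
pulls = np.array([10, 12, 0])      # invented pull counts

print(int(np.argmax(greedy_index(succ, pulls))))    # arm 1: highest mean
print(int(np.argmax(ucb_index(succ, pulls, 22))))   # arm 2: unpulled, infinite index
```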
[01:16:05] policies surprisingly there is an index policy that is optimal so Gittins proved
[01:16:09] that there exists an optimal policy for
[01:16:11] that there exists an optimal policy for maximizing the expected discounted
[01:16:13] maximizing the expected discounted reward in a Bayesian multi-armed bandit that
[01:16:16] reward in a Bayesian multi-armed bandit that you can compute that only depends on
[01:16:17] you can compute that only depends on these statistics separately for each arm
[01:16:20] these statistics separately for each arm so that's really cool it means that it
[01:16:22] so that's really cool it means that it is possible in some settings to actually
[01:16:25] is possible in some settings to actually exactly optimize your expected sum of
[01:16:27] exactly optimize your expected sum of discounted rewards for these type of
[01:16:29] discounted rewards for these type of Basi and
[01:16:32] bandits Thompson sampling will not do
[01:16:34] bandits Thompson sampling will not do this in general so Thompson sampling is
[01:16:36] this in general so Thompson sampling is generally not equal to what the Gittins index
[01:16:38] generally not equal to what the Gittins index would be but it can still be a very good
[01:16:40] would be but it can still be a very good thing to
[01:16:41] thing to do all right so just to summarize some
[01:16:44] do all right so just to summarize some of the things that um are useful to
[01:16:46] of the things that um are useful to understand uh from this part of the
[01:16:48] understand uh from this part of the section and next time we're going to
[01:16:49] section and next time we're going to start talking about these sort of ideas
[01:16:51] start talking about these sort of ideas for sequential decision processes like
[01:16:53] for sequential decision processes like Markov decision processes you should be
[01:16:55] Markov decision processes you should be able to define regret and PAC you
[01:16:56] able to define regret and PAC you should be able to prove or know why sort
[01:16:58] should be able to prove or know why sort of the UCB bandit algorithm has
[01:17:00] of the UCB bandit algorithm has sublinear regret like up to the proof
[01:17:02] sublinear regret like up to the proof sketch we did in class um you should be
[01:17:04] sketch we did in class um you should be able to give an example of why egedy and
[01:17:06] able to give an example of why egedy and greedy and pess is can result in linear
[01:17:09] greedy and pess is can result in linear regret um I don't think you need to be
[01:17:11] regret um I don't think you need to be able to do this for G
[01:17:15] rewards but you should be able to do
[01:17:17] rewards but you should be able to do Thompson sampling for the case that
[01:17:19] Thompson sampling for the case that we've just talked about at least like
[01:17:21] we've just talked about at least like sort of in pseudo codeland so if someone
[01:17:23] sort of in pseudo codeland so if someone said Like You observe another count what
[01:17:24] said like you observe another count what would your Beta parameters be um and then
[01:17:27] would your Beta parameters be um and then also that you should be able to
[01:17:28] also that you should be able to understand the UCB bandit algorithm as
[01:17:30] understand the UCB bandit algorithm as we've covered in class so we've been
[01:17:33] we've covered in class so we've been building up all of these things to think
[01:17:34] building up all of these things to think about now how we can do exploration and
[01:17:36] about now how we can do exploration and data efficient learning um for
[01:17:38] data efficient learning um for sequential processes so next time we'll
[01:17:40] sequential processes so next time we'll think about how to do this in a
[01:17:41] think about how to do this in a standard decision process as well as
[01:17:43] standard decision process as well as thinking about what do we do when we're
[01:17:45] thinking about what do we do when we're in really large State spaces or really
[01:17:46] in really large State spaces or really large action spaces and how do we lift
[01:17:48] large action spaces and how do we lift this all up for function approximation
[01:17:49] this all up for function approximation I'll see you on Wednesday
Lecture 013
Stanford CS234 Reinforcement Learning I Exploration 3 I 2024 I Lecture 13
Source: https://www.youtube.com/watch?v=pc7oayCSZmQ
---
Transcript
[00:00:05] all right it should be up in a second
[00:00:06] all right it should be up in a second you can go ahead and get started on your
[00:00:08] you can go ahead and get started on your refresh your understanding
[00:00:53] all right why don't you turn to somebody near
[00:00:54] all right why don't you turn to somebody near you and see if you got the same answers
[00:00:56] you and see if you got the same answers for this this question asks you to think
[00:00:58] for this this question asks you to think back to what we were learning about last
[00:01:00] back to what we were learning about last time in terms of posteriors over what
[00:01:02] time in terms of posteriors over what the parameters might look like for a
[00:01:04] the parameters might look like for a multiarm bandit so check with someone
[00:01:06] multiarm bandit so check with someone nearby you and see whether you got the
[00:01:08] nearby you and see whether you got the same idea
[00:02:15] okay we're going to go ahead and come
[00:02:16] okay we're going to go ahead and come back together um and go through the
[00:02:18] back together um and go through the answers for these all right so the first
[00:02:21] answers for these all right so the first one of these is true okay because in
[00:02:24] one of these is true okay because in this case for beta one two where we're
[00:02:27] this case for beta one two where we're weighted more towards um an arm that more
[00:02:30] weighted more towards um an arm that more frequently gets um something like a zero
[00:02:33] frequently gets um something like a zero instead of a one then we're more likely
[00:02:35] instead of a one then we're more likely to sample these three parameters um the
[00:02:38] to sample these three parameters um the second one is also true because if you
[00:02:40] second one is also true because if you have a flat uniform over all of the
[00:02:43] have a flat uniform over all of the different arm parameters you're equally
[00:02:45] different arm parameters you're equally likely to sample any of them um and the
[00:02:47] likely to sample any of them um and the third is false because when you have a
[00:02:50] third is false because when you have a one one prior that's a uniform somewhere
[00:02:52] one one prior that's a uniform somewhere between zero and one so
[00:02:54] between zero and one so the true arm parameter could be a
[00:02:57] so the true arm parameter could be a zero or it could be a one or anything in
[00:02:58] zero or it could be a one or anything in between okay and then the second one
[00:03:01] between okay and then the second one asks you to think about sort of using
[00:03:03] asks you to think about sort of using Thompson sampling to sample arms um and
[00:03:08] Thompson sampling to sample arms um and so the first one is true so given these
[00:03:10] so the first one is true so given these priors you could sample either of those
[00:03:11] priors you could sample either of those values for the underlying uh parameter
[00:03:15] values for the underlying uh parameter for your beri
[00:03:16] for your beri variable the second one is false so
[00:03:19] variable the second one is false so let's assume that the real parameter
[00:03:21] let's assume that the real parameter here is 0.4 and 0.6 what this um question is
[00:03:25] here is 0.4 and 0.6 what this um question is asking you to reflect about is that
[00:03:26] asking you to reflect about is that Thompson sampling is not guaranteed to
[00:03:29] Thompson sampling is not guaranteed to give you an upper confidence bound so up
[00:03:31] give you an upper confidence bound so up it may instead just select a parameter
[00:03:35] it may instead just select a parameter that is consistent with your prior um
[00:03:38] that is consistent with your prior um and for these particular sample thas it
[00:03:40] and for these particular sample thas it will happen to tr choose the true
[00:03:42] will happen to tr choose the true optimal arm for this
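One Thompson sampling round for a two-arm example like this can be sketched as follows; the Beta parameters below are invented, and the point is that the sampled thetas are draws from the posterior, not upper confidence bounds:

```python
import numpy as np

rng = np.random.default_rng(3)

# Invented posteriors for two Bernoulli arms (true means, say, 0.4 and 0.6).
alpha = np.array([2.0, 3.0])
beta = np.array([3.0, 2.0])

theta = rng.beta(alpha, beta)   # one posterior draw per arm; can fall below the true mean
arm = int(np.argmax(theta))     # play the arm whose draw is largest
print(theta, arm)
```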
[00:03:45] round awesome so let's
[00:03:48] round awesome so let's see if we can make the AV work I want
[00:03:52] see if we can make the AV work I want to briefly show you this nice
[00:03:57] to briefly show you this nice example
[00:04:00] example see if we can make
[00:04:04] this all
[00:04:06] this all right so I wanted to show you this nice
[00:04:09] right so I wanted to show you this nice example of somewhere where you might
[00:04:11] example of somewhere where you might want exploration so we've talked about
[00:04:13] want exploration so we've talked about exploration so far in terms of cases
[00:04:15] exploration so far in terms of cases like you're um an online Advertiser and
[00:04:18] like you're um an online Advertiser and you'd like to figure out um which ads
[00:04:20] you'd like to figure out um which ads work for people um it comes up in
[00:04:22] work for people um it comes up in healthcare I want to show you an example
[00:04:24] healthcare I want to show you an example of an application which we thought about
[00:04:26] of an application which we thought about in collaboration with Chelsea Finn and a
[00:04:27] in collaboration with Chelsea Finn and a bunch of wonderful Stanford grad students
[00:04:29] bunch of wonderful Stanford grad students recently so this is the breakout
[00:04:31] recently so this is the breakout assignment um this is an assignment
[00:04:34] assignment um this is an assignment that's used in Stanford where students
[00:04:36] that's used in Stanford where students actually encode the game so in this case
[00:04:39] actually encode the game so in this case you know you're not the compared to the
[00:04:41] you know you're not the compared to the settings we're at where we sort of
[00:04:42] settings we're at where we sort of assume you have the environment and then
[00:04:44] assume you have the environment and then you're learning an agent to act in that
[00:04:46] you're learning an agent to act in that environment here students are actually
[00:04:48] environment here students are actually creating the code to make the breakout
[00:04:49] creating the code to make the breakout assignment so to say make the game
[00:04:52] assignment so to say make the game environment and this generally is
[00:04:54] environment and this generally is often really engaging and fun for
[00:04:56] often really engaging and fun for students particularly when they're
[00:04:57] students particularly when they're learning to program many people like
[00:04:59] learning to program many people like computer games so this is a really great
[00:05:00] computer games so this is a really great opportunity for people to learn and
[00:05:02] opportunity for people to learn and could be really engaging um and a lot of
[00:05:04] could be really engaging um and a lot of different people use these type of
[00:05:06] different people use these type of assignments so it's not just at Stanford
[00:05:08] assignments so it's not just at Stanford but many many other places including
[00:05:10] but many many other places including code.org and others use this assignment
[00:05:12] code.org and others use this assignment to try to teach students about
[00:05:14] to try to teach students about programming here's the
[00:05:16] programming here's the problem um even though it teaches lots
[00:05:18] problem um even though it teaches lots of different sort of introductory
[00:05:19] of different sort of introductory computer science Concepts there's a
[00:05:22] computer science Concepts there's a challenge which is if you want people to
[00:05:24] challenge which is if you want people to learn from writing this assignment you
[00:05:25] learn from writing this assignment you need to be able to provide them with
[00:05:27] need to be able to provide them with feedback and providing them with
[00:05:29] feedback and providing them with feedback involves grading the
[00:05:30] feedback involves grading the assignments. So in this case we normally have a rubric of different things that the program is expected to do correctly: is the paddle drawn correctly, is the ball drawn correctly, when you bounce, does that respect the desired transition dynamics, things like that. And so normally, just like when you get feedback from Gradescope, someone has to go through and play the game to do this.
[00:05:55] Okay, and so that is really expensive, because that means people have to figure out: okay, when the ball bounces here, does it actually do the right thing? And then you have to do that for each of the different rubric items. So there, for example, it kind of jittered, right? It didn't do the right thing. What you can think of here is that essentially a grader is designing a mental policy in their head for how to play this game in order to uncover whether the game dynamics are correct. And the way we normally do that right now is that each individual grader figures out how to do that and then plays the game.
[00:06:32] So this means that it would take probably around 8 minutes per submission. You can't just do a unit test in the normal way, because you're actually trying to figure out how the game behaves in different scenarios, where it might take multiple actions to even get to that scenario.
[00:06:48] So if you think about doing 8 minutes per submission, and you have something like 300 submissions in a course, and there are actually many, many more people than that who have played this game on code.org, that's an enormous amount of grading time, an enormous amount of human resource time. And that means that some of the people who offer this challenge to students don't grade it at all; it's just too expensive. So students get the opportunity to try this exciting assignment, but they don't get any feedback, which can really hinder their learning process. There are a lot of things that make this hard: it's a stochastic setting, there are no simple heuristics, and there are multiple errors.
[00:07:31] So my student Chris Piech (another professor here) and I started thinking about this problem a few years ago, asking: couldn't we design a reinforcement learning agent to play this game? And what we want is for this reinforcement learning agent to explore the parts of the domain so that we can try to uncover how the submissions are doing and whether the game is coded correctly.
[00:07:51] whether the game is coded correctly so we did this work um and
[00:07:55] correctly so we did this work um and then we did an initial approach to this
[00:07:58] then we did an initial approach to this and then Evan Who is the author of this
[00:08:00] and then Evan Who is the author of this um Extended this to try to think about
[00:08:02] um Extended this to try to think about rubric items so the idea is that instead
[00:08:04] rubric items so the idea is that instead of having humans graded what we're going
[00:08:07] of having humans graded what we're going to do is we're going to replace humans
[00:08:08] to do is we're going to replace humans by a machine learning agent and in
[00:08:11] by a machine learning agent and in particular what Evan did is he built on
[00:08:14] particular what Evan did is he built on r i and Chris's initial work and said
[00:08:16] r i and Chris's initial work and said let's actually phrase this as um think
[00:08:18] let's actually phrase this as um think about how we can use meta reinforcement
[00:08:21] about how we can use meta reinforcement learning and exploration the reason this
[00:08:24] learning and exploration the reason this is an exploration problem is because you
[00:08:26] is an exploration problem is because you want to learn um an RL policy here so
[00:08:30] want to learn um an RL policy here so that in a new environment you can click
[00:08:32] that in a new environment you can click quickly use behaviors to grade the
[00:08:35] quickly use behaviors to grade the assignment and so that's where efficient
[00:08:37] assignment and so that's where efficient exploration is coming in so you don't
[00:08:38] exploration is coming in so you don't want the staff to take you know 20
[00:08:40] want the staff to take you know 20 minutes to try to grade it you want to
[00:08:42] minutes to try to grade it you want to as quickly as possible whether for an
[00:08:44] as quickly as possible whether for an agent or a human figure out what
[00:08:46] agent or a human figure out what strategy you should use to play the game
[00:08:48] strategy you should use to play the game in order to correctly grade whether this
[00:08:50] in order to correctly grade whether this is a a good
[00:08:52] And so Evan had a really nice NeurIPS paper building on our NeurIPS paper; these are both machine learning contributions. There's a series of papers: a first paper on how we could do this at all, a second paper by Evan looking at doing explicit, really fast exploration, and then we joined forces to think about how we could do fast exploration in this setting. More recently, we published a paper showing that this could actually significantly reduce grading time and actually improve accuracy when you combine it with humans. So I just give this as an example to illustrate another exciting exploration case: if you can design agents that can learn quickly and can quickly explore an environment, it can end up being really helpful. We'll come back to DREAM and this idea of meta exploration later in the course, and later today.
[00:09:43] So today will be our final lecture on fast and efficient reinforcement learning, and then next week we're going to start talking about Monte Carlo tree search, which was one of the key ideas behind AlphaGo. I hope that homework 3 is going well; feel free to reach out to us with any questions, and feel free to come to our office hours.
[00:10:05] free to come to our office hours all right so just to remind
[00:10:06] hours all right so just to remind ourselves about where we are we've been
[00:10:09] ourselves about where we are we've been thinking about different Frameworks for
[00:10:11] thinking about different Frameworks for evaluating the correctness of algorithms
[00:10:13] evaluating the correctness of algorithms and how efficient they are at learning
[00:10:15] and how efficient they are at learning and making decisions and so far we have
[00:10:18] and making decisions and so far we have focused mostly on Bandits which is this
[00:10:21] focused mostly on Bandits which is this much simpler version of reinforcement
[00:10:23] much simpler version of reinforcement learning where the decisions we make
[00:10:25] learning where the decisions we make don't affect the next
[00:10:26] don't affect the next state so we saw how to do that for for
[00:10:29] state so we saw how to do that for for um for both standard Bandits and basian
[00:10:33] um for both standard Bandits and basian bandits and today we're going to start
[00:10:34] bandits and today we're going to start to lift all those ideas up to markup
[00:10:36] to lift all those ideas up to markup decision
[00:10:38] decision processes so we did that by Design
[00:10:41] processes so we did that by Design because a lot of the ideas around sort
[00:10:43] because a lot of the ideas around sort of optimism under uncertainty or
[00:10:45] of optimism under uncertainty or posterior sampling or Thompson sampling
[00:10:47] posterior sampling or Thompson sampling can be lifted up to the tabular Markoff
[00:10:50] can be lifted up to the tabular Markoff decision process case and then all of
[00:10:52] decision process case and then all of these ideas also then can be
[00:10:53] these ideas also then can be extrapolated up with some care to the
[00:10:56] extrapolated up with some care to the function approximation setting so that's
[00:10:58] function approximation setting so that's where we're going to go
[00:10:59] where we're going to go today um the main approach is for trying
[00:11:03] today um the main approach is for trying to act efficiently in markof decision
[00:11:06] to act efficiently in markof decision processes and we're going to start by
[00:11:08] processes and we're going to start by focusing on the tabular setting will
[00:11:10] focusing on the tabular setting will again be optimism under uncertainty and
[00:11:13] again be optimism under uncertainty and probability matching or Thompson
[00:11:14] probability matching or Thompson sampling and we're going to see ideas of
[00:11:16] sampling and we're going to see ideas of how to do that in this
[00:11:19] how to do that in this setting okay so here is one of it's not
[00:11:23] setting okay so here is one of it's not the oldest algorithm um to do sort of
[00:11:26] the oldest algorithm um to do sort of probably efficient um exploration in
[00:11:29] probably efficient um exploration in tabular markof decision processes but
[00:11:32] tabular markof decision processes but it's um sort of one of the
[00:11:33] it's um sort of one of the quintessential ones and I think it
[00:11:34] quintessential ones and I think it illustrates a lot of the really nice
[00:11:36] illustrates a lot of the really nice ideas so this kind of a lot let's just
[00:11:38] ideas so this kind of a lot let's just step through it so the idea in this case
[00:11:40] step through it so the idea in this case is that we're going to um be making
[00:11:42] is that we're going to um be making decisions in a tabular Markoff decision
[00:11:44] decisions in a tabular Markoff decision process we're going to be taking actions
[00:11:47] process we're going to be taking actions with respect to some specific Q function
[00:11:49] with respect to some specific Q function that I'm going to Define in a second
[00:11:51] that I'm going to Define in a second we'll observe the reward in the state
[00:11:53] we'll observe the reward in the state we're going to update a whole bunch of
[00:11:54] we're going to update a whole bunch of things update that special Q Tilda and
[00:11:58] things update that special Q Tilda and repeat the key thing that we're going to
[00:12:00] repeat the key thing that we're going to be trying to do is similar to what we
[00:12:02] be trying to do is similar to what we saw for the upper confidence bound
[00:12:04] saw for the upper confidence bound algorithms we're going to think about
[00:12:05] algorithms we're going to think about how do we construct an upper confidence
[00:12:07] how do we construct an upper confidence bound on the Q function so that's going
[00:12:10] bound on the Q function so that's going to be sort of the
[00:12:13] key we're going to be doing this is an
[00:12:17] key we're going to be doing this is an upper confidence bound
[00:12:21] algorithm so this is going to again use
[00:12:24] algorithm so this is going to again use the idea of optimism under uncertainty
[00:12:26] the idea of optimism under uncertainty and we're going to think about how do we
[00:12:28] and we're going to think about how do we bring this to
[00:12:31] Okay, so the key idea in this case is that we would like to construct an optimistic upper bound on the Q function. This is a model-based approach, which means the way we're going to do that is by constructing optimistic estimates of the reward function and optimistic estimates of the dynamics model. It shouldn't be immediately obvious what it means to be optimistic with respect to the dynamics model, and we'll go through that in a minute.
[00:13:00] In practice, what we're going to do is the following. The reward is the easiest to start with. In the reward case, we're going to maintain counts of how many times we've taken an action in a particular state. We're also going to maintain counts of how many times we've started in a state, taken an action, and gone to a particular next state. We've seen these ideas before for tabular Markov decision processes; we used them for certainty-equivalent planning back in the first couple weeks of class.
[00:13:27] So the reward model is perhaps closest to what we've seen for the bandit case before. For the reward model, we're going to compute the empirical average: in this state, with this action, what's the average reward we've seen so far? Then we're going to think of there being an upper confidence bound on that. What we're also going to do in this case is maintain an empirical estimate of the dynamics model. Now, when we do this, we're going to do the normal Bellman equation, except we're going to include a bonus.
[00:14:05] So this part should look familiar from what we've seen with Hoeffding bounds: when we compute a Bellman backup, instead of just using the empirical estimate of the reward function and the empirical estimate of the dynamics model, as we would in certainty equivalence, we're going to include this bonus term. There are a few different ways to do this kind of model-based interval estimation; I'm picking one here that just uses a bonus term, but I'll mention some others, as there are a number of variants.
[00:14:46] So what this is saying is: when I do my Bellman backup of the expected discounted sum of rewards from starting in state s and taking action a (this is going to try to approximate Q star), I'm going to plug in my empirical estimate of the reward, use my empirical estimate of the dynamics model, and then add in a bonus. If I have not taken that action in that state very much, that bonus is going to be really large, because those counts for that state-action pair are going to be really small. So this will be large if the counts are small.
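Written out, the backup just described takes a form like the following. This is a sketch in the style of model-based interval estimation with an exploration bonus; here R-hat and P-hat are the empirical reward and dynamics estimates, N(s, a) is the visit count, and beta is a bonus coefficient, on the order of 1/(1 - gamma) when rewards are scaled to [0, 1].

```latex
\tilde{Q}(s,a) \;=\; \hat{R}(s,a) \;+\; \frac{\beta}{\sqrt{N(s,a)}} \;+\; \gamma \sum_{s'} \hat{P}(s' \mid s,a)\, \max_{a'} \tilde{Q}(s',a')
```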
[00:15:23] The key difference compared to what we've seen with bandits before is that this is a Bellman backup, so we will then repeat this many, many times: you do this for all states and actions, then you back up, and you do this many times. Intuitively, what's happening is that this is like pretending the expected discounted sum of rewards you'd get, if you start in a particular state and take a particular action, is much higher if you have not visited that state and action very much. That's where the optimism comes in: you end up adding in this bonus term, and this bonus term will be really large. Beta is defined up here; this bonus term has a 1 over 1 minus gamma in it, so if you imagine that all rewards are scaled between zero and one, and 1 minus gamma is really small, you can think of that as being like the horizon H times that special term, divided by the square root of the number of times you've been in that state and action.
[00:16:21] So what that means is that when you do these repeated Bellman backups, they will drive your policy to visit parts of the state and action space that you have not visited much, because those are the parts where you're going to have these really large overestimates (probably overestimates; optimistic estimates, I should say) of how good the value could be in those states. The reason this is important is that this is going to work when you're doing a series of episodes, or you're working in the same MDP for a long time: you will explore in your MDP, and it will drive you to cover the state and action space if you think it might possibly have good rewards in those places. So this will drive exploration, and by doing these repeated backups you're propagating your optimism under uncertainty backwards, so that you develop a policy that drives you that way.
[00:17:17] So this is one of the quintessential algorithms for doing tabular optimism-under-uncertainty-based planning. It is also a PAC algorithm. We talked about PAC last time: it means probably approximately correct. I'll just write that out again to remind ourselves: Probably Approximately Correct. But now we're going to talk about it for Markov decision processes in particular. We talked about how an algorithm is probably approximately correct if most of the time it makes a decision that is close to optimal, and it only makes mistakes a polynomial number of times. We saw last time that this isn't guaranteed in general; you could make mistakes forever. If you're acting randomly, you would continue to make mistakes forever.
[00:18:12] to make m forever MBI is a pack algorithm so what
[00:18:17] forever MBI is a pack algorithm so what it says is that it says that let's let
[00:18:21] it says is that it says that let's let sort of script A toote MBI ev's policy
[00:18:25] sort of script A toote MBI ev's policy at time step T and St denote the state
[00:18:27] at time step T and St denote the state at time t with high probability the
[00:18:30] at time t with high probability the value of the action the algorithm takes
[00:18:32] value of the action the algorithm takes is at least the value of the optimal
[00:18:34] is at least the value of the optimal action for that State minus Epsilon and
[00:18:38] action for that State minus Epsilon and it's true on all but a finite number of
[00:18:40] it's true on all but a finite number of steps with high
[00:18:42] steps with high probability so this is the number of
[00:18:46] probability so this is the number of steps okay and the important thing here
[00:18:49] steps okay and the important thing here is this is a polom in the sets of the
[00:18:52] is this is a polom in the sets of the state space the action space 1/ Epsilon
[00:18:55] state space the action space 1/ Epsilon and 1 over 1us gamma
[00:18:59] and 1 over 1us gamma now I always encourage um my research
[00:19:02] now I always encourage um my research students to plug in for bounds because
[00:19:04] students to plug in for bounds because theoretical bounds are beautiful um but
[00:19:06] theoretical bounds are beautiful um but it's nice to know whether or not they
[00:19:08] it's nice to know whether or not they are all related to
[00:19:09] are all related to practice so for example in this case you
[00:19:12] practice so for example in this case you might imagine let's say we have S = 10
[00:19:16] might imagine let's say we have S = 10 and a = 10 okay and you said Epsilon is
[00:19:20] and a = 10 okay and you said Epsilon is equal to 0.1 and Gamma is equal to9 all
[00:19:25] equal to 0.1 and Gamma is equal to9 all right so let's just work out what that
[00:19:26] right so let's just work out what that would be that would be roughly 10
[00:19:29] would be that would be roughly 10 3 * 10 to the
[00:19:33] 3 * 10 to the 9 or 10
[00:19:37] 9 or 10 12 so that's a lot okay right so what
[00:19:40] 12 so that's a lot okay right so what that would say is that we are sure by
[00:19:41] that would say is that we are sure by using this algorithm that we will only
[00:19:44] using this algorithm that we will only make step mistakes on this 10 State mdp
[00:19:47] make step mistakes on this 10 State mdp 10 to the 12 time
[00:19:50] 10 to the 12 time steps now I don't know about you but I
[00:19:52] steps now I don't know about you but I would hope that in you know a 10- state
[00:19:54] would hope that in you know a 10- state grid world mdp that we could learn to
[00:19:56] grid world mdp that we could learn to act substantially faster than that
[00:19:59] act substantially faster than that so I use it to highlight that um these
[00:20:02] so I use it to highlight that um these bounds while this might officially say
[00:20:03] bounds while this might officially say this is a pack algorithm they can be um
[00:20:06] this is a pack algorithm they can be um pretty conservative in sort of how many
[00:20:09] pretty conservative in sort of how many mistakes you might make now in practice
[00:20:11] mistakes you might make now in practice often this optimism under uncertainty
[00:20:13] often this optimism under uncertainty algorithm can work very well it doesn't
[00:20:15] algorithm can work very well it doesn't say you will make this number of
[00:20:16] say you will make this number of mistakes it just is an upper bound on it
[00:20:19] mistakes it just is an upper bound on it but it's good to plug these things in
[00:20:20] but it's good to plug these things in just to sort of see um how tight or not
[00:20:23] just to sort of see um how tight or not you think it is relative to Real
[00:20:25] you think it is relative to Real Performance all right so this is a pack
[00:20:28] Performance all right so this is a pack algorithm um the paper goes through an
[00:20:30] algorithm um the paper goes through an interesting proof of it to sort of show
[00:20:32] interesting proof of it to sort of show the different um components but one of
[00:20:34] the different um components but one of the key ideas is something called the
[00:20:36] the key ideas is something called the simulation Lemma and that I'm going to
[00:20:38] simulation Lemma and that I'm going to go through at least briefly because the
[00:20:41] go through at least briefly because the simulation Lemma is one of the one of
[00:20:43] simulation Lemma is one of the one of the many core ideas when we think about
[00:20:45] the many core ideas when we think about doing efficient
[00:20:47] doing efficient exploration a and the key idea for the
[00:20:50] exploration a and the key idea for the um uh the simulation Lemma is the idea
[00:20:53] um uh the simulation Lemma is the idea that we can relate the accuracy of our
[00:20:55] that we can relate the accuracy of our models to the accuracy of our learned Q
[00:20:58] models to the accuracy of our learned Q function okay so that's the key idea
[00:21:00] It's going to say we have bounded error: if we just ensure that we have good predictive models, we can relate the error in our predictive models back to our value function. So let's do that, at least sort of sketch it. This is going to be for tabular settings, okay, so we're going to assume that we're back in a finite set of states and a finite set of actions. This is one proof of the simulation lemma. We're going to assume that we have pi as a fixed policy, and we are going to assume that we have a max norm bound on the reward. We're going to assume that we have two different MDPs, MDP1 and MDP2, and these might have slightly different reward functions and slightly different dynamics models. So remember that with the infinity norm we can express this as the place where the two reward functions differ the most over our finite state space:

    max over (s, a) of |R1(s, a) − R2(s, a)| ≤ alpha

So, you know, if one of them gives the rewards of 1, 2, 7, 3 and the other one gives rewards of, like, 2, 6, 1, 7, we would figure out for which state the two rewards differ the most; let's assume that's upper bounded by alpha. [00:22:38] We're also going to assume that we have an upper bound on the dynamics model:

    |T1(s′ | s, a) − T2(s′ | s, a)| ≤ beta

Okay, so that means from the point of view of your predictive models, you can bound the amount by which the two MDPs differ, and we're going to show that if that's true, then their estimated Q functions for a particular policy also only differ by a bounded amount. So what we have is: we want to compare Q1^pi(s, a), the Q value under model 1 for state s and action a when following policy pi, versus Q2^pi(s, a).
[00:23:24] The reason this is going to be important is because in general what we're going to have is that R1 and R2 are going to correspond to our uncertainty. So if you think back to the Hoeffding inequalities I told you about, we talked about how our empirical estimate could differ from the true estimate by a bounded amount. So that should make you think about this part: R1 could be our empirical estimate of the reward and R2 could be the true unknown one, and Hoeffding can give you an upper bound on what that alpha is, and similarly we can get a bound on the dynamics model error as well. Okay, so the idea will be to say: if you end up plugging in, say, an empirical estimate of the reward model and the dynamics model, how far away could your estimate of the Q function be from what you'd get if you actually knew the true reward model and the true dynamics model? Okay, so that's why we're doing this.
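To make the setup concrete, here is a small sketch (my own illustration with made-up numbers, not from the lecture) that computes alpha and beta for two tabular MDPs, where each T[s, a] row is a distribution over next states:

```python
import numpy as np

# Hypothetical example: two small tabular MDPs with 4 states and 2 actions,
# differing slightly in their reward functions and dynamics models.
rng = np.random.default_rng(0)
S, A = 4, 2

R1 = rng.uniform(0, 1, size=(S, A))
R2 = R1 + rng.uniform(-0.05, 0.05, size=(S, A))   # perturbed reward model

T1 = rng.dirichlet(np.ones(S), size=(S, A))        # T1[s, a, s'] = P(s' | s, a)
T2 = 0.95 * T1 + 0.05 * rng.dirichlet(np.ones(S), size=(S, A))  # perturbed dynamics

# alpha: max-norm difference between the reward models, max_{s,a} |R1 - R2|
alpha = np.max(np.abs(R1 - R2))
# beta: max difference between the dynamics models over all (s, a, s') entries
beta = np.max(np.abs(T1 - T2))

print(alpha, beta)  # both small, because the perturbations are small
```

Because T2 is a convex combination of two proper distributions, its rows still sum to one, so it is a valid dynamics model.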
[00:24:18] So this is the difference; let's just write down what it looks like, using the definition of the Q function:

    Q1^pi(s, a) − Q2^pi(s, a) = [ R1(s, a) + gamma Σ_{s′} T1(s′ | s, a) V1^pi(s′) ] − [ R2(s, a) + gamma Σ_{s′} T2(s′ | s, a) V2^pi(s′) ]

[00:24:55] All right, we're going to upper bound this as follows. We're going to just use the triangle inequality first, to split the two terms:

    |Q1^pi(s, a) − Q2^pi(s, a)| ≤ |R1(s, a) − R2(s, a)| + gamma | Σ_{s′} T1(s′ | s, a) V1^pi(s′) − Σ_{s′} T2(s′ | s, a) V2^pi(s′) |

Remember, we've already said the first term is going to be less than or equal to alpha, because we've upper bounded our reward error. So then we have to think about the second term. We're going to do something that we often do in reinforcement learning, which is we add and subtract zero. Right now we have the value function of model 1 weighted under the dynamics model of model 1, and we want to have some in-between terms so we can directly think about the difference in the value functions under one particular dynamics model and the difference of the dynamics models separately. [00:26:26] So what we're going to do in this case is say this is less than or equal to alpha plus gamma times a sum over s′ where I add and subtract a term (I'm just going to use shorthand here so that I can fit everything, and be careful to make sure that's clear):

    ≤ alpha + gamma | Σ_{s′} T1(s′ | s, a) V1^pi(s′) − Σ_{s′} T1(s′ | s, a) V2^pi(s′) + Σ_{s′} T1(s′ | s, a) V2^pi(s′) − Σ_{s′} T2(s′ | s, a) V2^pi(s′) |

[00:27:11] Okay, so I just introduced a new term that is kind of an intersection term, and I added and subtracted it. The reason that's helpful is that now I can just think of terms where they only differ in the dynamics model, or terms where they only differ in the value function. Okay, so this is going to be less than or equal to:

    ≤ alpha + gamma Σ_{s′} T1(s′ | s, a) |V1^pi(s′) − V2^pi(s′)| + gamma Σ_{s′} |T1(s′ | s, a) − T2(s′ | s, a)| V2^pi(s′)

[00:28:10] All right, so what have I done there? I've just rearranged the terms, and I'm starting to apply my absolute values a lot to just repeatedly do the triangle inequality. Now, this first part looks a lot like the quantity we started with, so that's going to be like a recursive term. Let's call that difference Delta; then, since Σ_{s′} T1(s′ | s, a) = 1, the first sum is at most the max difference in your value functions. And in the second term I'm going to use the fact that my value function is upper bounded: Vmax = Rmax / (1 − gamma), because if you get the maximum reward at every single time step and your discount factor is gamma, this is an upper bound on Vmax. So that allows me to take the value function out of that term, and then that just leaves me with my difference in my dynamics model. [00:29:41] So this part here will be less than or equal to:

    ≤ alpha + gamma Delta + gamma Vmax beta

[00:29:44] Now, this has to hold for all states and actions, and the Delta that I've defined is kind of like the worst-case error over any of these: over any state, what's the maximum difference between the value functions? That also has to hold on the Q side. So we get:

    Delta ≤ alpha + gamma Delta + gamma Vmax beta

We're going to subtract gamma Delta, so now we have (1 − gamma) Delta ≤ alpha + gamma Vmax beta, or:

    Delta ≤ (1 / (1 − gamma)) (alpha + gamma Vmax beta)
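The bound just derived can be checked numerically. The sketch below (my own illustration, not from the lecture) evaluates Q^pi exactly under two nearby MDPs and compares the worst-case gap against (alpha + gamma · Vmax · beta) / (1 − gamma); here beta is taken as a bound on the summed dynamics error Σ_{s′} |T1 − T2|, which is what the last step of the derivation effectively uses:

```python
import numpy as np

# Numerical check of the simulation lemma bound for a fixed policy pi:
#   max_{s,a} |Q1^pi - Q2^pi| <= (alpha + gamma * Vmax * beta) / (1 - gamma)
rng = np.random.default_rng(1)
S, A, gamma = 5, 3, 0.9

R1 = rng.uniform(0, 1, size=(S, A))
R2 = np.clip(R1 + rng.uniform(-0.02, 0.02, size=(S, A)), 0, 1)
T1 = rng.dirichlet(np.ones(S), size=(S, A))
T2 = 0.98 * T1 + 0.02 * rng.dirichlet(np.ones(S), size=(S, A))
pi = rng.integers(A, size=S)                      # a fixed deterministic policy

def q_pi(R, T):
    # Exact policy evaluation: V = (I - gamma * T_pi)^{-1} R_pi, then
    # Q(s, a) = R(s, a) + gamma * sum_{s'} T(s' | s, a) V(s').
    R_pi = R[np.arange(S), pi]
    T_pi = T[np.arange(S), pi]                    # T_pi[s, s'] = T(s' | s, pi(s))
    V = np.linalg.solve(np.eye(S) - gamma * T_pi, R_pi)
    return R + gamma * T @ V

alpha = np.max(np.abs(R1 - R2))
beta = np.max(np.abs(T1 - T2).sum(axis=-1))       # worst-case summed dynamics error
Vmax = 1.0 / (1 - gamma)                          # Rmax = 1 here, rewards in [0, 1]
gap = np.max(np.abs(q_pi(R1, T1) - q_pi(R2, T2)))
bound = (alpha + gamma * Vmax * beta) / (1 - gamma)
print(gap, bound)                                 # gap should sit below the bound
```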
[00:30:37] Okay, so what have we just shown? We've said the worst-case error in the value function for the same policy between one model and the other model is upper bounded by 1 / (1 − gamma) times the error in your reward model plus gamma times your maximum value times the error in your dynamics model. So that is one version, at least, of the simulation lemma. It comes up in lots of different other areas too; people use it for a lot more advanced, complicated settings. But the critical idea here is to say that if you can bound your error in the dynamics model and the error in your reward function, that also means that your Q functions can't be too different. Okay, so that's the main important point of that. And so this idea in general is, excuse me, a helpful one, because it means that as we explore and we learn these predictive models better, we can be sure that our value function is also simultaneously getting better and better over time and getting more accurate. And in the proofs of PAC algorithms, that's often used to say you can't sort of infinitely learn in particular states and actions and not end up with a value function that is getting more and more accurate.
[00:31:50] Okay, all right, so now I'll pause there in case anybody has any questions before we move on to Bayesian Markov decision processes.
[00:31:58] Student: So here we defined the difference in the value function as Delta, but why could we also represent the difference in the Q function as that?
Professor: Because this is an upper bound on that term, and it has to hold for all of the states and actions we could ever be at, and that also has to hold for any state: you could make the a here be pi(s), and then, yeah.
[00:32:30] Student: Shouldn't there be a factor of the number of states, because we're summing over all the states? Because from the earlier bound I thought that's just for one state; for example, the dynamics bound is just for one state, less than or equal to beta, but then the proof is summing over a bunch of states.
Professor: Good question. So what happens here is this is just assuming that you have a bound for every single s′. Where the number of states will normally come in is when you start to think about how much data you need to achieve this: you want this to hold for every single state-action pair, and normally that's where you will end up getting a dependence on the amount of data needed in order to get sufficiently small confidence intervals, with your union bounding to make sure all of these bounds hold. In terms of this part, the state space doesn't appear, but this is just the simulation lemma; it doesn't then tell you all the way to how many samples you need to achieve this. Okay, that's sort of another part of the proof. You could probably imagine already, given what you know about Hoeffding, having some way to compute how many samples you need to get alpha sufficiently small, and the dynamics model requires, you know, kind of a similar idea. [00:33:58] And you can do this for other sorts of parametric models too, like Gaussians, etc. Anybody else have any questions on this part before we go on to Bayesian?
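The sample-count question just discussed can be made concrete with a small sketch (my own illustration, not from the lecture): using Hoeffding for rewards in [0, 1] and a union bound over all |S||A| state-action pairs, solve for the number of samples n that makes the confidence width at most alpha.

```python
import math

# How many reward samples per (s, a) are needed so that Hoeffding gives
# |R_hat - R| <= alpha for rewards in [0, 1], simultaneously for all
# |S||A| pairs via a union bound (illustrative, not the lecture's numbers).
def samples_needed(alpha, delta, n_states, n_actions):
    # Hoeffding: P(|R_hat - R| > alpha) <= 2 exp(-2 n alpha^2), so with
    # per-pair failure probability delta / (|S| |A|), solve for n:
    delta_per_pair = delta / (n_states * n_actions)
    return math.ceil(math.log(2 / delta_per_pair) / (2 * alpha ** 2))

# e.g. alpha = 0.1, total failure probability delta = 0.05, 10 states, 4 actions
n = samples_needed(0.1, 0.05, 10, 4)
print(n)  # scales like log(|S||A| / delta) / alpha^2
```

This is exactly where the state space enters the bound, as the instructor notes: through the union bound inside the logarithm, not through the simulation lemma itself.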
[00:34:10] All right, let's move on to Bayesian. Okay, so in all these cases, you know, we weren't using any notion of priors or any of the things that we saw last time, so now we're going to think about how we can lift some of the ideas we saw from last time to Markov decision processes. Just as a refresher, remember the other common way to think about trying to do efficient exploration is to imagine that we have some prior knowledge over how good we think different states and actions might be, or how we think the dynamics might work, and then what we're going to do is try to use that information to figure out how to act. And we saw Thompson sampling as being one method that was an efficient way to try to make decisions when we have these priors and these posteriors.
[00:34:57] Okay, and now we're going to think about sort of lifting these ideas to the sequential case. So what we saw before is that we'd have these priors over the model parameters, like in this case, say, the reward models, and if they were conjugate, excuse me, then after we would actually observe a reward, we had this nice closed-form expression for the Betas, so we could think of these as just being kind of the number of successes and the number of failures. And I talked about, but didn't actually illustrate, that you can do this for other sorts of things, like Gaussians, etc. All right, so you might think this should work clearly for the reward part of a Markov decision process; can we do this in general? So to remind ourselves, this is what we did: Thompson sampling for multi-armed bandits involved us maintaining this prior; we would sample from that, meaning we would get, like, a particular set of values for our coin flips, like what you saw before; we would then act optimally with respect to those; observe the reward; and update our posterior.
[00:36:01] So now what we're going to do is a very similar thing, but we're going to maintain a prior over Markov decision process models. So we could have a reward model, and right now we're going to again start with the tabular case: there's a finite set of states and actions. So in this case you could imagine maintaining a different reward model for every single state and action, and being able to sample from it; you could sample a parameter for every single one of those, and we're going to see how we can use that to actually do something very similar to Thompson sampling for the sequential process case.
[00:36:48] Okay, so the idea now is that we're going to maintain a prior over all of the dynamics models and all of the reward models, and we will sample from that. Now, if you remember what I just showed you in the case of bandits, once we did the sampling of the parameters it was really easy to figure out a decision, because in the case of the Bernoulli bandits, as soon as you know that this coin flip is going to give you a one with higher probability than this other one, it tells you how to act. For a Markov decision process it's more complicated, because as soon as you see the dynamics and reward model, you don't know how to act yet; you actually have to solve a planning problem. So it's like you sample a Markov decision process; once you're given that Markov decision process, then you have to do planning, like value iteration or something like that, to actually get your Q*; and once you get your Q*, then you can select the optimal action given that computed Q*. So computationally it can involve a lot more work than what we saw in the bandit case. Okay, then the next question you might have is: how do we do this sampling?
[00:38:00] So this is the PSRL algorithm. It was invented by Ian Osband and Dan Russo and Ben Van Roy, who's here at Stanford; these guys were here at Stanford when they invented it. The idea is as follows (and I'll talk about sampling the dynamics model in a second). There's going to be a series of episodes. At the very start of an episode, given your prior (your current posterior), you're going to sample a dynamics model and sample a reward model for every single state-action pair. Given that, you're going to compute Q* for your sampled MDP. Once you have that sampled MDP, you are going to act according to that policy for the entire episode; so this computes a Q* for the entire episode. Then, for t = 1 to H (you're going to assume your episodes are finite), you're going to act according to your Q*, observe your reward and your next state, and repeat. Okay, at the end of the whole episode, you're going to take all of the data that you just gathered and you're going to update your posterior.
[00:39:00] going to update your posterior so for the reward model it can
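[Editor's note] The episode loop just described can be sketched in code. This is a minimal, self-contained version of the PSRL idea, not the authors' implementation: rewards are assumed known to keep it short, the dynamics posterior is an independent Dirichlet per state-action pair, and planning is finite-horizon value iteration.

```python
import random

def sample_dirichlet(alphas):
    # Dirichlet draw via normalized independent Gamma(alpha, 1) draws.
    g = [random.gammavariate(a, 1.0) for a in alphas]
    total = sum(g)
    return [x / total for x in g]

def value_iteration(P, R, horizon):
    # Finite-horizon Q*: returns Q[t][s][a] for t = 0..horizon-1.
    n_s, n_a = len(R), len(R[0])
    V = [0.0] * n_s
    Q = []
    for _ in range(horizon):
        Qt = [[R[s][a] + sum(P[s][a][s2] * V[s2] for s2 in range(n_s))
               for a in range(n_a)] for s in range(n_s)]
        V = [max(row) for row in Qt]
        Q.append(Qt)
    Q.reverse()  # Q[0] is the first timestep of the episode
    return Q

def psrl(true_P, R, num_episodes, horizon):
    n_s, n_a = len(R), len(R[0])
    # Dirichlet(1, ..., 1) prior on next-state probabilities per (s, a).
    alpha = [[[1.0] * n_s for _ in range(n_a)] for _ in range(n_s)]
    returns = []
    for _ in range(num_episodes):
        # 1) Sample one full MDP: a dynamics model for every (s, a) pair.
        P = [[sample_dirichlet(alpha[s][a]) for a in range(n_a)]
             for s in range(n_s)]
        # 2) Plan in the sampled MDP to get its Q*.
        Q = value_iteration(P, R, horizon)
        # 3) Act greedily w.r.t. that Q* for the whole episode.
        s, total, data = 0, 0.0, []
        for t in range(horizon):
            a = max(range(n_a), key=lambda a_: Q[t][s][a_])
            s2 = random.choices(range(n_s), weights=true_P[s][a])[0]
            total += R[s][a]
            data.append((s, a, s2))
            s = s2
        # 4) Only at the end of the episode, update the posterior
        #    (conjugate update: just add transition counts).
        for (s_, a_, s2_) in data:
            alpha[s_][a_][s2_] += 1.0
        returns.append(total)
    return returns
```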
[00:39:02] posterior so for the reward model it can be very similar to what we saw last time
[00:39:04] be very similar to what we saw last time where you know you just sort of update
[00:39:05] where you know you just sort of update your counts um for the Dynamics model
[00:39:09] your counts um for the Dynamics model it's probably um it may not be clear
[00:39:12] it's probably um it may not be clear what you would do in that case so in
[00:39:14] what you would do in that case so in this case what we would often probably
[00:39:16] this case what we would often probably choose to
[00:39:21] do all right let's just WR up here so
[00:39:24] do all right let's just WR up here so what we'd often do is we'd use a DL
[00:39:27] what we'd often do is we'd use a DL model
[00:39:31] a DL model is a conjugate prior for
[00:39:37] multinomial okay multinomials is what we
[00:39:40] multinomial okay multinomials is what we can use for our normal Dynamics model
[00:39:42] can use for our normal Dynamics model here because a multinomial allows us to
[00:39:45] here because a multinomial allows us to express what is the probability of going
[00:39:47] express what is the probability of going to any of the next States given this
[00:39:49] to any of the next States given this current state in action so in general we
[00:39:51] current state in action so in general we would have one multinomial for each day
[00:39:53] would have one multinomial for each day in action
[00:39:54] in action pair we are now in the basian setting
[00:39:57] pair we are now in the basian setting and so now so we would
[00:39:59] and so now so we would have
[00:40:00] have one
[00:40:03] one per sa pair okay and this specifies our
[00:40:08] per sa pair okay and this specifies our probability distribution over all the
[00:40:10] probability distribution over all the next states that
[00:40:14] specifies P over S Prime given s and a
[00:40:18] specifies P over S Prime given s and a for all S Prime okay has to sum up to
[00:40:21] for all S Prime okay has to sum up to one that's the multinomial part the part
[00:40:24] The part where we're being Bayesian is that we're assuming we don't know what all these parameters are, and so we have a prior over them. And the Dirichlet is a conjugate prior, which means that if we start with the Dirichlet over multinomial parameters and we observe something, so let's say we're interested in understanding what happens when we're in state one and we take action one, and we observe that we go to s3 seven times, and we observe we go to s7, well, I'll do two times, what that means is that at the end of that episode we would use that data to change our Dirichlet distribution over those multinomial parameters, very similar to what we saw for the beta distribution. Okay, and I don't expect, I mean, it's an interesting thing to do, but I don't expect you to do that in this class; some of you might want to for part of your projects. But the key idea here is that it is conjugate, so it means that the posteriors you get are in the same family as your priors, and so you can use this to sample multinomials, which is essentially just sampling dynamics models.
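[Editor's note] As a concrete sketch of that conjugate update, with the state and count numbers from the example above (the uniform Dirichlet(1, ..., 1) prior is my assumption):

```python
import random

def dirichlet_sample(alphas):
    # Draw one probability vector from Dirichlet(alphas) by normalizing
    # independent Gamma(alpha_i, 1) draws.
    g = [random.gammavariate(a, 1.0) for a in alphas]
    total = sum(g)
    return [x / total for x in g]

# Prior over P(s' | s1, a1) for a 7-state MDP: Dirichlet(1, ..., 1).
alpha = [1.0] * 7

# Conjugacy: observing transitions just adds counts to the alphas.
# From (s1, a1) we saw s3 seven times and s7 two times.
alpha[2] += 7   # index 2 <-> s3
alpha[6] += 2   # index 6 <-> s7

# The posterior is again a Dirichlet, so sampling a dynamics model for
# this (s, a) pair is a single draw: a valid next-state distribution.
p = dirichlet_sample(alpha)
mean = [a / sum(alpha) for a in alpha]   # posterior mean: 8/16 = 0.5 on s3
```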
[00:41:39] So we do this in this case, and we do this over and over again. And the really key thing to notice here, compared to what we were seeing before, is that we have to sample this entire MDP and we have to compute its optimal value before we act. So we do all of this computation before the start of an episode.
[00:42:03] [Student] Yeah, I'm confused about this "sample the MDP," but then the "sampling dynamics and reward models" part. [Instructor] Oh, this is just explaining that. Okay, this is like a comment, "sample the MDP"; what this means is just, for each of the states and actions, yeah, good clarification question. To "sample an MDP," what I mean is that you're going to define the MDP. So that means we can completely specify an MDP, given a known state and action space and a discount factor, by specifying a dynamics model for every single state-action pair and specifying a reward model. Okay, and then we'll have to compute the optimal action.
[00:42:43] Now, one thing that you might wonder about is: is it important or necessary that we do all of this once per episode? Okay, so when we talked about Bayesian bandits, after every single observation we updated our posterior. So we would, you know, try buddy-taping the toes, and we'd see that that helps someone recover, and then we would update our prior. This is a bit different, right? We are only doing this every H steps. Now you might think maybe that's computational, you might think that that's being done for another reason, but it's an interesting thing to think about. Let me see if I have this on the next slide. Yeah, so let's do a check your understanding, and then I'll talk a little bit more about why this is done in PSRL. So this asks you to think a little bit about sort of doing strategic exploration in MDPs and in Thompson sampling, in the algorithm I just showed.
[00:44:36] I want you to compare your answers to someone near you.
[00:45:42] Okay, thank you. Okay, yeah, so now it should be back on. What I was saying is that in Maria Dimakopoulou's work, she was thinking about concurrent reinforcement learning, which is something we've also thought about, and for this much more realistic setting the idea is whether you might need to coordinate exploration and how frequently you should update. Now, one of the challenges of this setting, even before you get into concurrent reinforcement learning, is that if you update your prior a lot within a task, like within a single episode, you're essentially sampling different MDPs within the same episode. The reason that can be bad is it's going to totally change your behavior, and there may be some cases where you essentially thrash. So let me give an example. One of the sort of canonical hard Markov decision processes that people talk about is a chain. Okay, it's really just sort of an illustrative one, and there are lots of different slight variants of chains.
[00:46:49] slight variance of chains um and the idea is that you might
[00:46:54] chains um and the idea is that you might have something I've shown stuff that's
[00:46:57] have something I've shown stuff that's similar to this before where you know on
[00:47:00] similar to this before where you know on one side or on the other side oh it's
[00:47:02] one side or on the other side oh it's not reconnecting just for in the back
[00:47:04] not reconnecting just for in the back there to reconnect um that you would
[00:47:08] there to reconnect um that you would have high reward on one of the other
[00:47:10] have high reward on one of the other sides what you could imagine in this
[00:47:12] sides what you could imagine in this case if you were thinking about it being
[00:47:13] case if you were thinking about it being like a basy and Bandit is that some of
[00:47:16] like a basy and Bandit is that some of the times it might pick a Markoff
[00:47:17] the times it might pick a Markoff decision process where this is the good
[00:47:19] decision process where this is the good the best state and some of the time it
[00:47:22] the best state and some of the time it might break a markup decision process oh
[00:47:24] might break a markup decision process oh it's still not showing on the there
[00:47:28] it's still not showing on the there um thanks hopefully that'll come up um
[00:47:31] um thanks hopefully that'll come up um and some of the time it might pick one
[00:47:33] and some of the time it might pick one that is here so if you start off acting
[00:47:38] that is here so if you start off acting let's say that you first sampled an mdp
[00:47:41] let's say that you first sampled an mdp where this is the best state you do your
[00:47:42] where this is the best state you do your planning and then your agent's going to
[00:47:44] planning and then your agent's going to start going this way okay let's say You
[00:47:48] start going this way okay let's say You observe that there's some zero reward
[00:47:49] observe that there's some zero reward here and your Thompson sampling updates
[00:47:51] here and your Thompson sampling updates and
[00:47:52] and now it says hey this is the best date
[00:47:56] now it says hey this is the best date okay okay cuz you just have some prior
[00:47:57] okay okay cuz you just have some prior over the model parameters and so your
[00:47:59] over the model parameters and so your agent turns around and it's like oh I
[00:48:00] agent turns around and it's like oh I shouldn't go this way I should go this
[00:48:02] shouldn't go this way I should go this way okay and then as you're doing that
[00:48:05] way okay and then as you're doing that it's getting more rewards and it's
[00:48:06] it's getting more rewards and it's updating its posterior and so then it
[00:48:08] updating its posterior and so then it samples again and it's like oh this is
[00:48:14] And so it can really lead to this kind of thrashing behavior, because it's sampling a new Markov decision process each time, and so your agent can end up kind of toggling back and forth between its sort of ideas over which MDP it's in. So it's for this reason that often you will want to essentially commit to the Markov decision process you're in for the whole time. You don't always have to do this, but that's one of the reasons why this can be helpful. This commitment is also in, like, the 2013 NeurIPS paper, well, not paper, but the algorithm that we saw earlier, right? They both commit.
[00:48:52] earlier right they both commit um this yes yeah this is yes this is the so far
[00:48:55] yes yeah this is yes this is the so far sorry this is just exactly the same as
[00:48:56] sorry this is just exactly the same as the psrl algorithm I'm about to tell you
[00:48:58] the psrl algorithm I'm about to tell you about the seed sampling but yes this is
[00:49:00] about the seed sampling but yes this is just in the 2013 so yeah exactly this is
[00:49:02] just in the 2013 so yeah exactly this is the um in psrl itself it commits and
[00:49:05] the um in psrl itself it commits and this is one of the reasons for that um
[00:49:07] this is one of the reasons for that um and so in Maria's work she sort of
[00:49:09] and so in Maria's work she sort of discusses some of the important benefits
[00:49:10] discusses some of the important benefits of it and then she thinks about how
[00:49:12] of it and then she thinks about how would you actually maybe try to couple
[00:49:14] would you actually maybe try to couple and coordinate exploration if you have
[00:49:17] and coordinate exploration if you have many agents that are going through the
[00:49:18] many agents that are going through the same environment as at once and it's for
[00:49:21] same environment as at once and it's for sort of in some ways like it relates to
[00:49:23] sort of in some ways like it relates to this idea too you might want everybody
[00:49:25] this idea too you might want everybody to sort of commit to exploring different
[00:49:27] to sort of commit to exploring different parts of the space because if you have
[00:49:29] parts of the space because if you have you know many agents in the same domain
[00:49:30] you know many agents in the same domain you might want to say you're going to
[00:49:32] you might want to say you're going to think that the best reward is here
[00:49:33] think that the best reward is here you're going to think the best reward is
[00:49:34] you're going to think the best reward is here go explore and then we'll sort of
[00:49:36] here go explore and then we'll sort of unify our poster afterwards so she has a
[00:49:39] unify our poster afterwards so she has a nice demonstration of that and then she
[00:49:42] nice demonstration of that and then she extended it to the Deep learning case
[00:49:44] extended it to the Deep learning case shortly afterwards but see if I can play
[00:49:55] that
[00:52:02] Okay, so eventually it happens, right? But then you can get to concurrent UCRL, where in this case, if you don't do something smart, again this can end up being not very effective. And let me just see if I can skip ahead to the last part, seed sampling. Okay, good. Okay, so seed sampling in her case is what they're doing when they essentially do concurrent reinforcement learning, and you might have even missed it because that part is really fast. Okay, so I'll move it to this just so I can talk over it at the same time. Okay, so this is seed sampling, which is what their idea was, and this sort of just talks again about doing strategic, coordinated sampling. So you can see in this case we're leveraging the fact that you've got concurrent agents that are exploring the environment. They're committing, but they're committing in a way that they coordinate, so you don't get all of the agents... So here, by 324, all of the agents have shared information about where the cheese is, and everyone solved it. All right, so that just illustrates why you both need sort of this committing to a particular exploration strategy, and then, if you're in the case where you also have concurrent agents, which is very realistic, that having this additional coordination is really helpful.
[00:53:20] coordination is really helpful now I think one of the interesting things to
[00:53:21] think one of the interesting things to note there is that this is a nice place
[00:53:23] note there is that this is a nice place where there's
[00:53:25] where there's some is it connecting hopefully it'll
[00:53:28] some is it connecting hopefully it'll connect in a second where there's an
[00:53:30] connect in a second where there's an interesting disc um disconnect between
[00:53:32] interesting disc um disconnect between Theory and experiment it's still not
[00:53:35] Theory and experiment it's still not maybe there's like a problem with a
[00:53:36] maybe there's like a problem with a connector we'll try to get that fixed
[00:53:37] connector we'll try to get that fixed for next week um there there's a
[00:53:40] for next week um there there's a disconnect between theory and practice
[00:53:43] disconnect between theory and practice because theoretically you don't need to
[00:53:45] because theoretically you don't need to do this exploration so we have a paper
[00:53:47] do this exploration so we have a paper from 2015 showing that if you don't do
[00:53:49] from 2015 showing that if you don't do coordinated exploration it's still
[00:53:51] coordinated exploration it's still totally sufficient um you can still get
[00:53:53] totally sufficient um you can still get basically almost a linear speed up but
[00:53:56] basically almost a linear speed up but oh good finally game yay all right so
[00:54:00] oh good finally game yay all right so that covers sort of how you can do
[00:54:02] that covers sort of how you can do basian exploration um in and sort of
[00:54:06] basian exploration um in and sort of optimism under uncertainty in the
[00:54:08] optimism under uncertainty in the tabular Markoff decision uh process case
[00:54:11] tabular Markoff decision uh process case but of course what we'd like to be able
[00:54:12] but of course what we'd like to be able to do is to do this for much more large
[00:54:13] to do is to do this for much more large State spaces and realistic problems so
[00:54:16] State spaces and realistic problems so this is a very much an ongoing area um
[00:54:19] this is a very much an ongoing area um again you'll see this sort of similarity
[00:54:21] again you'll see this sort of similarity to the types of ideas we've seen before
[00:54:23] to the types of ideas we've seen before very popular ideas or optimism under
[00:54:25] very popular ideas or optimism under uncertainty and Thompson plan they're
[00:54:26] uncertainty and Thompson plan they're not the only ones but they're probably
[00:54:28] not the only ones but they're probably the dominant strategies people try to
[00:54:30] the dominant strategies people try to use S I may have just not caught this
[00:54:33] use S I may have just not caught this but like what spe like what is actually
[00:54:35] but like what spe like what is actually different between the two algorithms
[00:54:37] different between the two algorithms like what is the difference in like C
[00:54:38] like what is the difference in like C samp Point like between 2013 and
[00:54:41] samp Point like between 2013 and 2018 yes so two things one is that um
[00:54:45] 2018 yes so two things one is that um the psrl does not think about
[00:54:46] the psrl does not think about concurrency so they just assume there's
[00:54:48] concurrency so they just assume there's a single mdp you have a single agent in
[00:54:50] a single mdp you have a single agent in it the other case assumes you have like
[00:54:52] it the other case assumes you have like M agents all in the same mdp so like the
[00:54:56] M agents all in the same mdp so like the like the mice trying to find the cheese
[00:54:57] like the mice trying to find the cheese there's not just one Mouse there's a
[00:54:58] there's not just one Mouse there's a whole bunch and the idea with seed
[00:55:00] whole bunch and the idea with seed sampling is also to think about how do
[00:55:02] sampling is also to think about how do you choose which mdp the H sync they're
[00:55:05] you choose which mdp the H sync they're in to distribute the
[00:55:08] in to distribute the exploration
[00:55:12] okay in the other case you don't have to
[00:55:14] okay in the other case you don't have to do any coordination because there's only
[00:55:15] do any coordination because there's only one
[00:55:17] Okay, good. So in terms of generalization, we're going to think about this. The reason why this starts to get more tricky is a couple of things. One is that optimism under uncertainty means we have to have a notion of uncertainty, and it just gets much harder to represent uncertainty when we have deep neural networks. Similarly, for Thompson sampling, as we start to move up to really complicated domains, we need posteriors over really complicated settings, and that's also computationally challenging and hard to approximate.
[00:55:43] challenging and hard to approximate so let's first start with
[00:55:45] approximate so let's first start with contextual
[00:55:46] contextual Bandits um and some of you guys will
[00:55:48] Bandits um and some of you guys will probably be doing some of this for your
[00:55:50] probably be doing some of this for your project so instead of having our multi
[00:55:52] project so instead of having our multi Bandit now we're sort of halfway between
[00:55:54] Bandit now we're sort of halfway between a markof decision process and abandoned
[00:55:57] a markof decision process and abandoned so we're going to assume we have States
[00:55:59] so we're going to assume we have States but the action we take doesn't influence
[00:56:00] but the action we take doesn't influence the next state and so now if we think
[00:56:03] the next state and so now if we think about rewards we'll have a reward per
[00:56:05] about rewards we'll have a reward per action and
[00:56:06] action and state and just like we what we've often
[00:56:09] state and just like we what we've often done before if we have a really large
[00:56:11] done before if we have a really large state in action space we're going to
[00:56:12] state in action space we're going to assume that we use some sort of
[00:56:14] assume that we use some sort of parametric representation to model the
[00:56:16] parametric representation to model the relationship between State and action
[00:56:17] relationship between State and action and output
[00:56:19] and output rewards perhaps not surprisingly there
[00:56:21] rewards perhaps not surprisingly there is an enormous benefit of doing this so
[00:56:24] is an enormous benefit of doing this so if you think about um a setting where
[00:56:26] if you think about um a setting where this is the number of arms you have okay
[00:56:29] this is the number of arms you have okay if you did something like upper
[00:56:30] if you did something like upper confidence bounds and this is Regret
[00:56:32] confidence bounds and this is Regret regret is on the Y AIS so if you did
[00:56:34] regret is on the Y AIS so if you did something like upper confidence bounds
[00:56:36] something like upper confidence bounds and you have a th000 arms and 4,000
[00:56:38] and you have a th000 arms and 4,000 pools sorry you have a th000 arms and
[00:56:40] pools sorry you have a th000 arms and then um you're you're pulling these over
[00:56:43] Then you're pulling these over time, so this is regret after a fixed number of time steps. Unsurprisingly, if you have a lot more arms to pull, you'll have a lot more regret, because in upper confidence bounds, in the things we've seen so far, you don't share any information across the arms. If on the other hand you use something like linear UCB, which assumes that your arms are represented by a set of features (showing someone, say, a Trump campaign ad today and a different Trump campaign ad tomorrow might have the same effect, because they share a set of features; being about Trump would be one thing that overlaps), then you can leverage that structure. What you can see in this case is that if you leverage a parametric linear representation, then even as you scale up the actual number of arms K, as long as your parameter space stays the same, your regret doesn't scale badly. Your theta in this case might just be low dimensional: we might have a theta in R^d, a d-dimensional representation. This just shows that it can be really helpful; in general, you want to leverage
[00:58:03] general you want a leverage structure so one common thing to do is
[00:58:06] structure so one common thing to do is to model the reward as a linear function
[00:58:09] to model the reward as a linear function um of course this could be built on top
[00:58:11] um of course this could be built on top of a deep neural network or on top of a
[00:58:13] of a deep neural network or on top of a large language model or something like
[00:58:15] large language model or something like that you can often just use some really
[00:58:16] that you can often just use some really complicated representation of the state
[00:58:17] complicated representation of the state in action space and then say for the
[00:58:21] in action space and then say for the last layer my actual reward is going to
[00:58:23] last layer my actual reward is going to be a function of these complicated
[00:58:24] be a function of these complicated features um producted with some Theta
[00:58:29] features um producted with some Theta parameter and one common thing is to
[00:58:31] parameter and one common thing is to assume that it's just this linear
[00:58:32] assume that it's just this linear function plus some noise and the nice
[00:58:34] function plus some noise and the nice thing about this is that if your
[00:58:36] thing about this is that if your features are interpretable then your
[00:58:37] features are interpretable then your reward function is also very
[00:58:38] reward function is also very interpretable because you can just think
[00:58:40] interpretable because you can just think of like relatively how much do each of
[00:58:42] of like relatively how much do each of those features contribute to your
[00:58:45] reward all
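As a hedged sketch of that setup (the feature map and all numbers here are invented for illustration, not taken from the lecture), fitting such a linear reward model and reading off per-feature contributions might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

# phi(s, a) stands in for some (possibly learned) d-dimensional feature
# vector of a state-action pair; the reward is assumed to be linear in it
# plus noise. All numbers here are invented for illustration.
d, n = 4, 500
theta_true = np.array([2.0, -1.0, 0.5, 0.0])    # unknown to the learner

Phi = rng.normal(size=(n, d))                    # observed feature vectors
rewards = Phi @ theta_true + 0.1 * rng.normal(size=n)

# Fit theta by least squares from (feature, reward) pairs.
theta_hat, *_ = np.linalg.lstsq(Phi, rewards, rcond=None)

# If the features are interpretable, so is the model: each weight says how
# much that feature contributes to the predicted reward.
print(theta_hat.round(2))
```

With interpretable features, each entry of `theta_hat` is directly the relative contribution of that feature to the reward, which is the interpretability point made above.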
[00:58:47] reward all right so one thing to think about in
[00:58:50] right so one thing to think about in this case is in these settings um well
[00:58:55] this case is in these settings um well I'll go a little faster this part cuz I
[00:58:56] I'll go a little faster this part cuz I want to make sure we get to the mdp part
[00:58:58] want to make sure we get to the mdp part too but when you have this even if you
[00:59:01] too but when you have this even if you know you have kind of a linear set of
[00:59:04] know you have kind of a linear set of models you can use them to represent
[00:59:06] models you can use them to represent sort of more complicated
[00:59:07] sort of more complicated functions because let's
[00:59:11] functions because let's say Technologies again so let's say this
[00:59:15] say Technologies again so let's say this is your reward
[00:59:17] is your reward model okay for three different actions
[00:59:19] model okay for three different actions this is A1 this is A2 and this is
[00:59:22] this is A1 this is A2 and this is A3 um and this is what your reward is
[00:59:26] A3 um and this is what your reward is and this is your state space okay so
[00:59:28] and this is your state space okay so let's imagine you had a linear
[00:59:30] let's imagine you had a linear representation then you could represent
[00:59:33] representation then you could represent policies that are just joint linear
[00:59:35] policies that are just joint linear because you could if you were taking the
[00:59:36] because you could if you were taking the max here this is what the value would be
[00:59:39] max here this is what the value would be of your policy because it would say A1
[00:59:42] of your policy because it would say A1 dominates for this part of the state
[00:59:43] dominates for this part of the state space A3 dominates for this part of the
[00:59:46] space A3 dominates for this part of the state space and A2 dominates for this
[00:59:47] state space and A2 dominates for this part of the state space so linear ones I
[00:59:51] part of the state space so linear ones I guess the main point here is that even
[00:59:53] guess the main point here is that even if you have a linear Pol linear reward
[00:59:55] if you have a linear Pol linear reward model it doesn't mean your policy has to
[00:59:57] model it doesn't mean your policy has to be linear your policy will be disjoint
[00:59:59] be linear your policy will be disjoint linear it can be made up of these sorts
[01:00:01] linear it can be made up of these sorts of functions okay so it's it's fairly
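The picture just described can be reproduced in a few lines; the three linear models below are made-up stand-ins for the ones on the slide:

```python
# Three actions, each with its own linear reward model over a 1-D state:
# r(s, a) = w_a * s + b_a. The numbers are invented for illustration.
models = {"A1": (-1.0, 1.0), "A2": (0.0, 0.6), "A3": (1.0, -0.5)}

def greedy_action(s):
    # Each model is linear, but taking the max carves the state space into
    # regions where different actions dominate, so the greedy policy (and
    # its value) is only piecewise ("disjoint") linear.
    return max(models, key=lambda a: models[a][0] * s + models[a][1])

print([greedy_action(s) for s in (-1.0, 0.5, 2.0)])
```

For these made-up models, A1 wins on the left of the state space, A2 in the middle, and A3 on the right, which is exactly the disjoint-linear structure described above.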
[01:00:04] of functions okay so it's it's fairly flexible okay how would we work in these
[01:00:07] flexible okay how would we work in these cases well in this case what it means to
[01:00:09] cases well in this case what it means to have uncertainty is we need to have
[01:00:10] have uncertainty is we need to have uncertainty over this linear
[01:00:12] uncertainty over this linear Vector um so we want to capture
[01:00:15] Vector um so we want to capture uncertainty over Theta through some sort
[01:00:18] uncertainty over Theta through some sort of uncertainty set and there's been a
[01:00:20] of uncertainty set and there's been a lot of beautiful work to try to quantify
[01:00:23] lot of beautiful work to try to quantify the types of uncertainties we have
[01:00:24] the types of uncertainties we have through things like um uh the elliptical
[01:00:27] through things like um uh the elliptical potential themma and things like that
[01:00:29] potential themma and things like that which give us basically just sort of an
[01:00:31] which give us basically just sort of an uncertainty set over vectors okay and
[01:00:34] uncertainty set over vectors okay and you can do this in a computationally
[01:00:35] you can do this in a computationally tractable
[01:00:36] tractable way and what this means is it gives us a
[01:00:39] way and what this means is it gives us a principled way to get an upper
[01:00:40] principled way to get an upper confidence bound on the reward function
[01:00:42] confidence bound on the reward function given that we have uncertainty over
[01:00:43] given that we have uncertainty over linear model and this was shown to be
[01:00:47] linear model and this was shown to be very useful for news article
[01:00:48] very useful for news article recommendations about 14 years ago and
[01:00:51] recommendations about 14 years ago and you can also look at chapter
[01:00:53] you can also look at chapter 19 so these are really useful this is
[01:00:56] 19 so these are really useful this is one way to sort of represent um a
[01:00:58] one way to sort of represent um a contextual Bandit
[01:00:59] contextual Bandit setting when you have you want to handle
[01:01:02] setting when you have you want to handle generalization we'll now talk briefly
[01:01:04] generalization we'll now talk briefly about how you might do this for markof
[01:01:06] about how you might do this for markof decision
[01:01:07] decision processes okay so if we think back to
[01:01:10] processes okay so if we think back to the MBI algorithm for finate Satan
[01:01:12] the MBI algorithm for finate Satan actions we have to modify a few things
[01:01:15] actions we have to modify a few things okay so if we think about this we were
[01:01:20] okay so if we think about this we were keeping track of counts and we were
[01:01:23] keeping track of counts and we were doing this we were building a model SE
[01:01:25] doing this we were building a model SE separately for every state in action so
[01:01:29] separately for every state in action so this count based term here that we're
[01:01:31] this count based term here that we're using as a bonus um we've already seen
[01:01:34] using as a bonus um we've already seen how we might be able to do Q functions
[01:01:35] how we might be able to do Q functions with like deep neural networks um but
[01:01:38] with like deep neural networks um but the big problem here is the count base
[01:01:40] the big problem here is the count base bonus like we have an infinite number of
[01:01:42] bonus like we have an infinite number of states if you think about Atari or
[01:01:44] states if you think about Atari or something like that you certainly don't
[01:01:45] something like that you certainly don't want to count you're mostly only going
[01:01:47] want to count you're mostly only going to see one Atari screen once ever um and
[01:01:51] to see one Atari screen once ever um and so these sort of count-based bonuses
[01:01:53] so these sort of count-based bonuses aren't very realistic
[01:01:56] aren't very realistic and so we're going to need ways
[01:01:57] and so we're going to need ways essentially but you know why do we have
[01:01:58] essentially but you know why do we have the account-based bonuses we have the
[01:02:00] the account-based bonuses we have the account-based bonuses to try to quantify
[01:02:01] account-based bonuses to try to quantify our uncertainty over how well do we know
[01:02:04] our uncertainty over how well do we know the reward model for this particular
[01:02:06] the reward model for this particular state in action and how well do we know
[01:02:07] state in action and how well do we know the Dynamics and so one of the ideas
[01:02:10] the Dynamics and so one of the ideas when deep RL came around was to think
[01:02:13] when deep RL came around was to think about could we lift this idea and try to
[01:02:14] about could we lift this idea and try to quantify our uncertainty in the deepl
[01:02:17] quantify our uncertainty in the deepl setting okay so we're going to need to
[01:02:19] setting okay so we're going to need to move beyond having these very simple
[01:02:21] move beyond having these very simple counts to think about something that's
[01:02:23] counts to think about something that's sort of a higher level representation of
[01:02:25] sort of a higher level representation of that
[01:02:26] that now if we could get that and I haven't
[01:02:28] now if we could get that and I haven't told you how we can get it yet you could
[01:02:30] told you how we can get it yet you could imagine that a lot of the algorithms
[01:02:31] imagine that a lot of the algorithms we've seen before could be extended
[01:02:33] we've seen before could be extended fairly easily so in particular um if you
[01:02:36] fairly easily so in particular um if you think about something like function
[01:02:38] think about something like function approximation with
[01:02:40] approximation with q-learning we could imagine just adding
[01:02:42] q-learning we could imagine just adding some sort of bonus term in here so
[01:02:44] some sort of bonus term in here so instead of having our empirical reward
[01:02:47] instead of having our empirical reward plus gamma times you know our Target
[01:02:49] plus gamma times you know our Target like our our observed NE state in action
[01:02:52] like our our observed NE state in action with some you know parameter weight we
[01:02:54] with some you know parameter weight we could just plug in some bonus
[01:02:56] could just plug in some bonus that's kind of what mbab is already
[01:02:58] that's kind of what mbab is already doing it's just that our bonus before
[01:03:00] doing it's just that our bonus before was determined by our counts and now we
[01:03:03] was determined by our counts and now we need some other way to lift that so we
[01:03:05] need some other way to lift that so we can do that for much more general
[01:03:06] can do that for much more general settings but once we have that we could
[01:03:08] settings but once we have that we could imagine plugging it in
[01:03:10] imagine plugging it in here okay so there's a lot of different
[01:03:14] here okay so there's a lot of different approaches that have been developed to
[01:03:15] approaches that have been developed to try to think about something of sort of
[01:03:17] try to think about something of sort of density or quantification of how much um
[01:03:20] density or quantification of how much um how many visits we have or how much
[01:03:22] how many visits we have or how much certainty we have over different parts
[01:03:24] certainty we have over different parts of the state in action space
[01:03:26] of the state in action space so one of the things that Mark bellam
[01:03:27] so one of the things that Mark bellam and others did which was pretty
[01:03:29] and others did which was pretty successful is they tried to build sort
[01:03:31] successful is they tried to build sort of pseudo counts over um parts of the
[01:03:34] of pseudo counts over um parts of the state and action space so you can
[01:03:36] state and action space so you can imagine maybe you've been some
[01:03:38] imagine maybe you've been some particular rooms in a video game many
[01:03:40] particular rooms in a video game many many times and so you try to essentially
[01:03:42] many times and so you try to essentially reduce your uncertainty over those
[01:03:45] reduce your uncertainty over those there's all sorts of important details
[01:03:47] there's all sorts of important details here around like whether you normally in
[01:03:49] here around like whether you normally in mbib every round you update all of those
[01:03:53] mbib every round you update all of those counts in reality if you think back to
[01:03:56] counts in reality if you think back to deep Q learning we maintained a buffer
[01:03:59] deep Q learning we maintained a buffer of State action rewards next States now
[01:04:01] of State action rewards next States now you would need to include those bonus
[01:04:03] you would need to include those bonus terms in there too and if those bonus
[01:04:05] terms in there too and if those bonus terms are changing how much do you
[01:04:06] terms are changing how much do you update your buffer just to give you a
[01:04:08] update your buffer just to give you a sense of some of the different wrinkles
[01:04:10] sense of some of the different wrinkles one has to think about okay but the high
[01:04:13] one has to think about okay but the high level important thing is that this
[01:04:14] level important thing is that this matters a lot so in Mont Zuma's Revenge
[01:04:16] matters a lot so in Mont Zuma's Revenge which was early on considered one of the
[01:04:18] which was early on considered one of the hardest Atari games probably still is um
[01:04:21] hardest Atari games probably still is um if you did search a standard dqn for 50
[01:04:23] if you did search a standard dqn for 50 million frames which is a lot it never
[01:04:26] million frames which is a lot it never got past the second room it just with
[01:04:28] got past the second room it just with Epsilon greedy exploration it was not
[01:04:31] Epsilon greedy exploration it was not strategic um it just got very bad
[01:04:34] strategic um it just got very bad performance but what Mark B and others
[01:04:36] performance but what Mark B and others showed is that by incorporating sort of
[01:04:38] showed is that by incorporating sort of a a notion of count based lifted to the
[01:04:41] a a notion of count based lifted to the generalization case you could do far far
[01:04:43] generalization case you could do far far far better okay so that's just to
[01:04:47] far better okay so that's just to highlight that there are ways to lift up
[01:04:48] highlight that there are ways to lift up this notion of sort of optimism
[01:04:50] this notion of sort of optimism uncertainty for this type of
[01:04:53] uncertainty for this type of setting there is similarly ways to
[01:04:55] setting there is similarly ways to Thompson sampling so um we've done some
[01:04:59] Thompson sampling so um we've done some work there where we think about sort of
[01:05:00] work there where we think about sort of particular representations and
[01:05:02] particular representations and parameters um Ian osband who introduced
[01:05:06] parameters um Ian osband who introduced psrl then tried to lift it up to the
[01:05:08] psrl then tried to lift it up to the Deep Q learning case they did it where
[01:05:10] Deep Q learning case they did it where they were just bootstrapping samples as
[01:05:12] they were just bootstrapping samples as an
[01:05:13] an approximation um that is a pretty um
[01:05:18] approximation um that is a pretty um pretty coarse um approximation of
[01:05:21] pretty coarse um approximation of uncertainty something else that often
[01:05:24] uncertainty something else that often worked pretty well surprised ly well
[01:05:26] worked pretty well surprised ly well given how simple it is is essentially to
[01:05:27] given how simple it is is essentially to do something just at the last layer so
[01:05:29] do something just at the last layer so at the last layer do something like basy
[01:05:31] at the last layer do something like basy and linear regression to try to get an
[01:05:33] and linear regression to try to get an uncertainty estimate and then sample
[01:05:35] uncertainty estimate and then sample from that so this is a pretty simple
[01:05:37] from that so this is a pretty simple thing um one could
[01:05:39] thing um one could try there's a lot of work to do this um
[01:05:42] try there's a lot of work to do this um let's go back to thinking of other
[01:05:44] let's go back to thinking of other really recent approaches which try to
[01:05:46] really recent approaches which try to think about doing this not just for one
[01:05:48] think about doing this not just for one task but many tasks where you need to do
[01:05:51] task but many tasks where you need to do generalization so early in this lecture
[01:05:53] generalization so early in this lecture I introduced the dream algorithm to you
[01:05:55] I introduced the dream algorithm to you which was um we later Ed to actually do
[01:05:59] which was um we later Ed to actually do grading of the breakout assignment the
[01:06:01] grading of the breakout assignment the notion in dream is that you have many
[01:06:03] notion in dream is that you have many different tasks and you're going to
[01:06:04] different tasks and you're going to learn how to explore in them
[01:06:06] learn how to explore in them efficiently so that was one example
[01:06:09] efficiently so that was one example where we're now really thinking about
[01:06:10] where we're now really thinking about how do we develop efficient exploration
[01:06:12] how do we develop efficient exploration strategies by leveraging structure over
[01:06:14] strategies by leveraging structure over the tasks where an agent is going to do
[01:06:16] the tasks where an agent is going to do a series of tasks similarly in some of
[01:06:19] a series of tasks similarly in some of our recent work we introduced decision
[01:06:21] our recent work we introduced decision pre-trained
[01:06:22] pre-trained Transformers um this was again a metal
[01:06:25] Transformers um this was again a metal in case the idea is that your agent's
[01:06:27] in case the idea is that your agent's going to do a series of Bandit problems
[01:06:29] going to do a series of Bandit problems or a series of RL problems and we want
[01:06:31] or a series of RL problems and we want to learn how to optimally explore in
[01:06:33] to learn how to optimally explore in those settings so I'll just show you
[01:06:36] those settings so I'll just show you briefly kind of how it works the idea in
[01:06:40] briefly kind of how it works the idea in this setting is we're going to use a
[01:06:41] this setting is we're going to use a pre-train
[01:06:42] pre-train Transformer one of the interesting
[01:06:44] Transformer one of the interesting things is you can map reinforcement
[01:06:45] things is you can map reinforcement learning to supervised learning similar
[01:06:48] learning to supervised learning similar to behavior cloning but instead of
[01:06:49] to behavior cloning but instead of relying on the data you collected in the
[01:06:51] relying on the data you collected in the past if you can compute what would have
[01:06:53] past if you can compute what would have been the right action to take there you
[01:06:55] been the right action to take there you can train it to predict that optimal
[01:06:58] can train it to predict that optimal action it turns out that when you do
[01:07:01] action it turns out that when you do that we can exactly map that back to
[01:07:03] that we can exactly map that back to doing the equivalent of Thompson
[01:07:05] doing the equivalent of Thompson sampling so in all the settings for
[01:07:07] sampling so in all the settings for which Thompson sampling has theoretical
[01:07:09] which Thompson sampling has theoretical guarantees this decision pre-trade
[01:07:11] guarantees this decision pre-trade Transformer can inherit those guarantees
[01:07:14] Transformer can inherit those guarantees which is pretty cool the nice thing too
[01:07:16] which is pretty cool the nice thing too is that empirically it can allow you to
[01:07:18] is that empirically it can allow you to get take advantage of structure that is
[01:07:21] get take advantage of structure that is present in your domain that you didn't
[01:07:22] present in your domain that you didn't have to code so let me just give you an
[01:07:25] have to code so let me just give you an example of that so what I showed you
[01:07:27] example of that so what I showed you earlier in this lecture is that if you
[01:07:30] earlier in this lecture is that if you have a domain where you have some linear
[01:07:32] have a domain where you have some linear structure if you're if you give that
[01:07:35] structure if you're if you give that linear structure to your algorithm then
[01:07:38] linear structure to your algorithm then you can do quite well so that's the GRE
[01:07:39] you can do quite well so that's the GRE line here so this is the amount of data
[01:07:41] line here so this is the amount of data you have over time and this is your
[01:07:42] you have over time and this is your cumulative regret lower is better so
[01:07:45] cumulative regret lower is better so most historical algorithms have assumed
[01:07:47] most historical algorithms have assumed you give that structure to your Bandit
[01:07:49] you give that structure to your Bandit like you know you write down oh there
[01:07:50] like you know you write down oh there are these 300 features that you need to
[01:07:53] are these 300 features that you need to pay attention to news articles and
[01:07:54] pay attention to news articles and people to figure out what the reward
[01:07:56] people to figure out what the reward will be if you give it that structure
[01:07:58] will be if you give it that structure and that structure is right you often do
[01:08:00] and that structure is right you often do pretty well you could not leverage that
[01:08:03] pretty well you could not leverage that structure and you would get something
[01:08:05] structure and you would get something like this okay so this is a Thompson
[01:08:07] like this okay so this is a Thompson sampling algorithm which just assumes
[01:08:09] sampling algorithm which just assumes that it doesn't have that linear
[01:08:11] that it doesn't have that linear structure one of the cool things that we
[01:08:14] structure one of the cool things that we found by this approach is that in this
[01:08:16] found by this approach is that in this setting if you really have a linear
[01:08:19] setting if you really have a linear structure in your domain and you're
[01:08:21] structure in your domain and you're doing many tasks and all of them have
[01:08:23] doing many tasks and all of them have this linear structure what our decision
[01:08:25] this linear structure what our decision pre train transformer will learn is that
[01:08:27] pre train transformer will learn is that even though you're not telling it it
[01:08:29] even though you're not telling it it will realize it can more compactly
[01:08:31] will realize it can more compactly encode that structure and so when you
[01:08:34] encode that structure and so when you deploy it on a new task you will get
[01:08:36] deploy it on a new task you will get Behavior almost as if you gave it the
[01:08:38] Behavior almost as if you gave it the unknown
[01:08:40] unknown structure so I think this is really
[01:08:42] structure so I think this is really interesting because often one of the
[01:08:43] interesting because often one of the brittle aspects of machine learning is
[01:08:45] brittle aspects of machine learning is that we had originally wrote down these
[01:08:47] that we had originally wrote down these sort of representations and of course
[01:08:49] sort of representations and of course one of the really amazing things for
[01:08:50] one of the really amazing things for deep learning is that we're trying to
[01:08:51] deep learning is that we're trying to not write down specific representations
[01:08:53] not write down specific representations as much and get much closer to to the
[01:08:55] as much and get much closer to to the input raw data and this is illustrating
[01:08:58] input raw data and this is illustrating that in terms of sort of sequential
[01:08:59] that in terms of sort of sequential decision- making and meta exploration
[01:09:01] decision- making and meta exploration for multiple tasks we can do something
[01:09:04] for multiple tasks we can do something similar here where we can sort of
[01:09:06] similar here where we can sort of inductively learn that that's a more
[01:09:08] inductively learn that that's a more compact way to represent the domains and
[01:09:11] compact way to represent the domains and get this sort of much more efficient
[01:09:13] get this sort of much more efficient exploration in new
[01:09:16] exploration in new tasks all right so just to conclude um
[01:09:19] tasks all right so just to conclude um we're wrapping up our notion of sort of
[01:09:21] we're wrapping up our notion of sort of data efficient reinforcement learning
[01:09:23] data efficient reinforcement learning today you should understand sort of this
[01:09:25] today you should understand sort of this tension between exploration and
[01:09:26] tension between exploration and exploitation in reinforcement learning I
[01:09:29] exploitation in reinforcement learning I haven't used these words they're not
[01:09:30] haven't used these words they're not great words so I don't use these um I
[01:09:32] great words so I don't use these um I haven't used these a lot of there but
[01:09:33] haven't used these a lot of there but exploration meaning you're taking time
[01:09:35] exploration meaning you're taking time to learn about the domain and
[01:09:37] to learn about the domain and exploitation meaning that you're
[01:09:38] exploitation meaning that you're leveraging that information to make good
[01:09:40] leveraging that information to make good decisions in the context of
[01:09:42] decisions in the context of reinforcement learning um you should be
[01:09:43] reinforcement learning um you should be able to Define and compare different
[01:09:45] able to Define and compare different sort of Notions of good whether like
[01:09:47] sort of Notions of good whether like empirical convergence regret and pack um
[01:09:50] empirical convergence regret and pack um you should know for the algorithms we've
[01:09:52] you should know for the algorithms we've talked about you know do they have for
[01:09:54] talked about you know do they have for example does greedy is greedy um
[01:09:57] example does greedy is greedy um sublinear regret which it's not you
[01:09:59] sublinear regret which it's not you should understand sort of the proof
[01:10:01] should understand sort of the proof sketch I did of why upper confidence
[01:10:03] sketch I did of why upper confidence bound is sublinear and regret all right
[01:10:05] bound is sublinear and regret all right and then next week we're going to talk
[01:10:07] and then next week we're going to talk about uh sort of alphao and how do we
[01:10:10] about uh sort of alphao and how do we think about doing smart adaptive tree
[01:10:12] think about doing smart adaptive tree search in really large games see you
[01:10:14] search in really large games see you then
Lecture 014
Stanford CS234 Reinforcement Learning I Multi-Agent Game Playing I 2024 I Lecture 14
Source: https://www.youtube.com/watch?v=UgANzoWc0nc
---
Transcript
[00:00:05] all right they should be up now
[00:01:05] All right, just take a second and then compare your answers to someone near you. The reason I'm asking you about these particular algorithms is because some of the ideas today, even though we're going to be talking about AlphaGo and Monte Carlo tree search, will be related to some of the things that helped make those advances possible. So this is just a good chance to refresh your understanding of how upper confidence bound algorithms work.
[00:02:06] The one I thought might be somewhat controversial in particular is the third one: whether or not, if you have a reward model and it's known, there's still any benefit to using an upper confidence bound algorithm.
[00:03:04] All right, let's come back together. It looks like there was good agreement on the first couple. The first one is true: you can think of upper confidence bounds as a way to balance our uncertainty over outcomes, when we have limited amounts of data, against using that information to still try to acquire high reward, and these algorithms can be used both in bandits and Markov decision processes. The third one is a little bit tricky; actually either answer would be fine depending on which setting you're looking at. So does somebody want to argue why, if the reward model is known, there is no benefit to using upper confidence bound
[00:03:46] algorithms? So in some settings there would not be. Someone tell me a setting where, if you knew the reward model, you should not use an upper confidence bound algorithm — something we saw over the last couple weeks that was different from the reinforcement learning framework.
[Student] The multi-armed bandits?
[00:04:07] That's right. In the multi-armed bandit case, where there's no state and no dynamics, the decisions that you make don't influence the next state at all, then exactly as was said: if you knew what the reward model is, you'd know how to act. Like, if I knew whether a customer liked ad A or ad B better, I would just show them ad A. So in a multi-armed bandit setting this statement is true; in general it's not true in RL, where it's generally false.
[00:04:44] Somebody want to tell me why in general it's false — why, even if you know the reward model in reinforcement learning, you might still want to use an upper confidence bound based algorithm?
[Student] Because what we want to know is the value function rather than just the immediate reward.
[00:05:02] That's right. Assuming, as was said, that you don't know the dynamics model, you don't know how to compute your optimal value function, so it's still often helpful to use upper confidence bounds. In fact, in many cases you might know the reward function for when you reach a state — you know that when a customer clicks on something, that's good — but the hard thing is to drive them into states where they are going to click on something or make a purchase. So in RL this is generally false, and we'll see some other examples today where it's helpful to use an upper confidence bound algorithm even though we know quite a lot about the world.
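To make the bandit half of that answer concrete, here is a tiny sketch (the ad names and reward numbers are illustrative, echoing the lecture's ad A / ad B example): with a known reward model and no state or dynamics, there is nothing left to explore, so the greedy choice is already optimal and a confidence bonus buys you nothing.

```python
# Hypothetical known reward model for a two-ad bandit (numbers illustrative).
known_rewards = {"ad_A": 0.7, "ad_B": 0.4}

def act(rewards):
    # greedy "policy": always show the ad the customer is known to like best
    return max(rewards, key=rewards.get)

chosen = act(known_rewards)  # always "ad_A" -- no exploration needed
```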
[00:05:44] So what we're going to be talking about today is Monte Carlo tree search and AlphaGo. Before we get into that, I'll just remind us a little bit about where we are in the course. We have just a few more weeks left. You should all have gotten feedback on your projects, and I encourage you to come talk to me or anybody else about any questions you have. I have two office hours this week because I was traveling late last week for a conference, so you're welcome to come to my office hours today, right after this class, or on Thursday; there are also a lot of other office hours in addition to that. A week from this Wednesday we're going to have a quiz. We'll send more details out about the quiz on Ed, but the main idea is that it's going to be multiple choice. It is designed to be easier than the midterm, but we'll give you the full amount of time — people generally take the full amount of time just to check their answers — and it'll cover the entire course, so everything up through, you know, the day before. Does somebody have any logistics questions before we get going?
[00:06:43] All right, so we're going to talk about Monte Carlo tree search and AlphaZero. As many of you may know, there was this amazing series of results from DeepMind in the 2016 to 2019 time period showing how you could use reinforcement learning and AI to conquer the board game Go, and this happened about a decade earlier than people expected. This was really considered one of the huge achievements in AI; people really thought it was going to take a lot longer. Chess had already been mastered, and checkers long before that, but there were a lot of different innovations, coming out of a long history of work, that DeepMind used to make this possible. I also think it incorporates a lot of interesting ideas that one might think could be helpful for trying to solve other problems. The other thing I think is interesting is that it's quite a different form of reinforcement learning than we've seen before — it's really reinforcement learning for computation — and we'll see a lot more
[00:07:45] computation um and we'll see a lot more about that okay so what we're going to
[00:07:48] about that okay so what we're going to start with is thinking about simulation
[00:07:50] start with is thinking about simulation based search um and I in simulation
[00:07:53] based search um and I in simulation based search is going to sound quite
[00:07:54] based search is going to sound quite familiar because we've been seeing ideas
[00:07:56] familiar because we've been seeing ideas around this with Monte Carlo search um
[00:07:59] around this with Monte Carlo search um Carlo methods but then we're going to
[00:08:01] Carlo methods but then we're going to think about combining these with using
[00:08:03] think about combining these with using different parts of the sort of
[00:08:04] different parts of the sort of stochastic decision-mak
[00:08:07] stochastic decision-mak process all right so in particular one
[00:08:09] process all right so in particular one of the major ideas that we're going to
[00:08:10] of the major ideas that we're going to be looking at today is the idea that
[00:08:12] be looking at today is the idea that we're going to be mostly focusing on how
[00:08:15] we're going to be mostly focusing on how to make figure out what we should do in
[00:08:17] to make figure out what we should do in the current state
[00:08:20] the current state only so in general in class whenever
[00:08:23] only so in general in class whenever we've been Computing a policy or a value
[00:08:24] we've been Computing a policy or a value function we've been Computing it for the
[00:08:26] function we've been Computing it for the entire State space so um we might have a
[00:08:29] entire State space so um we might have a policy and if anybody gave us a state we
[00:08:31] policy and if anybody gave us a state we could immediately tell you what action
[00:08:32] could immediately tell you what action or action distribution we should use or
[00:08:35] or action distribution we should use or we compute a q function for the whole
[00:08:37] we compute a q function for the whole Space one of the key ideas today is to
[00:08:40] Space one of the key ideas today is to say well maybe particularly if we've got
[00:08:42] say well maybe particularly if we've got an enormous space that we don't really
[00:08:44] an enormous space that we don't really care about trying to compute a optimal
[00:08:46] care about trying to compute a optimal policy for everything in the space maybe
[00:08:49] policy for everything in the space maybe we just want to use our current our use
[00:08:51] we just want to use our current our use our computation to really focus on a
[00:08:53] our computation to really focus on a good decision for right now or for
[00:08:55] good decision for right now or for whatever state that you might end up in
[00:08:57] whatever state that you might end up in and there are lots of reasons to think
[00:08:58] and there are lots of reasons to think that that might be important
[00:09:00] that that might be important particularly in really large domains so
[00:09:03] particularly in really large domains so you can imagine like you know if you're
[00:09:04] you can imagine like you know if you're the fed and you're trying to make p sort
[00:09:06] the fed and you're trying to make p sort of federal monetary policy you probably
[00:09:08] of federal monetary policy you probably don't care about doing this for all the
[00:09:10] don't care about doing this for all the scenarios which the US is not in you
[00:09:12] scenarios which the US is not in you really want to figure it out for the
[00:09:13] really want to figure it out for the current scenario um in the case of the
[00:09:15] current scenario um in the case of the board game go as we'll see there's just
[00:09:18] board game go as we'll see there's just an enormous space of potential States
[00:09:20] an enormous space of potential States you could end up in and it may not be
[00:09:22] you could end up in and it may not be important to have a perfect way of
[00:09:24] important to have a perfect way of acting in all of those so one big idea
[00:09:28] acting in all of those so one big idea here is that we're going to be mostly
[00:09:29] here is that we're going to be mostly focusing on computation to figure out
[00:09:31] focusing on computation to figure out what's the right thing to do in the
[00:09:32] what's the right thing to do in the current
[00:09:34] So one thing you might do in this case, given all the ideas we've seen in class, is simulate. Imagine that someone gives you a policy, and what you want to do is try to do at least as well as that policy, and maybe a little bit better. One thing you could do is say: well, I'm in a current real state, say s_t, and I'm going to think about all the different actions I could take next, and then I'm going to roll out using my default policy from those states. This is just like, for K episodes from the current real state, rolling out in my head what might happen. Now this means I need some access to a dynamics model — I can only do this if I have access to a model — and what I'm going to mean here by "model" is a dynamics and reward model. You might imagine that you actually know how the world works, or that you've learned some sort of model from the past; it could be estimated or true. And so then what we can do is just Monte Carlo evaluation, and what we're getting is an estimate of the Q function that says: if I start in this state, take this action, and roll out under my simulation policy, what is my expected return? That just gives me an estimate of the Q function — we've seen this before with Monte Carlo methods — and then you could just pick whatever action has the maximum value as the real action, the one you're going to take in the real world.
[00:11:06] What you can think of this as doing — and I'm just going to augment this with pi to make it even more clear — is essentially computing, from the current state, the Q value of my simulation policy. I roll out under that policy and then do one step of policy improvement given that. So if someone gave you a budget of computation, this would be one reasonable thing you could do with it. We're going to see a lot of things you can do that are much better than this, but this is one thing you could do that would be viewed as a simulation-based search, which would allow you to do better than the current policy
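The procedure just described — estimate Q^pi(s_t, a) for each first action by averaging K rollouts of the default policy, then act greedily — can be sketched as follows. This is a sketch under assumptions: `model(s, a)` samples a next state (returning `None` at a terminal), `reward(s, a)` gives the reward, and the tiny MDP at the bottom is purely illustrative.

```python
def rollout_q(model, reward, policy, state, action, horizon, gamma=1.0):
    """One Monte Carlo rollout: take `action` in `state`, then follow `policy`.

    `model(s, a)` samples a next state (None if terminal) and `reward(s, a)`
    gives the reward -- both assumed known or estimated from past data.
    """
    ret, s, a, discount = 0.0, state, action, 1.0
    for _ in range(horizon):
        ret += discount * reward(s, a)
        s = model(s, a)
        if s is None:              # terminal state reached
            break
        a = policy(s)              # after the first step, use the default policy
        discount *= gamma
    return ret

def simulation_based_search(model, reward, policy, state, actions, k, horizon):
    """Estimate Q^pi(state, a) by averaging k rollouts per first action,
    then do one step of policy improvement: act greedily at the current state."""
    q = {a: sum(rollout_q(model, reward, policy, state, a, horizon)
                for _ in range(k)) / k
         for a in actions}
    return max(q, key=q.get), q

# tiny illustrative MDP: every action terminates immediately; "good" pays 1
best_action, q_values = simulation_based_search(
    model=lambda s, a: None,
    reward=lambda s, a: 1.0 if a == "good" else 0.0,
    policy=lambda s: "good",       # default rollout policy (unused here)
    state="s0", actions=["good", "bad"], k=10, horizon=5)
```

Note this only improves on pi at the root: everything below the first action is evaluated under the default policy, which is exactly the limitation the lecture addresses next.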
[00:11:42] you have. All right, I think it's helpful to think about this in terms of what the tree structure is. The idea in this setting is that we start in some state and take a certain action — that's our action a — and after that we get to a next state s'. From then onwards we follow our policy pi, so here I sample an action according to pi(s'). So at the root, for my current state, I consider all possible actions, but after I take that action and transition to a next state, I just roll out by following my policy pi, all the way out until I hit a terminal state — so T here equals a terminal state.
[00:12:41] And so one thing you could do is just do the simulation and then average over the root nodes, but you also now have a whole bunch of data, so you could do other forms of reinforcement learning given that data.
[Student] Does the policy need to be optimal for this?
[00:12:56] Good question. It depends how we think of what this is computing. If this is just computing Q^pi(s, a), it's just doing one step of policy evaluation, so it will work whether pi is a good policy or a bad policy. But exactly as the question points out, if pi isn't very good, then you probably aren't going to get a very good result: you're just doing one step of policy improvement here, so you're not necessarily going to get Q* unless pi is really close to optimal.
[00:13:29] Okay, so this is one thing you could do, but I think the nice thing about visualizing the tree in this case is that it starts to make it really obvious that you could do other things that could be better than just following whatever your current policy is, or whatever policy you might have access to. And I'll just make this clear with the model and
[00:13:57] default policy. So instead, if we have limited amounts of computation, rather than just doing rollouts we might want to try to get something that's closer to Q*, and one way we could try to do this is by constructing an expectimax tree. Raise your hand if you've seen either minimax trees or expectimax trees before. Okay, a few people, but not everybody, so this will be a quick introduction. The idea is to think about what this forward search is really doing when we construct this tree. So this is the action — this could be, say, a_2, and this is a_1 — and this is a next state; imagine you just have a few states in these examples. The black nodes are all actions and the white nodes are all states. You can think of this — and we've seen similar graphs a while ago — as just rolling out your Bellman backups. You can think of what happens in the world as: I take an action and transition to some state, then I take another action and transition to some state, and sometimes I terminate. What I would do normally in this case is back up along this tree. So whenever I have states, I would take an average, or an expectation — this is really just representing the probability p(s' | s, a), so it's representing that sum — and every time I have actions, I would take a max, which is just representing that inside the Bellman backup I take the max over the actions. So you could think of this as approximating max_a [ R(s, a) + gamma * sum_{s'} p(s' | s, a) * V(s') ], except that instead of having V(s'), you just expand this out all the way until you hit the terminal state.
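The backup just described — max at action nodes, probability-weighted average at state nodes, unrolled to the horizon instead of using a stored V — can be sketched recursively. This is a sketch under assumptions: `model[s][a]` is a dict of next-state probabilities (the known P(s'|s,a)), states absent from `model` are terminal, and the toy MDP is illustrative.

```python
def expectimax(s, actions, model, reward, h, gamma=1.0):
    """Expectimax value of state s with horizon h.

    Implements the unrolled Bellman backup
        max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) * expectimax(s', h-1) ]
    with a max at action nodes and an expectation at state (chance) nodes.
    """
    if h == 0 or s not in model:      # horizon reached or terminal state
        return 0.0
    return max(
        reward(s, a) + gamma * sum(p * expectimax(s2, actions, model,
                                                  reward, h - 1, gamma)
                                   for s2, p in model[s][a].items())
        for a in actions)

# toy two-action MDP: "left" pays 1 and terminates, "right" pays 0
toy_model = {"s0": {"left": {"end": 1.0}, "right": {"end": 1.0}}}
toy_reward = lambda s, a: 1.0 if a == "left" else 0.0
v = expectimax("s0", ["left", "right"], toy_model, toy_reward, h=3)
```

Because the recursion expands every action and every next state at every level, its cost is exactly the exponential tree size discussed next.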
[00:16:00] hit the terminal State okay and this would require us
[00:16:02] State okay and this would require us also to keep track of the rewards we
[00:16:04] also to keep track of the rewards we obtain as we go down this
[00:16:06] obtain as we go down this tree so for example here you might get
[00:16:09] tree so for example here you might get reward of S Prime
[00:16:13] reward of S Prime a soes that makes sense so if you have
[00:16:16] a soes that makes sense so if you have access to a markof decision process and
[00:16:18] access to a markof decision process and its Dynamics model and its reward model
[00:16:20] its Dynamics model and its reward model one way you could use that to figure out
[00:16:22] one way you could use that to figure out what's the optimal thing to do in your
[00:16:24] what's the optimal thing to do in your current state is you build this tree
[00:16:27] current state is you build this tree build this tree into till at a leaf you
[00:16:30] build this tree into till at a leaf you reach a terminal node or for a fixed
[00:16:33] reach a terminal node or for a fixed Horizon H and then you back up by doing
[00:16:36] Horizon H and then you back up by doing Wherever You See It branching according
[00:16:38] Wherever You See It branching according to States you take an average weighted
[00:16:40] to States you take an average weighted by the probability of each state and
[00:16:42] by the probability of each state and whenever you get to an a set of action
[00:16:44] whenever you get to an a set of action nodes you take the
[00:16:46] nodes you take the max have any questions about that um we
[00:16:49] max have any questions about that um we might get to this later but like if
[00:16:50] might get to this later but like if we're considering like complex games
[00:16:52] we're considering like complex games like go like the state space is like
[00:16:56] like go like the state space is like massive right it's very unlikely that
[00:16:58] massive right it's very unlikely that you're going to run into the same same
[00:16:59] you're going to run into the same same um like composition of the board twice
[00:17:01] um like composition of the board twice like how do you deal with that great
[00:17:03] like how do you deal with that great question hold on to that we'll get to it
[00:17:04] question hold on to that we'll get to it yes yeah absolutely so right now well in
[00:17:07] yes yeah absolutely so right now well in fact on the next slide we'll talk about
[00:17:08] fact on the next slide we'll talk about how big this tree is okay but this at
[00:17:09] how big this tree is okay but this at least conceptually should be something
[00:17:11] least conceptually should be something that you think yeah we could do this I
[00:17:12] that you think yeah we could do this I could imagine doing this so why might
[00:17:14] could imagine doing this so why might this be better than before well this
[00:17:16] this be better than before? Well, this might be better than before because you're not actually solving the whole MDP: you're only doing sort of Bellman backups starting from the current state you're in. And so you might imagine that if the space is enormous, even though you're rolling this out in terms of this kind of exponentially growing tree, it still might be smaller than your whole state space. Okay, but as I was saying, this is huge in general. If you want to actually expand the whole tree, in general the size of the tree is going to scale as (the size of your state space times the size of your action space) to the H, where H is the horizon. And so as you could imagine, this is going to be terrible really quickly, right? If you think about Go, or Mountain Car, or other games or environments where you might be taking sort of a hundred to a thousand steps, this is going to be completely intractable.
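To make the (|S| × |A|)^H growth concrete, here is a tiny back-of-the-envelope calculation; the state, action, and horizon sizes below are invented for illustration, not taken from the lecture:

```python
# Size of a fully expanded lookahead tree: each level branches over every
# (action, next state) pair, so the node count scales as (|S| * |A|)**H.
def full_tree_size(n_states: int, n_actions: int, horizon: int) -> int:
    return (n_states * n_actions) ** horizon

print(full_tree_size(10, 5, 3))    # 125000: manageable for a tiny horizon
print(full_tree_size(10, 5, 100))  # a ~170-digit number: hopelessly intractable
```

Even modest state and action spaces blow up once the horizon reaches the hundreds of steps mentioned for Go or Mountain Car.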
[00:18:10] intractable right but as you might notice when we're looking at this here
[00:18:13] notice when we're looking at this here when we wrote out we sort of thought
[00:18:15] when we wrote out we sort of thought about all the next States we could
[00:18:18] about all the next States we could reach but that if that's a really large
[00:18:20] reach but that if that's a really large set we know that we don't necessar
[00:18:22] set we know that we don't necessar actually have to sample all of them and
[00:18:24] actually have to sample all of them and compute that exactly in order to get a
[00:18:26] compute that exactly in order to get a good estimate of the expectation we we
[00:18:28] good estimate of the expectation we we know that in fact we can just sample so
[00:18:30] know that in fact we can just sample so if you sample what's the next date 100
[00:18:32] if you sample what's the next date 100 times an average overall of their values
[00:18:35] times an average overall of their values that's a pretty good approximation what
[00:18:37] that's a pretty good approximation what the average value is even if there are
[00:18:39] the average value is even if there are 10 billion
[00:18:40] 10 billion States okay because you can approximate
[00:18:43] States okay because you can approximate um an expectation by an average and that
[00:18:46] um an expectation by an average and that tends to concentrate really quickly so
[00:18:49] tends to concentrate really quickly so that's going to be one of the really big
[00:18:50] that's going to be one of the really big ideas of using Monte Carlo tree search
[00:18:52] ideas of using Monte Carlo tree search is that we're not going to have to
[00:18:53] is that we're not going to have to expand all the next States we're just
[00:18:55] expand all the next States we're just going to sample them so let's see how
[00:18:57] going to sample them so let's see how that might work
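The sampling idea can be sketched in a few lines; the transition distribution and value function below are made up purely to illustrate that a small sample average concentrates near the true expectation:

```python
import random

random.seed(0)

n_states = 10_000_000                      # huge next-state space
value = lambda s: float(s % 7)             # hypothetical V(s'), true mean ~3.0

# Instead of summing over all 10 million next states, sample 100 of them
# (uniform transition model here) and average their values.
samples = [value(random.randrange(n_states)) for _ in range(100)]
estimate = sum(samples) / len(samples)
print(round(estimate, 2))                  # lands close to 3.0 from only 100 samples
```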
[00:19:00] So this is where we get into Monte Carlo tree search, and note I highlighted "tree" here, because we're not doing plain Monte Carlo search anymore. We're not just rolling out with a policy; we're essentially going to try to sample parts of that tree. But we're not going to just do single rollouts of a fixed policy π. We're going to build a search tree rooted at the current state, we're going to sample actions and next states, and we are going to explore different parts of that tree; we're not going to always follow the same simulation policy π. And then after the search is finished, we're going to take an action in the real world, namely whatever has the highest value as we estimate it at the root. At least, that's one way we could do things; we'll see some other ways to do it.
[00:19:49] And let me just give a little bit of intuition for why this works. This works because what we're doing in this case is approximating expectations with averages. So we're not actually trying to expand all the next states; we're just going to approximate that with averages, and that will turn out to concentrate pretty quickly, and that's going to be really helpful.
[00:20:18] Okay, so let's do a quick check-your-understanding. Oops, well, there you go; it's okay, you can think about whether or not you agree with this. Monte Carlo tree search involves deciding on an action to take by doing tree search. So think about whether it's a good choice for short-horizon problems, and why, and likewise for long-horizon problems and large state and action spaces.
[00:20:43] And actually the middle one is slightly debatable, so take a second and think about it. I'll pause here so that when I upload this later people can think about it too.
[00:21:20] So why might the first part be false? Why would we not want to do this? Well, first of all, does anybody have any questions on what Monte Carlo tree search is doing, in terms of how it's different from the other things that we could do?
[00:21:40] So then tell me why it's probably not a good choice for short-horizon problems with small state and action spaces. What would you do instead in those cases? Yeah? [inaudible student response] Maybe, but I guess what I was thinking more is, in that case, maybe just do dynamic programming. Yeah, if the state space and the action space are really small, you can just do value iteration. Yeah, Monte Carlo could work too, but in particular if things are really small: if you think back, it's been a long time I know, but in sort of standard dynamic programming it's only like S² × A per backup, and then you're just doing that H times. That's nice: you don't have any exponential dependence in that case. So if it's really small, just do Bellman backups, and the order of that is roughly S² × A times the horizon H. So at least it avoids the exponential.
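For that small case, a minimal finite-horizon value-iteration sketch is below; the toy MDP (4 states, 2 actions, horizon 10, random model) is hypothetical, and the point is just that each Bellman backup costs O(|S|² · |A|), for O(|S|² · |A| · H) total with no exponential in H:

```python
import numpy as np

S, A, H = 4, 2, 10
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))  # P[s, a] = distribution over next states
R = rng.random((S, A))                      # R[s, a] = immediate reward in [0, 1)

V = np.zeros(S)                             # value-to-go with 0 steps left
for _ in range(H):
    Q = R + P @ V                           # one O(S^2 * A) Bellman backup
    V = Q.max(axis=1)                       # greedy max over actions

print(V.shape)                              # (4,)
```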
[00:22:55] It will be a good choice for long-horizon problems with a large state space and a small action space, because what we're doing in this case is approximating that expectation by samples. So this is true, and this is false: we're approximating the expectation by samples, and that means instead of having that enormous state space that we're multiplying by, whether it's S² or such, we're just sampling from it, and so we can have something that's more like a constant with respect to how much we're sampling.
[00:23:32] Now the middle one is actually a little bit controversial, and we're going to see different ways to tackle this. Why should it be somewhat controversial? Well, in Monte Carlo tree search the initial way we're getting a big gain is that we're sampling next states instead of enumerating them. But it shouldn't be obvious that the same applies to actions: for actions we want to maximize, we want to take the best over all the actions, and so Monte Carlo tree search a priori still has to sample the whole action space. And so it's not clear yet that, unless we do something special, Monte Carlo tree search is necessarily going to help us when we've got really big action spaces, because in general we've replaced the expectation by a set of samples, but that hasn't told us yet how to do anything smart in terms of the action space. So this one is sort of debatable, maybe false, depending on how you think about it; but of course there are a lot of algorithms that combine with Monte Carlo tree search to show how we might be able to tackle this problem. What we really want to be able to do is solve long-horizon problems with enormous action spaces and enormous state spaces, and so we're going to need ideas beyond plain Monte Carlo tree search to tackle that.
[00:24:42] Okay, upper confidence tree search is one idea for how to do this, and I think UCT came out around maybe 2007 or 2008; people started using it for Go around then. The idea in this case is that, in addition to doing the sampling over next states, let's be strategic about what action we take when we're expanding our tree.
[00:25:05] So when we decide to sample a next action, it doesn't have to be from a default policy π; let's think carefully about essentially where we want to fill in our search tree. And this is one of those other really big ideas, because this is where we're going to start to think about ideas from reinforcement learning, essentially to optimize computation. Because right now we're still assuming that we know the MDP: we know what the dynamics model is and we know what the reward model is. So in theory, if computation were no issue, we could just do value backups; the challenge is that this is going to be completely enormous, and that's totally intractable. So the idea here is to say, well, if we have access to those models, we can still think of trying to approximate sort of Bellman backups, or approximate maxes, but we don't want to have to enumerate all the actions as much, and we want to really focus where we're using our computation. And DeepMind has been really a pioneer in thinking about using reinforcement learning to prioritize computation to solve a lot of really important problems, and I'll try to come back to that at the end.
[00:26:12] Okay, so how does UCT work? The idea, and this is why I asked you about this and refreshed your understanding, is that we're going to treat each node, each node that was sort of a state node inside of our tree search, as a bandit. So it's like we have many, many, many bandit problems inside of our search tree, and we're going to maintain an upper confidence bound over the reward of each arm inside of a node.
[00:26:40] So the first node you would have is your root node, right, and it would have, say, a1, a2, a3, and we would think of that as a MAB, a multi-armed bandit. Then when you get further down in the tree, say this goes to a next state s', that would be another a1, a2, a3, and that would be another multi-armed bandit. And you'd have to store in memory lots and lots and lots of different multi-armed bandits. So you're maintaining huge numbers of multi-armed bandits, and just like what we normally do with upper confidence bounds, we're going to maintain an upper confidence bound over each arm. But what we're going to be thinking of that as is essentially: what would happen if I take this action and then act optimally until the end? Now, one big challenge is that of course we don't know what the reward would be of acting optimally, so there are going to be a lot of different sort of policies in play at once; but let's see what that might look like. Okay, so here's the idea.
[00:27:48] might look like okay so here's the idea okay so let's say what we're going
[00:27:50] idea okay so let's say what we're going to call we're going to say we have a
[00:27:52] to call we're going to say we have a node I so this could be our root node or
[00:27:54] node I so this could be our root node or it could be any other node the way we're
[00:27:57] it could be any other node the way we're going to I'm just going to call
[00:27:59] going to I'm just going to call this AI we're going to try to maintain
[00:28:02] this AI we're going to try to maintain an upper confidence bound over what is
[00:28:05] an upper confidence bound over what is the potential expected discounted sum of
[00:28:07] the potential expected discounted sum of rewards we'd get starting in this node
[00:28:10] rewards we'd get starting in this node and taking this action as the
[00:28:13] and taking this action as the following let's say that we've been in
[00:28:15] following let's say that we've been in that particular node and we have rolled
[00:28:17] that particular node and we have rolled out from it using some strategy that we
[00:28:19] out from it using some strategy that we haven't really talked about yet ni a
[00:28:22] haven't really talked about yet ni a time so this is the number of times
[00:28:24] time so this is the number of times we've been to this node before and we've
[00:28:25] we've been to this node before and we've happened to expand the a action what we
[00:28:29] happened to expand the a action what we do is we look at all the returns we've
[00:28:31] do is we look at all the returns we've gotten under those cases so what's a
[00:28:34] gotten under those cases so what's a return again so a return would
[00:28:36] return again so a return would be go back to
[00:28:40] be go back to here so let's say you done this okay
[00:28:43] here so let's say you done this okay what you would do what your return in
[00:28:45] what you would do what your return in this case would be is it would be a
[00:28:47] this case would be is it would be a sum of all of these rewards you've
[00:28:50] sum of all of these rewards you've gotten along the way okay so G we're
[00:28:53] gotten along the way okay so G we're going to use to denote the return so
[00:28:55] going to use to denote the return so this would
[00:28:56] this would be reward from starting in state s
[00:28:59] be reward from starting in state s taking A1 and getting the rewards out to
[00:29:02] taking A1 and getting the rewards out to the terminal
[00:29:03] the terminal State and maybe next time you go down
[00:29:05] State and maybe next time you go down this action you actually get to here and
[00:29:08] this action you actually get to here and you get a different return so those are
[00:29:10] you get a different return so those are just like your Monty Carlo returns from
[00:29:12] just like your Monty Carlo returns from before and it's just for all the other
[00:29:15] before and it's just for all the other times youve went through that
[00:29:18] action okay so that's part of it so
[00:29:20] action okay so that's part of it so that's just an average and it's kind of
[00:29:22] that's just an average and it's kind of a weird average right because it might
[00:29:23] a weird average right because it might be that your which nodes you visited and
[00:29:26] be that your which nodes you visited and which actions you took have changed so
[00:29:28] which actions you took have changed so we're not committing it to it to be a
[00:29:29] we're not committing it to it to be a particular policy it's just like we've
[00:29:32] particular policy it's just like we've taken some action and we followed some
[00:29:34] taken some action and we followed some you know we've made a series of
[00:29:35] you know we've made a series of decisions till we got to a terminal
[00:29:36] decisions till we got to a terminal State we added up the rewards um and we
[00:29:39] State we added up the rewards um and we keep track of that here so that's one
[00:29:41] keep track of that here so that's one thing and that's sort of we probably
[00:29:43] thing and that's sort of we probably look at and think this is a very loose
[00:29:44] look at and think this is a very loose approximation of what the optimal Q
[00:29:46] approximation of what the optimal Q value is for that state in
[00:29:49] value is for that state in action the second term looks like upper
[00:29:52] action the second term looks like upper confidence bound which is you have some
[00:29:54] confidence bound which is you have some constant C you have some log term which
[00:29:57] constant C you have some log term which depends on the the number of times you
[00:29:58] depends on the the number of times you visited this node divided by niia number
[00:30:03] visited this node divided by niia number of times we've been in that node and
[00:30:04] of times we've been in that node and taken that particular action okay so
[00:30:07] taken that particular action okay so this just looks like a bandit term it's
[00:30:09] this just looks like a bandit term it's an upper confidence bound over the
[00:30:11] an upper confidence bound over the reward that we can
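Putting the two terms together gives the standard UCT score, the Monte Carlo average return plus an exploration bonus; the counts and returns below are toy numbers chosen to show the bonus at work:

```python
import math

def uct_score(total_return, n_node, n_node_action, c=1.4):
    """Average Monte Carlo return from (node, action) plus a UCB exploration bonus."""
    avg = total_return / n_node_action
    bonus = c * math.sqrt(math.log(n_node) / n_node_action)
    return avg + bonus

# Arm 0: higher average (50/20 = 2.5) but heavily visited.
# Arm 1: lower average (10/5 = 2.0) but rarely visited, so a bigger bonus.
scores = [uct_score(50.0, 30, 20), uct_score(10.0, 30, 5)]
print(max(range(2), key=lambda a: scores[a]))  # 1: the less-explored arm wins
```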
[00:30:14] And so what upper confidence tree search does, the way it picks the next action to take from the current node, is it picks whichever one of these has the higher upper confidence bound. Now, it should seem slightly suspicious that this works, because in bandits, when we took an action, we knew that we really had an unbiased estimate of the reward of that action, because we just observed that one action, and then we knew from Hoeffding's inequality that this really was an upper confidence bound on the true value of that arm. But now we're in a much weirder case, right, where we're thinking of this for a sequence of actions we're going to take, we're trying to take expectations over states, and the actual actions we're taking from this node onwards may not be optimal. You know, one time, I guess I'll draw it on the board, one time we might go through this zigzag, another time we might go through this zigzag, another time we might come back to here and then take a different action. So it's not like we're doing one step of policy improvement here; we just have lots of different things that we're trying, and we're averaging over them. So you should be slightly suspicious of whether or not this is going to do a reasonable thing. But it's certainly something you could do, something you could imagine coding.
[00:31:36] something you could imagine coating and then we'll do this many many
[00:31:39] coating and then we'll do this many many times and then at the very end so this
[00:31:41] times and then at the very end so this will expand essentially different parts
[00:31:43] will expand essentially different parts of your tree okay and when you're
[00:31:45] of your tree okay and when you're following this in particular you're
[00:31:47] following this in particular you're going to start to expand parts of the
[00:31:49] going to start to expand parts of the tree which look promising more okay so
[00:31:53] tree which look promising more okay so if this one happens to have been getting
[00:31:55] if this one happens to have been getting you know this one gets like plus 100 and
[00:31:57] you know this one gets like plus 100 and this gets plus 100 and this gets Plus 90
[00:32:01] this gets plus 100 and this gets Plus 90 whereas let's say one other time when
[00:32:02] whereas let's say one other time when you took this action you went down here
[00:32:05] you took this action you went down here and you got - 10 well then when the next
[00:32:08] and you got - 10 well then when the next time you get to your root node you're
[00:32:09] time you get to your root node you're probably going to be more likely to keep
[00:32:11] probably going to be more likely to keep going down this path so it's going to
[00:32:14] going down this path so it's going to sort of selectively expand parts of your
[00:32:17] sort of selectively expand parts of your tree it's not going to H and you'll sort
[00:32:19] tree it's not going to H and you'll sort of have these like unbalanced trees
[00:32:21] of have these like unbalanced trees where uh you'll often see like parts of
[00:32:23] where uh you'll often see like parts of things are getting filled in and then
[00:32:25] things are getting filled in and then maybe if something else becomes more
[00:32:26] maybe if something else becomes more promising you'll switch to an another
[00:32:28] promising you'll switch to an another part of the tree and fill in things
[00:32:29] part of the tree and fill in things there okay so it's sort of this
[00:32:31] there okay so it's sort of this unbalanced construction of your forward
[00:32:33] unbalanced construction of your forward search tree and the way that it's
[00:32:35] search tree and the way that it's unbalanced is that using the Monte Carlo
[00:32:39] unbalanced is that using the Monte Carlo aspect to approximate all the
[00:32:40] aspect to approximate all the expectations and you're using this upper
[00:32:42] expectations and you're using this upper confidence bound to sort of selectively
[00:32:44] confidence bound to sort of selectively prioritize across your actions that's
[00:32:47] prioritize across your actions that's what's going to help with our sort of
[00:32:48] what's going to help with our sort of enormous action
[00:32:51] enormous action space now you still might be concerned
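A compact end-to-end sketch of the loop just described, on an invented two-action MDP with a known model (an illustrative toy, not the lecture's or AlphaGo's exact implementation): repeatedly descend the tree picking the UCB-maximizing action at each node, sample a next state from the model, continue to the horizon, then back the observed return up through the visited (node, action) pairs.

```python
import math, random

random.seed(0)
ACTIONS = [0, 1]
H = 5

def step(s, a):
    """Toy known dynamics/reward model: action 1 is better in expectation."""
    s2 = random.choice([0, 1, 2])
    r = 1.0 if a == 1 else 0.5 * random.random()
    return s2, r

Ns, N, W = {}, {}, {}   # node visits, (node, action) visits, total returns

def ucb(node, a, c=1.4):
    if N.get((node, a), 0) == 0:
        return float("inf")              # force trying every arm at least once
    return W[node, a] / N[node, a] + c * math.sqrt(math.log(Ns[node]) / N[node, a])

def simulate(s, depth):
    if depth == H:
        return 0.0
    node = (s, depth)                    # one bandit per (state, depth) node
    a = max(ACTIONS, key=lambda a: ucb(node, a))
    s2, r = step(s, a)
    g = r + simulate(s2, depth + 1)      # Monte Carlo return from here onward
    Ns[node] = Ns.get(node, 0) + 1
    N[node, a] = N.get((node, a), 0) + 1
    W[node, a] = W.get((node, a), 0.0) + g
    return g

for _ in range(2000):                    # many simulations from the root
    simulate(0, 0)

root = (0, 0)
best = max(ACTIONS, key=lambda a: W[root, a] / N[root, a])
print(best)   # action 1 ends up with the higher estimated value at the root
```

Because the exploration bonus sits in the selection rule, the promising branch is visited far more often than the other one, which is exactly the unbalanced expansion described above.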
[00:32:53] space now you still might be concerned that when there's really like a really
[00:32:55] that when there's really like a really enormous number of actions like we're
[00:32:56] enormous number of actions like we're going to see in go in other cas cases
[00:32:58] going to see in go in other cas cases that this still isn't going to be enough
[00:33:00] that this still isn't going to be enough right because if the number of actions
[00:33:01] right because if the number of actions you have is like you know like a million
[00:33:06] you have is like you know like a million um these things
[00:33:09] um these things generally you'll have zero for this part
[00:33:11] generally you'll have zero for this part right before you have taken any actions
[00:33:13] right before you have taken any actions and your counts will all be the
[00:33:15] and your counts will all be the same so it should still be concerning
[00:33:18] same so it should still be concerning because like what should you do if you
[00:33:19] because like what should you do if you have these a thousand different actions
[00:33:21] have these a thousand different actions and like you might not be able to do
[00:33:23] and like you might not be able to do anything essentially until you visited
[00:33:25] anything essentially until you visited everything once
[00:33:28] everything once because before then as long as you've
[00:33:29] because before then as long as you've defined something that's a reasonable
[00:33:31] defined something that's a reasonable upper confidence bound everything is
[00:33:33] upper confidence bound everything is going to look awesome it's like oh
[00:33:35] going to look awesome it's like oh action 99 will be awesome action 100
[00:33:37] action 99 will be awesome action 100 will be awesome and so you'll have to
[00:33:38] will be awesome and so you'll have to sample all of them at least once and
[00:33:40] sample all of them at least once and that generally will be completely
[00:33:42] that generally will be completely intractable so I'll see ways to further
[00:33:45] intractable so I'll see ways to further um reduce this but what you can think of
[00:33:47] um reduce this but what you can think of this part is doing is saying well if you
[00:33:48] this part is doing is saying well if you can at least sample every action
[00:33:51] can at least sample every action once you can at least mean that you're
[00:33:53] once you can at least mean that you're not going to have to focus on
[00:33:54] not going to have to focus on unpromising actions later because you're
[00:33:56] unpromising actions later because you're going to quickly use a separate compid
[00:34:03] isbound so these sort of monticolo tre
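As a concrete illustration of why every action must be tried once: a toy UCB score (a generic sketch, not the lecture's exact formula) assigns untried actions an unbounded bonus, so with a thousand actions all thousand tie at the top until each has been sampled.

```python
import math

# Toy UCB score (a generic sketch, not the lecture's exact formula):
# value estimate plus an exploration bonus that shrinks with the count.

def ucb_score(mean_reward, n_action, n_total, c=1.0):
    # Untried actions get an unbounded bonus, so they always win the
    # argmax until each has been sampled at least once.
    if n_action == 0:
        return float("inf")
    return mean_reward + c * math.sqrt(math.log(n_total) / n_action)

# With 1,000 actions and no data yet, every action ties at +inf: the
# agent has to try all 1,000 before the bound can rule anything out.
n_actions = 1000
counts = [0] * n_actions
means = [0.0] * n_actions
scores = [ucb_score(means[a], counts[a], n_total=1) for a in range(n_actions)]
untried = [a for a in range(n_actions) if scores[a] == float("inf")]
print(len(untried))  # 1000
```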
[00:34:03] So these Monte Carlo tree searches were starting to look really promising. A lot of people have used tree-search-based algorithms, as you might have seen in other AI classes in particular. But what these Monte Carlo and UCT-based approaches give you is this highly selective best-first search, but with simulations as well: they're using sampling to break the curse of dimensionality, and UCT to help with large action spaces.
[00:34:48] And the other really nice benefit of these is that they're parallelizable. When you're sampling things, you could certainly imagine doing these rollouts many, many times and then collecting the results, so you can start to parallelize these methods as well, and that's going to be really helpful.
[00:35:11] Okay, so that's the background behind Monte Carlo tree search. But now, of course, the really big breakthrough that this allowed, that people built on these ideas to achieve, is AlphaGo, then AlphaZero, and then MuZero. So there's a whole sequence of them, and let's just get up a movie for that for a second.
[00:35:36] And who here has played Go? Okay, a few people. I think it could be fun to have us all play it so we can see that it's really quite hard, but that's for another time.
[00:35:48] "Go is the world's oldest continuously played board game. It is one of the simplest and also most abstract. Beating a professional player at Go is a long-standing challenge of artificial intelligence. Everything we've ever tried in AI just falls over when you try the game of Go. The number of possible configurations of the board is more than the number of atoms in the universe. AlphaGo found a way to learn how to play Go."
[00:36:20] Building suspense; we'll see how the network goes. This is a documentary of DeepMind's efforts to beat the world-class players at Go. Let me see if I can make it work.
[00:36:30] world class people and go let me see if I can make it work think it's probably
[00:36:32] I can make it work think it's probably decided it doesn't like the internet
[00:36:34] decided it doesn't like the internet right now just double check if I can get
[00:36:36] right now just double check if I can get that to
[00:36:38] that to work so what they ended up doing is they
[00:36:40] work so what they ended up doing is they are they're going to use reinforcement
[00:36:42] are they're going to use reinforcement learning um to help solve this problem
[00:36:44] learning um to help solve this problem we'll see whether or not the technical
[00:36:45] we'll see whether or not the technical difficulties resolve and then what they
[00:36:47] difficulties resolve and then what they did is they started playing against
[00:36:49] did is they started playing against grand grand Masters and they tried to
[00:36:51] grand grand Masters and they tried to then they played against Lisa doll who
[00:36:52] then they played against Lisa doll who was one of the best people in the world
[00:36:54] was one of the best people in the world ago and I think one of the really
[00:36:56] ago and I think one of the really interesting things about this is that it
[00:36:58] interesting things about this is that it really shows that it's now possible to
[00:37:00] really shows that it's now possible to use AI to beat the world best people in
[00:37:03] use AI to beat the world best people in the world at go but also the types of
[00:37:06] the world at go but also the types of strategies that it built were very
[00:37:08] strategies that it built were very different than what people were doing
[00:37:10] different than what people were doing before and so I think this is a a pretty
[00:37:12] before and so I think this is a a pretty important aspect for AI because we've Al
[00:37:16] important aspect for AI because we've Al often thought of AI as sort of
[00:37:17] often thought of AI as sort of automating things that people already
[00:37:19] automating things that people already know how to do and I think this
[00:37:21] know how to do and I think this Illustrated that there are really
[00:37:23] Illustrated that there are really starting to be places where computers go
[00:37:25] starting to be places where computers go beyond even the best humans and what we
[00:37:27] beyond even the best humans and what we know how to do and since then there's
[00:37:30] know how to do and since then there's been a recent paper I think maybe a year
[00:37:32] been a recent paper I think maybe a year or two ago by Bean Kim trying to look
[00:37:34] or two ago by Bean Kim trying to look whether or not you can teach Grand
[00:37:36] whether or not you can teach Grand Masters using the strategies that um
[00:37:39] Masters using the strategies that um alphao and its descendants in invented
[00:37:43] alphao and its descendants in invented and so then there's this really
[00:37:44] and so then there's this really interesting opportunity and question to
[00:37:45] interesting opportunity and question to think about can we actually learn from
[00:37:48] think about can we actually learn from computers in these new ways and try to
[00:37:50] computers in these new ways and try to sort of exceed both human level
[00:37:51] sort of exceed both human level performance and computer level
[00:37:53] performance and computer level performance so I will post this later
[00:37:56] performance so I will post this later you guys can look at it uh um let's go
[00:37:58] you guys can look at it uh um let's go back to there
[00:38:01] Okay, all right, so how does Go work? Well, it's a really, really old game. It's considered one of the classic hardest board games, and it was considered a grand challenge for AI for many, many decades. The sort of game tree search that we saw before is something like a forward search. Now, it couldn't be expectimax in this case, because it's a two-player game: Go is what's considered a zero-sum game, meaning that someone either wins or loses, and whenever we think of a next state, rather than it being an expectation, it's really a minimax problem, because each opponent is playing to win.
[00:38:42] So in this case it's good to think about what is actually uncertain. When we're playing Go, the rules of the game are known. They actually have another descendant now where you don't have to know the rules of the game, but certainly for the first few systems the rules of the game were known. So what's unknown in Go, if we wanted to think about building a tree or trying to learn in this case: if we know the rules, are the dynamics known?
[00:39:07] "Well, you might expect your adversary to play the best move; that might not always be true, they might be seeking a different strategy, so you wouldn't know that." Good point. So it might be, as you're saying, that you might not know exactly what the best strategy is, or you might not know whether someone is going to play the best strategy. The other thing I think of is that we don't always know what the best strategy is: it's just incredibly hard to compute in this case. And so that next state, if it's really coming from an adversary, it's not clear; you've got stochasticity in there, because you don't know what the optimal game is. Now of course, once someone picks a move, everything is deterministic, so in some ways it's all deterministic, it's all known. The key thing is that because it's this adversarial game, it's not clear what the optimal strategy is, and that's one of the really hard parts.
[00:39:59] All right, just a couple of basics on the rules of Go. Normally it's played on a 19 by 19 board, but when people first started (well, kids, and also when AI researchers started tackling this game in earnest, starting in the late 2000s), David Silver, who's one of the authors of this work and an amazing researcher, as part of his PhD around 2008 or 2009 was doing things on a 9 by 9 board. Just as a couple of basics: there are two different players, one playing the black stones and one the white stones, and you're trying to surround stones so that they're captured. And then you win; as I said, it's a 0-1 game, which means winner takes all.
[00:40:44] One of the interesting things about Go is that in general there's no intermediate reward, so you have to play till the end of the game to see who's actually winning. There's just a single reward at the end, which also makes credit assignment very hard: understanding which moves caused the resulting game.
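To see why a single terminal reward makes credit assignment hard: a Monte Carlo-style labeling gives every move of a game the same final outcome, so a brilliant move and a blunder in a won game both get the same credit. A tiny sketch (the move names are hypothetical placeholders):

```python
# Why a single terminal reward makes credit assignment hard: a Monte
# Carlo-style labeling hands every move of the game the same final
# outcome, so a brilliant move and a blunder in a won game both get +1.
# The move names below are hypothetical placeholders.

def label_moves(moves, final_reward):
    """Attach the single end-of-game reward to every move played."""
    return [(move, final_reward) for move in moves]

game = ["move_1", "move_2", "move_3"]  # the moves of one (won) game
print(label_moves(game, 1))
# [('move_1', 1), ('move_2', 1), ('move_3', 1)]
```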
[00:41:02] So, AlphaGo and AlphaZero. AlphaGo was the first one that was used; they developed a number of variants, they then played against Lee Sedol, and then there was AlphaZero. What they exhibit is a number of really interesting features: they have self-play, strategic computation, and highly selective best-first search; they use the power of averaging; they leverage local computation; and they learn and update heuristics. For those of you that have seen tree-search-based methods before, you've probably seen ideas around heuristics, which are ways to think about how you expand the tree. One of the interesting ideas in these papers is that they're going to learn those heuristics and update them over time. That's another important aspect. So let's see how it works.
[00:41:49] important aspect so let's see how it works so how does selfplay work um so
[00:41:54] works so how does selfplay work um so the key idea in this case is that we're
[00:41:55] the key idea in this case is that we're going to have the agent play it
[00:41:58] going to have the agent play it itself so there's going to you can think
[00:42:00] itself so there's going to you can think of it is there being two copies of the
[00:42:01] of it is there being two copies of the same agent right now um and what will
[00:42:05] same agent right now um and what will happen when they're playing a game is
[00:42:06] happen when they're playing a game is they compute the best move at the
[00:42:07] they compute the best move at the current state and then the opponent does
[00:42:10] current state and then the opponent does the same and they have access
[00:42:11] the same and they have access essentially to same the same policy or
[00:42:13] essentially to same the same policy or the same sort of algorithm but they're
[00:42:15] the same sort of algorithm but they're both just using it in an adversarial
[00:42:18] both just using it in an adversarial way and so that means like the only
[00:42:20] way and so that means like the only bottleneck in this case is computation
[00:42:22] bottleneck in this case is computation we have no humans involved um and
[00:42:24] we have no humans involved um and selfplay also provides a well match
[00:42:26] selfplay also provides a well match player
[00:42:28] So take a second and think about what benefits you get with self-play, and what the reward density is going to be: are there going to be lots of rewards when you do self-play, or very few? Let's just take a second, and I'll check whether or not I can make the network work. Maybe talk to your neighbor and see if you both have the same idea of whether self-play will be helpful or not.
[00:43:11] All right, what does this do to policy training? What happens when you do self-play? Do you have high reward density, or low reward density? Raise your hand if you think you have high reward density. Raise your hand if you think you have low reward density. Okay, so, would somebody who thinks we have high reward density like to explain why? That's right, we do have pretty high reward density. Why do we get that when we do self-play? I think it's easiest to think of what happens if you play against someone that's much, much better than you.
[00:44:00] You kind of lose all the time, right? Everyone's probably done this before: you play against a friend of yours who's maybe much better at a board game than you, or much better than you at tennis, and it's not normally that fun, because you just lose all the time. And when you lose all the time, you may not get very much signal about what things you're doing better or worse at, because you always lose. So that would be a case where the reward density is very low, because the players are really mismatched, and it means that most of the time the agent is not winning. Now, the same thing is true if the agent is much better than the other agent.
[00:44:35] But self-play means you're matched at the same level, like someone who plays tennis at the same level as you, or someone who has the same Elo score as you in chess or Go, which is a way to quantify player skill. The nice thing about that is you'd expect that, roughly, if you play someone that's exactly the same level (and here you're an RL agent, so you're going to play someone that's actually you, just on the other side, so they're exactly the same level as you), you'd expect to win about half the time on average. That's really good density for something that is a 0-1 game, because you're not just getting a zero or a one every, you know, 3,000 games; here, about half the time you'd expect to get a one, and half the time a zero. And the reason that might be beneficial is that hopefully it's going to give you a lot more signal about how you should change your policy in order to figure out how to get better.
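This can be checked with a toy simulation (an illustrative sketch, not from the lecture): model each game's win probability with an Elo-style logistic curve. A badly mismatched pairing returns almost all zeros, while a copy of yourself yields wins about half the time.

```python
import random

# Toy reward-density simulation (illustrative sketch, not from the lecture).
# Model each game's win probability with an Elo-style logistic curve:
# a 400-point skill gap corresponds to roughly 10:1 odds.

def win_prob(my_skill, opp_skill):
    return 1.0 / (1.0 + 10 ** ((opp_skill - my_skill) / 400))

def win_rate(my_skill, opp_skill, games=10_000, seed=0):
    rng = random.Random(seed)
    p = win_prob(my_skill, opp_skill)
    return sum(rng.random() < p for _ in range(games)) / games

mismatched = win_rate(1000, 2000)  # novice vs. much stronger player
self_play = win_rate(1500, 1500)   # agent vs. an identical copy of itself
print(mismatched)  # near 0: almost every game returns the same answer
print(self_play)   # near 0.5: half wins, half losses, lots of signal
```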
[00:45:29] figure out how to get better so I think that selfplay is a
[00:45:32] better so I think that selfplay is a really interesting one because you could
[00:45:33] really interesting one because you could think of it in some ways as kind of
[00:45:34] think of it in some ways as kind of providing an automatic
[00:45:37] providing an automatic curriculum on the next one so the
[00:45:39] curriculum on the next one so the rewards are going to be pretty
[00:45:40] rewards are going to be pretty dense and for those of you that have
[00:45:42] dense and for those of you that have seen curriculum learning before another
[00:45:44] seen curriculum learning before another machine learning stuff just like in
[00:45:46] machine learning stuff just like in classes where you often sort of build up
[00:45:48] classes where you often sort of build up with math over time and you don't start
[00:45:50] with math over time and you don't start with Calculus you start with like
[00:45:51] with Calculus you start with like addition or what a number is and then
[00:45:53] addition or what a number is and then you slowly build up so you're always
[00:45:55] you slowly build up so you're always trying to be on roughly the right level
[00:45:57] trying to be on roughly the right level similarly here the agent should do that
[00:46:00] similarly here the agent should do that automatically because they're going to
[00:46:02] automatically because they're going to start off and they're going to both be
[00:46:03] start off and they're going to both be terrible at go but they're still going
[00:46:05] terrible at go but they're still going to get pretty high density of reward
[00:46:06] to get pretty high density of reward because they're both terrible to go and
[00:46:09] because they're both terrible to go and then over time the agents are going to
[00:46:10] then over time the agents are going to get better and then now they're
[00:46:12] get better and then now they're automatically always playing an agent
[00:46:14] automatically always playing an agent that's roughly the same level as
[00:46:16] that's roughly the same level as them now we'll have to see why the
[00:46:18] them now we'll have to see why the algorithm will help them get better but
[00:46:20] algorithm will help them get better but intuitively as we saw even with like the
[00:46:22] intuitively as we saw even with like the Monte Carlo simulation not even tree
[00:46:24] Monte Carlo simulation not even tree search there it was doing like one step
[00:46:27] search there it was doing like one step of policy Improvement so you can imagine
[00:46:29] of policy Improvement so you can imagine that if even if each round we just doing
[00:46:31] that if even if each round we just doing kind of like one step of policy
[00:46:32] kind of like one step of policy improvement over time we would hope that
[00:46:34] improvement over time we would hope that we're going to get better and better
[00:46:36] Okay, so this idea of self-play I think is a really interesting one. It works really well in games, and it's been exploited a lot. I've often thought it would be really interesting to see whether there are other places you can set up to essentially be like a game, because what self-play is leveraging is that, for the dynamics part of your environment, you now have a simulator you can plug in, which is the agent itself. Now, in general you can't do that: if I'm going to simulate patient dynamics, I can't do self-play for that, right? An action is how the patient responds to some treatment, and I can't play two patients against each other; that doesn't make sense. But in some cases it's really a very reasonable thing to do to use self-play, and it can be really efficient, because you can think of it in a way as changing the strategies so that you're iteratively updating the complexity of the environment you're trying to solve. Yeah? "I think I have a good idea, but..."
[00:47:45] Ah, good question. I mean a lot of things here. What I mean by reward density is how often you're going to win. Here, rewards only happen at the end, so it's just: of the games you play, are you going to get a lot of reward? If the agents are really mismatched, in general the reward density is going to be either saturated, meaning you always win, or near zero, because you never win, and neither of those is very informative. The idea is that if you're getting reward about half the time, that can be really informative, because you get lots of signal: that thing worked, that didn't work, that worked, that didn't. And so you have a lot of data to estimate kind of a gradient, or an improvement, for your decision policy.
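One simple way to make the "about half the time is most informative" intuition concrete (this is an illustration, not a formula from the lecture): treat each game as a win/lose coin flip with win probability p. The variance of that single outcome, p(1 - p), is a crude measure of per-game signal, and it peaks at p = 0.5 while vanishing at the saturated extremes.

```python
def outcome_variance(p):
    """Variance of a single win/lose (Bernoulli) outcome with win
    probability p.  Illustrative only: near-saturated or near-zero
    win rates carry almost no per-game signal about which strategy
    changes helped, while a ~50% win rate maximizes it."""
    return p * (1 - p)

balanced = outcome_variance(0.5)    # 0.25, the maximum
saturated = outcome_variance(0.99)  # 0.0099, almost no signal
```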
[00:48:31] [Student] Yeah, with self-play, can we say that if you play against someone who has a completely new strategy, you might not be able to generalize well enough, because you were always playing against yourself, always using the same kinds of strategies? [Instructor] Great question. So self-play might be good, but then what if you suddenly play against someone really different? What we're going to have to see in this case is whether or not, over time, you get to something that's essentially a minimax policy. If you get to the optimal policy, you could hope that you really are at, you know, Grandmaster level or beyond, and one of the exciting things here is that these systems do get to that: as this ratchets up and up, after lots and lots of training, and with very, very complicated networks, you can get to that level.
[00:49:27] level um does does that work well does that work
[00:49:30] does does that work well does that work for like games where uh moves are not
[00:49:33] for like games where uh moves are not like deterministic like uh I don't know
[00:49:36] like deterministic like uh I don't know like the gambling games like poker or
[00:49:38] like the gambling games like poker or something where there is some sort of
[00:49:39] something where there is some sort of probability yeah inter so there's also
[00:49:42] probability yeah inter so there's also been a lot of work there's um there are
[00:49:43] been a lot of work there's um there are really really good AI agents for poker
[00:49:45] really really good AI agents for poker now I think it was 2019 that um gome
[00:49:49] now I think it was 2019 that um gome Brown a paper in science showing that
[00:49:51] Brown a paper in science showing that you could beat um I don't know if you
[00:49:53] you could beat um I don't know if you could beat but it was certainly sort of
[00:49:55] could beat but it was certainly sort of competitive with top humans I believe um
[00:49:58] competitive with top humans I believe um so Thomas sandome and noome brown who
[00:50:00] so Thomas sandome and noome brown who did his PhD at CMU had have got an agent
[00:50:03] did his PhD at CMU had have got an agent to do well at poker the algorithms are
[00:50:05] to do well at poker the algorithms are slightly different um but yes you can
[00:50:07] slightly different um but yes you can you this but it's a good question here
[00:50:09] you this but it's a good question here we're assuming that it's also going to
[00:50:11] we're assuming that it's also going to leverage the deterministic
[00:50:13] leverage the deterministic nature
[00:50:15] Yeah, all right. Okay, so how does this work? Let's go through what it's doing, because it relates to upper confidence tree search, but there are many changes; many improvements were needed for it to get much better. It is going to be similar in the sense that it's going to simulate many, many, many games, and it's going to iteratively try to learn better strategies. One of the things that is going to be different, compared to a naive upper confidence tree, is that we're going to actually maintain a neural network. So let me just get back to there.
[00:50:54] Okay, so what we're going to do in this case is we're going to have a neural network that, given a state, can produce both an estimate of V(s) and a policy distribution over actions for that state. We're going to maintain a single neural network. This is what AlphaZero does: it maintains a single neural network that, given an input state, will output both an estimate of the value of that state and a policy for that state, a distribution over actions. We're going to talk about how we train that shortly, but for now just assume, to start, that we've already trained it, or that we have access to it, and we're going to use it.
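As a concrete sketch of the "single network, two heads" idea, here is a toy stand-in. Everything here is an assumption for illustration: the feature sizes and hand-set weights are invented, and the real AlphaZero model is a deep residual network trained by self-play, not a two-layer toy.

```python
import math

def two_headed_net(state):
    """Toy stand-in for AlphaZero's single network: given a state
    (here a 3-feature vector), return (value estimate, policy
    distribution over 3 actions).  Weights are hand-picked and
    purely illustrative."""
    w_value = [0.5, -0.25, 0.1]
    w_policy = [[1.0, 0.0, -1.0],
                [0.2, 0.3, 0.1],
                [-0.5, 0.5, 0.0]]
    # Value head: a scalar in (-1, 1), like a predicted game outcome.
    v = math.tanh(sum(w * s for w, s in zip(w_value, state)))
    # Policy head: softmax over one logit per action.
    logits = [sum(w * s for w, s in zip(row, state)) for row in w_policy]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    p = [e / total for e in exps]
    return v, p

v, p = two_headed_net([1.0, 0.0, 0.0])
```

The point is only the interface: one forward pass yields both the value estimate used at leaves and the action prior used inside the upper confidence bound.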
[00:51:36] Now we're going to play a number of games. In particular, let's first think about how we compute the first move in a single game. We're going to do self-play in this case, between two agents that both use the same network, and we're going to do an upper-confidence-bound-based procedure: compute an upper confidence bound for each action and take the max. What are these upper confidence bounds going to be based on? This is going to look like UCT, but slightly different.
[00:52:07] What is U equal to in this case? Let's say this is node i. Then U(i, a) is going to be proportional to the following:

U(i, a) ∝ P(s, a) / (1 + N(i, a))

where P comes from our policy network and N(i, a) is a visit count. This means our upper confidence bound includes a bias towards some actions over others: the policy network says, if you give me a state, I will give you a distribution over actions. That helps with the fact that we have an enormous number of actions, so it prioritizes actions that we think, in general, might be better for these types of states. This will be a deep neural network, that network up there, a huge deep network, and it's going to try to leverage similar types of states to suggest which actions might be useful in this particular state. The other thing you can see here is that this bonus is going to decay as we visit a state-action pair more.
[00:53:27] more um here in this case so I'll just be a little
[00:53:29] be a little careful where this is all going to be
[00:53:31] careful where this is all going to be operating I believe this
[00:53:33] operating I believe this part it's really I so this is I think at
[00:53:36] part it's really I so this is I think at the node level I'll double check that
[00:53:40] the node level I'll double check that but
[00:53:41] but um the PSA has to be at the state level
[00:53:44] um the PSA has to be at the state level so remember you'll be in some State at
[00:53:46] so remember you'll be in some State at this point and you could feed this into
[00:53:47] this point and you could feed this into like a convolutional neural network it's
[00:53:49] like a convolutional neural network it's an image of the board or you know some
[00:53:51] an image of the board or you know some other deep neural network so that part
[00:53:53] other deep neural network so that part has to generalize but I'm pretty sure
[00:53:55] has to generalize but I'm pretty sure I'll double check this at the the count
[00:53:56] I'll double check this at the the count here is actually specific to this
[00:53:58] here is actually specific to this particular
[00:54:00] Now, why is this U interesting? It's interesting both because it incorporates a priority function over actions, saying some actions are better or worse, which changes which ones we expand; and because we are decaying faster than normal upper confidence bounds. Recall that in UCT, U is proportional to 1/√N(i, a). The 1/(1 + N(i, a)) term is going to decay a lot faster. That means we're being a lot more aggressive with our upper confidence bound; we're shrinking it fast, so we're going to do a lot less exploration of things that we think are not so good. Okay, so that's one really important part of how we pick what to expand. The other part is this notion of Q.
[00:54:45] to pick what to expand the other part is this notion of Q so how are we defining
[00:54:49] this notion of Q so how are we defining Q for this node Q is going to be equal
[00:54:52] Q for this node Q is going to be equal to 1 /
[00:54:54] to 1 / n by a and again I'll double check this
[00:54:56] n by a and again I'll double check this is no rather than
[00:54:58] is no rather than States Su over S Prime B of S
[00:55:02] States Su over S Prime B of S Prime okay what this means here is that
[00:55:06] Prime okay what this means here is that this is going to be an empirical
[00:55:07] this is going to be an empirical estimate of what the value is over the
[00:55:10] estimate of what the value is over the states that we've reached by following
[00:55:12] states that we've reached by following this particular action in this
[00:55:14] this particular action in this node okay and we're going to see where
[00:55:16] node okay and we're going to see where that comes from
[00:55:18] that comes from shortly okay it's going to be a little
[00:55:20] shortly okay it's going to be a little bit different than what we saw
[00:55:21] bit different than what we saw before but these are the two components
[00:55:24] before but these are the two components that we're going to use to decide which
[00:55:27] that we're going to use to decide which action to expand so yeah by to the node
[00:55:32] action to expand so yeah by to the node here we're talking about the identity of
[00:55:33] here we're talking about the identity of the node in The Matrix or the state of
[00:55:36] the node in The Matrix or the state of the node and the state the state so you
[00:55:38] the node and the state the state so you can think of sort of like what what a
[00:55:40] can think of sort of like what what a what a node here is this case is it is a
[00:55:42] what a node here is this case is it is a particular board game
[00:55:44] particular board game configuration so it's like saying the
[00:55:47] configuration so it's like saying the white pieces are here and the black
[00:55:49] white pieces are here and the black pieces are here so it's like you could
[00:55:51] pieces are here so it's like you could think of as just like an image an image
[00:55:53] think of as just like an image an image of the
[00:55:53] of the board okay um and the earlier work in
[00:55:57] board okay um and the earlier work in fact was using convolutional neural
[00:55:58] fact was using convolutional neural networks to take in essentially images
[00:56:01] networks to take in essentially images and features yeah is there a meaningful
[00:56:03] and features yeah is there a meaningful difference between nodes and States yes
[00:56:05] difference between nodes and States yes so that's a great question so in general
[00:56:07] so that's a great question so in general there may be a difference between nodes
[00:56:09] there may be a difference between nodes and States because um well this is I I'm
[00:56:12] and States because um well this is I I'm not a go expert so I don't know but in
[00:56:13] not a go expert so I don't know but in general for these type of algorithms you
[00:56:15] general for these type of algorithms you could reach the same state as at
[00:56:16] could reach the same state as at different parts of the tree and if if
[00:56:19] different parts of the tree and if if you can do that then you would have
[00:56:20] you can do that then you would have different bonuses there now I don't know
[00:56:23] different bonuses there now I don't know enough about go to know whether that's
[00:56:25] enough about go to know whether that's always possible and it's certainly
[00:56:26] always possible and it's certainly possible in some cases that you know it
[00:56:28] possible in some cases that you know it would be isomorphic that like the the
[00:56:29] would be isomorphic that like the the nodes and States would be identical but
[00:56:31] nodes and States would be identical but in general these sorts of algorithms can
[00:56:33] in general these sorts of algorithms can work in cases where you consider Imaging
[00:56:35] work in cases where you consider Imaging for like Checkers or chess and stuff you
[00:56:37] for like Checkers or chess and stuff you could end up in the same board game
[00:56:39] could end up in the same board game State later on but it would be a
[00:56:40] State later on but it would be a different part of the the
[00:56:43] different part of the the tree okay all right so this is just the
[00:56:46] tree okay all right so this is just the start this is just starting at the roote
[00:56:48] start this is just starting at the roote trying to figure out which action we're
[00:56:49] trying to figure out which action we're going to take from the root and then
[00:56:52] going to take from the root and then what we do is we repeatedly
[00:56:54] what we do is we repeatedly expand so in this case we would follow
[00:56:57] expand so in this case we would follow the right hand side now what we would do
[00:57:00] the right hand side now what we would do with this case which is pretty
[00:57:01] with this case which is pretty interesting is so this would
[00:57:02] interesting is so this would deterministically you know I put down
[00:57:04] deterministically you know I put down say a piece on the board in this
[00:57:07] say a piece on the board in this case I decided to put down this piece
[00:57:11] case I decided to put down this piece okay and then what I would do is I would
[00:57:13] okay and then what I would do is I would flip over and pretend to be the opponent
[00:57:16] flip over and pretend to be the opponent and it would do the same thing using its
[00:57:19] and it would do the same thing using its q and
[00:57:20] q and u okay now it's q and u are going to use
[00:57:23] u okay now it's q and u are going to use the same neural network approximation so
[00:57:25] the same neural network approximation so this is just selfplay but it's just
[00:57:28] this is just selfplay but it's just useful to know in this case that they're
[00:57:29] useful to know in this case that they're going to be optimizing for the opposite
[00:57:31] going to be optimizing for the opposite you know one's trying to optimize that
[00:57:33] you know one's trying to optimize that the black pieces are going to dominate
[00:57:34] the black pieces are going to dominate the other one is going to try to
[00:57:35] the other one is going to try to optimize so the white pieces dominate so
[00:57:37] optimize so the white pieces dominate so now we're going to have that the
[00:57:41] opponent
[00:57:44] opponent selects the
[00:57:46] selects the max q+
[00:57:48] max q+ U so just useful to think of like you
[00:57:51] U so just useful to think of like you know you're sort of repeatedly flipping
[00:57:52] know you're sort of repeatedly flipping back and forth between these two but
[00:57:54] back and forth between these two but you're using exactly the same Neal
[00:57:56] you're using exactly the same Neal Network parameters when you do
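The alternating descent, where both "players" apply the same max Q + U rule with shared network parameters, can be sketched like this. `select_action` and `is_leaf` are hypothetical helpers standing in for the selection rule and tree bookkeeping, not the lecture's actual code.

```python
def simulate_to_leaf(root, select_action, is_leaf):
    """One descent of the search tree during self-play: at every ply
    the side to move picks its action with the same rule and the same
    parameters; only the perspective alternates as we go one level
    deeper (opponent to move at the child)."""
    path, node = [], root
    while not is_leaf(node):
        a = select_action(node)       # same rule for both sides
        path.append((node, a))
        node = node["children"][a]    # now the other player is to move
    return path, node

# Tiny two-node tree to exercise the loop.
leaf = {"children": {}}
root = {"children": {0: leaf}}
path, end = simulate_to_leaf(root,
                             select_action=lambda n: 0,
                             is_leaf=lambda n: not n["children"])
```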
[00:57:58] Okay, so this is going to continue going all the way down until we hit a leaf node. And again, we haven't even selected a single action to take yet; all of this is going to help us finally take one real action in one game. Right now we're just doing a whole bunch of computation to figure out what that action is. And just to note again: we're assuming that we have access to this parameterized deep neural network, and whenever we do this expansion we are using our P function, because that's what goes into our upper confidence bound; our U is a function of P. It's a function of these probabilities, so we can weight some actions more than others.
[00:58:42] We keep going all the way down until we hit a leaf node, and at that point we plug in V(s): we use our value network to plug in V(s). This is different from what we saw before, because before we were thinking we'd actually get the rewards along our trajectory until we reached the final end, or, if we didn't have any rewards, we'd just get whether we thought we were in a winning or losing state at that point. We're not doing that anymore; we're plugging in an estimate of the value of the final state according to our value network. That also means we can either go all the way out until we win or lose a game, or we can terminate early: we can, say, after 700 steps, plug in our V(s), which gives us an estimate of how likely we were to win the game at that point.
[00:59:39] So once you have that, we propagate all of this back up. Once we go all the way down and get to some V, that value goes back up, and remember what that does: we update our Q function. Our Q was equal to (1 / N(i, a)) times the sum over all of our V(s'), so we update our values all the way back up. We used our P function when expanding out, to figure out which actions to take, along with our upper confidence bound, and then we use our V prediction to do the backups. So the way it works is: we go all the way out to a leaf node, then all the way back up along the ancestors to the root node, and then we do the whole thing again, many, many, many times. I'd have to remind myself, but I think it might be something like 160,000 times, for example, just to give you a sense of the scale. That means you're going to fill in parts of the tree, and then, after all of that, we have to decide what actually to do. So all of that is just to compute a tree to decide the current move.
[01:00:56] Okay, so we do this many, many, many times, and then at the end we decide what to do at our root node by the following, which again is a little bit different from what we've seen before. We compute a policy for the root node by figuring out which actions we mostly visited underneath it: we look at N(s, a), how many times we took each action from the root node, raised to the power one over tau,

π(a | s) ∝ N(s, a)^(1/τ)

(I think this should maybe be minus; let me just double-check. I guess it just depends how you set τ.) Here τ is just a temperature parameter. If τ, for example, were minus one in this case, then the policy would be proportional to 1/N(s, a). Or if τ is equal to one, then it would be N(s, a) / N(s): you take actions in proportion to their visit counts, divided by the total. As you increase or decrease τ, you get things closer to taking a max, or to just averaging. So this allows you to have a stochastic policy at the root node instead of necessarily just taking the argmax.
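The visit-count policy at the root, π(a | s) ∝ N(s, a)^(1/τ), is straightforward to sketch; the counts below are made up for illustration.

```python
def root_policy(counts, tau=1.0):
    """Turn root visit counts into a distribution:
    pi(a) proportional to N(s, a) ** (1 / tau).
    tau = 1 reproduces N(s, a) / N(s); small tau sharpens the
    distribution toward the most-visited action; large tau flattens
    it toward uniform."""
    weights = {a: n ** (1.0 / tau) for a, n in counts.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

counts = {0: 80, 1: 15, 2: 5}
pi = root_policy(counts, tau=1.0)       # exactly the visit frequencies
pi_cold = root_policy(counts, tau=0.5)  # squares the counts: closer to argmax
```

This matches the max-versus-averaging trade-off described above: shrinking τ concentrates mass on the most-visited action, while growing it spreads mass out, and sampling from the result gives the stochastic root policy.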
[01:02:38] ARX so this is quite interesting so this
[01:02:40] ARX so this is quite interesting so this is what they're going to end up doing
[01:02:41] is what they're going to end up doing after you do all of this you're going to
[01:02:43] after you do all of this you're going to actually take an action and then you're
[01:02:44] actually take an action and then you're going to so that gives you um a policy
[01:02:47] going to so that gives you um a policy and then you are going to sample from
[01:02:48] and then you are going to sample from that policy to actually make a
[01:02:50] that policy to actually make a decision okay so this is how a game
[01:02:52] Okay, so this is how a game works: you do an enormous amount of computation, and at the end you get this policy according to the number of times you've taken each action from the root node. Then you sample from that policy and you reach a new state; so let's say you put down that stone on the board, then the opponent does exactly the same thing and they put something down, and you repeat this all the way out until the game ends.
[01:03:14] Now, even if you're DeepMind you care about computation, and so in some cases they will truncate games if they think there's definitely going to be one outcome or the other. But in general you would just keep playing this all the way out, and Z here would be who won or lost the game.
[01:03:35] [Student] You said they will truncate games, but how... do they actually sit behind the computer and watch the games being played? [Instructor] No, no, it's absolutely all automated. This is going on billions of times, and what they will do, I think it's after 700 moves, if they're not sure, if they think either it's going to end in a draw or it's definitely going to be a loss, and then they try to bound things like false positives. It was interesting to me that they included that; it just indicates that it probably saved them a substantial amount of computation time. But yeah, everything is totally automated.
[01:04:09] Okay, so what they do now, at this point... so this is like a single game, and as you can imagine this is an enormous amount of computation for a single game, but a lot of it can be parallelized. We're now going to train our neural networks. Remember that we used a neural network to both give us an estimate of the probabilities, that is, a policy for each state, as well as a value. And from that one game we have one observation: this is our Z, you know, who won the game.
[01:04:40] And from each step we have these policies that we computed, and we're going to use those as targets to train a neural network. So we go back and we say: okay, well, at that time when you were in state s you computed a policy, and eventually you got a value of Z, you either won or lost the game. We are now going to train our crazy big deep neural network to predict, for this state, this is the policy, and for this state, that is the value. And this is just a supervised learning problem. Then they do the same thing for every state that was reached in that particular game, all using the same final outcome, which is either you won or you lost.
[01:05:25] lost okay and so this is just an enormous
[01:05:27] enormous Network I can't remember I think it's
[01:05:29] Network I can't remember I think it's maybe like let's say 40 layers and um
[01:05:32] maybe like let's say 40 layers and um they try and we'll see shortly like the
[01:05:33] they try and we'll see shortly like the influence of architecture too the
[01:05:35] influence of architecture too the architecture
[01:05:38] And again, this neural network goes directly from states to both predictions: it's got two output heads, predicting both policies and values. In their earlier work they had separate neural networks, one for the policy and one for the values; here they just combined it. All right, so that is how it works in a nutshell, in terms of what they're doing, and then they do this for an absolutely enormous amount of time; the final thing, I think, was trained for 40 days over many TPUs, etc. Yeah?
[01:06:08] over like many tpus Etc yeah does this mean like if you think about kind of
[01:06:10] mean like if you think about kind of like a loss function with respect to the
[01:06:13] like a loss function with respect to the value the policy is not actually a
[01:06:15] value the policy is not actually a component of that loss function is that
[01:06:17] component of that loss function is that yeah it's a great point so what is a
[01:06:19] yeah it's a great point so what is a really good point so these are just two
[01:06:21] really good point so these are just two different heads um and you can think of
[01:06:25] different heads um and you can think of it as what sort of assuming in this case
[01:06:26] it as what sort of assuming in this case is that the representation you're
[01:06:27] is that the representation you're learning is going to be helpful for both
[01:06:30] learning is going to be helpful for both but this
[01:06:31] but this value may or may not relate to this
[01:06:34] value may or may not relate to this policy and this is just saying like we
[01:06:37] policy and this is just saying like we think that like the features we're going
[01:06:38] think that like the features we're going to learn about this like that's a way
[01:06:40] to learn about this like that's a way that we're encoding um the game States
[01:06:42] that we're encoding um the game States and also just to note here it's not just
[01:06:44] and also just to note here it's not just the current board that they're using the
[01:06:46] the current board that they're using the states they use tend to use history as
[01:06:48] states they use tend to use history as well because again I'm not an expert in
[01:06:51] well because again I'm not an expert in go but there are various rules in go
[01:06:52] go but there are various rules in go which mean like I think you can't repeat
[01:06:53] which mean like I think you can't repeat a move and stuff so because of that they
[01:06:56] a move and stuff so because of that they have to maintain a short um history of
[01:06:58] have to maintain a short um history of the previous game States so you can
[01:07:01] the previous game States so you can think of s really as being like multiple
[01:07:03] think of s really as being like multiple games game board states of the past and
[01:07:06] games game board states of the past and I think the intuition for this is that
[01:07:08] I think the intuition for this is that you're going to learn feature
[01:07:09] you're going to learn feature representations from that they're going
[01:07:11] representations from that they're going to be helpful for predicting both of
[01:07:12] to be helpful for predicting both of these now ultimately you would hope that
[01:07:14] these now ultimately you would hope that this sort of there is some relationship
[01:07:16] this sort of there is some relationship between these two but they're not
[01:07:17] between these two but they're not constraining
[01:07:23] So, just to recap, what are the key features they're using? I guess also to specify, in this case they're going to do this across many TPUs over many, many days, and what they're doing is constantly retraining these neural networks. And at the end of all of this, when they actually play test games, say against human players or against other AI agents, they're still going to do the Monte Carlo tree search: they're going to take their final neural networks and then still run the Monte Carlo tree search method that we've just seen before they make decisions.
[01:08:06] And we'll see in a second whether that's important or not. In particular, some of the important questions they consider in this paper are: What is the influence of architecture, does it matter which architecture you use in these cases? What is the impact of using MCTS? Obviously they're still learning a policy and learning a value function, and the question is how much additional gain you get, even after 40 days of training, by doing Monte Carlo tree search. And how does it compare to human play, or to using human players?
[01:08:40] players so the first way that they did this is they instead of having this
[01:08:42] this is they instead of having this neural network that was predicting a
[01:08:43] neural network that was predicting a policy and a value is they'd actually
[01:08:46] policy and a value is they'd actually did supervised learning on human play
[01:08:48] did supervised learning on human play and that gave you a way to prioritize
[01:08:51] and that gave you a way to prioritize actions so that's what they done when
[01:08:53] actions so that's what they done when they did Alpha go to start and I think
[01:08:54] they did Alpha go to start and I think that's what they did also for when they
[01:08:56] that's what they did also for when they won Le against leasy doll and then what
[01:08:58] won Le against leasy doll and then what they've been trying to do in this paper
[01:08:59] they've been trying to do in this paper and um others is to kind of remove some
[01:09:02] and um others is to kind of remove some of those assumptions to see if you could
[01:09:03] of those assumptions to see if you could learn without even human knowledge now
[01:09:06] learn without even human knowledge now here I'll just specify that you know
[01:09:08] here I'll just specify that you know they they
[01:09:09] they they still know the game
[01:09:12] still know the game rules and then they have later paper
[01:09:15] rules and then they have later paper where they want to not even need that
[01:09:17] where they want to not even need that but here are the
[01:09:19] but here are the algorithms so the first thing to note is
[01:09:21] algorithms so the first thing to note is that um hire is better this is talking
[01:09:23] that um hire is better this is talking about the performance of um the
[01:09:26] about the performance of um the resulting approach under different
[01:09:30] resulting approach under different architectures okay so what they do is
[01:09:32] architectures okay so what they do is they actually have the same training
[01:09:33] they actually have the same training data that they use and they just use
[01:09:35] data that they use and they just use different architectures they use data in
[01:09:37] different architectures they use data in this case from um one of some of the
[01:09:40] this case from um one of some of the runs of alpha Zer which is the algorithm
[01:09:41] runs of alpha Zer which is the algorithm we've been talking about so all of these
[01:09:43] we've been talking about so all of these have the same data and then they look at
[01:09:45] have the same data and then they look at what the performance is if you train um
[01:09:47] what the performance is if you train um the neural networks with that data so
[01:09:49] the neural networks with that data so same data just differences architecture
[01:09:52] same data just differences architecture and there is a huge difference okay this
[01:09:54] and there is a huge difference okay this is like from 3,000 to
[01:09:56] is like from 3,000 to 4,500 so they're current on so this is a
[01:09:59] 4,500 so they're current on so this is a com a convolutional neural network which
[01:10:00] com a convolutional neural network which is separate meaning that you have a
[01:10:02] is separate meaning that you have a different policy network from um a value
[01:10:04] different policy network from um a value Network whereas this is a reset and
[01:10:07] Network whereas this is a reset and they're using a dual representation so
[01:10:09] they're using a dual representation so you can see that you get a significant
[01:10:11] you can see that you get a significant benefit by leveraging representational
[01:10:14] benefit by leveraging representational strength across both of these
[01:10:16] strength across both of these targets and also that this is better
[01:10:19] targets and also that this is better than using convolutional neural networks
[01:10:21] than using convolutional neural networks so I mean I think this is a good
[01:10:22] so I mean I think this is a good reminder that like when we're doing
[01:10:23] reminder that like when we're doing reinforcement learning we're doing
[01:10:25] reinforcement learning we're doing decision decision making we still want
[01:10:26] decision decision making we still want to build on all the amazing advances
[01:10:28] to build on all the amazing advances that are happening in deep learning in
[01:10:29] that are happening in deep learning in general and the the complexity of the
[01:10:31] general and the the complexity of the neural networks that we use and the
[01:10:33] neural networks that we use and the functions they can represent really
[01:10:34] functions they can represent really matters okay so that's sort of the take
[01:10:36] matters okay so that's sort of the take home from this part this is a huge
[01:10:38] home from this part this is a huge difference in
[01:10:40] difference in performance um the second is the impact
[01:10:42] performance um the second is the impact of Monte Carlo tree search so I think
[01:10:43] of Monte Carlo tree search so I think this is important to know this is if you
[01:10:45] this is important to know this is if you use the raw Network so you take the
[01:10:47] use the raw Network so you take the network I think this is after those 40
[01:10:49] network I think this is after those 40 days of these crazy numbers of tpus and
[01:10:51] days of these crazy numbers of tpus and you don't do Monte Carlo research on top
[01:10:53] you don't do Monte Carlo research on top in kind of your evaluation games
[01:10:55] in kind of your evaluation games and again this is much much much worse
[01:10:58] and again this is much much much worse okay so this is alphao zero the
[01:10:59] okay so this is alphao zero the algorithm we've been talking about
[01:11:01] algorithm we've been talking about alphao master was another one they
[01:11:02] alphao master was another one they developed shortly before this this is
[01:11:04] developed shortly before this this is the one that beat leasy doll um alphago
[01:11:06] the one that beat leasy doll um alphago what they call fan is sort of the first
[01:11:08] what they call fan is sort of the first big alphao paper um and these are some
[01:11:10] big alphao paper um and these are some of the other approaches that happened
[01:11:12] of the other approaches that happened before their methods and again you can
[01:11:14] before their methods and again you can see that even though they now have all
[01:11:16] see that even though they now have all this beautiful different architecture
[01:11:18] this beautiful different architecture Etc um if you don't do monol research on
[01:11:20] Etc um if you don't do monol research on top of that you miss a lot so it really
[01:11:23] top of that you miss a lot so it really is important to do this last mile of
[01:11:25] is important to do this last mile of additional computation even after you
[01:11:27] additional computation even after you have these really really good neural
[01:11:28] have these really really good neural networks this kind of local computation
[01:11:31] networks this kind of local computation matters um this gives you a sense of
[01:11:34] matters um this gives you a sense of sort of the training times involved so
[01:11:36] sort of the training times involved so this is the lisad do paper or ladol
[01:11:38] this is the lisad do paper or ladol method um I don't think they published
[01:11:40] method um I don't think they published this before this is sort of one of their
[01:11:42] this before this is sort of one of their Master methods they had and this is
[01:11:44] Master methods they had and this is showing for a particular size approach
[01:11:47] showing for a particular size approach how long it took of training before you
[01:11:51] how long it took of training before you got something that exceeded all of those
[01:11:53] got something that exceeded all of those so it gets there but it also just
[01:11:56] so it gets there but it also just highlights of the enormous amount of
[01:11:57] highlights of the enormous amount of computation needed and um the importance
[01:12:00] computation needed and um the importance of of uh the architecture so I know
[01:12:03] of of uh the architecture so I know we're almost out of time but I just want
[01:12:04] we're almost out of time but I just want to highlight two things so again in this
[01:12:06] to highlight two things so again in this case it didn't need any human data no
[01:12:08] case it didn't need any human data no supervised learning and they noted
[01:12:10] supervised learning and they noted though that it was less good at
[01:12:11] though that it was less good at predicting human play than some of the
[01:12:14] predicting human play than some of the other prior methods so that again just
[01:12:16] other prior methods so that again just highlights that these methods really are
[01:12:18] highlights that these methods really are helping um agents to discover strategies
[01:12:21] helping um agents to discover strategies that are not necessarily the ones that
[01:12:22] that are not necessarily the ones that are used by humans they're just going
[01:12:25] are used by humans they're just going different ways of solving these sort of
[01:12:27] different ways of solving these sort of incredibly complex optimization tasks
[01:12:30] incredibly complex optimization tasks and I think that's really interesting in
[01:12:31] and I think that's really interesting in terms of sort of the future of human AI
[01:12:34] terms of sort of the future of human AI collaboration we're almost out of time
[01:12:36] We're almost out of time for today, but I'll just highlight as well that these ideas of how to use RL to optimize computation and solve really, really big search problems have also been used by DeepMind for things like AlphaTensor and other ways of starting to automatically search for new algorithms, which I think is really exciting, because you can think of the space of algorithms, or the space of different search algorithms, and those spaces are enormous, so you could imagine using these types of strategies to prioritize which things are most effective.
[01:13:07] All right, I'll leave this here because we're out of time. You're welcome to look at this to think a little bit more about the aspects of UCT search, and then on Wednesday we're going to think more about rewards in RL and what the implications are of the ones we choose. I'll see you then.
Lecture 015
Stanford CS234 Reinforcement Learning I Emma Brunskill & Dan Webber I 2024 I Lecture 15
Source: https://www.youtube.com/watch?v=FOlPpjNbHjE
---
Transcript
[00:00:05] Hey everybody, we're going to go ahead and get started, and we'll start with a refresh-your-understanding, thinking back to DPO and RLHF.
[00:00:50] All right, why don't you turn to someone near you and see if you got the same answer, particularly for the third and fourth ones. Pair up.
[00:01:44] All right, so let's come back together. The first one is true: the DPO model does assume that we have a particular model of how people respond to preferences, in particular the Bradley-Terry model. The second one is also true: even though we've been thinking a lot about when we actually have preferences, we can also use this in cases where someone just directly provided you reward labels, and so RLHF as a paradigm is totally compatible with the idea of just getting rewards in some way, but normally when we think about that human feedback it comes from preferences. The third one is an interesting one: does somebody want to argue why they think that is not a good way to learn about the reward model for board games?
[00:02:31] [Student] There are multiple optimal points... [Instructor] Yeah, I think that could be one; I was thinking of something even simpler. Does anybody else want to add why you might not want to do this for board games? [Student] It might be really hard to compare, like in a game like chess where the reward is at the end, and with self-play... [Instructor] Yeah, and also because we normally know what the reward model actually is in games, and so if we know that at the very end we can say this is a plus one or this is a minus one, there is no reason necessarily to look at two game states and ask a human to try to judge which of the two is better. We know the ground-truth reward, so it's probably better just to use it directly, and it may also be that those pairwise rankings for intermediate game states would not be very reliable. And then the last one is also true: DPO and RLHF can both be used with extremely large policy networks.
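The Bradley-Terry model mentioned in the first answer says the probability that one response is preferred over another is a logistic function of the difference in their underlying rewards. A minimal sketch (the reward values in any call are made up for illustration):

```python
import math

def bradley_terry_pref(r_a, r_b):
    """Bradley-Terry preference model, as assumed by DPO:
    P(a preferred over b) = sigmoid(r(a) - r(b)).
    Equal rewards give a 50/50 preference, and the higher-reward
    item is preferred more often as the reward gap grows."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))
```

Note the probabilities of the two orderings always sum to one, which is what lets preference data pin down reward differences (but not absolute reward levels).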
[00:03:28] All right, so where are we? Last time we talked a lot about Monte Carlo tree search and about AlphaZero, and today we'll talk briefly to finish up that part; quite briefly, I want to clarify a couple of things that I mentioned last time. For almost the rest of the day we're going to have a guest lecture by Dan Webber, which is a way to introduce the last part of our course: thinking about where the rewards come from, in terms of how we make judgments about which rewards we might want to prefer or not. We'll talk about that today, and after the quiz as well. But before we do that, I just want to do a little bit more on Monte Carlo tree search, so let's see if we can get take two of the video first.
[00:04:15] So I think that sort of captures it. It's nice; we don't normally get documentaries made about the work that happens in artificial intelligence, at least not yet, but I think it's a pretty powerful exhibit of why people were so excited about this result, and of the implications it has when computers can exceed the best performance in the world at something. We've seen examples of this in the past: for those of you who heard about it, there was the IBM Watson Jeopardy case a number of years ago, and I remember being in the audience when that was happening. Many, many people watched it in different watching parties at the time, and I was in one of those. It was a similar moment in AI, thinking about what levels we're going to be able to achieve in AI and what the implications of that are for human expertise and human excellence.

[00:05:06] So of course DeepMind did win the game against Lee Sedol. Now let's go back and think a bit about what Monte Carlo tree search and AlphaZero are doing. This is another refresh-your-understanding, and I'm doing two of these today just to give you examples of the types of questions you might also see on the quiz. So we'll do another one of these, and then I'm going to clarify a couple of points about AlphaZero from last time.
[00:06:46] Why don't you find somebody near you and compare your answers.
[00:08:43] Okay, good, I'm hearing a lot of discussion about this, which is good. So the first one is true: Monte Carlo tree search does approximate a forward search tree.

[00:08:53] The second one is false, and I know this is a little bit subtle. Monte Carlo tree search tries to approximate the forward search tree, but as you might remember, the forward search tree can scale exponentially with the number of states and the number of actions, because you're expanding by both of those at each level. What Monte Carlo tree search does is use its dynamics model to sample a next state, so you don't have to enumerate all the possible next states. It uses sampling to help with the state branching factor, but that doesn't tell you what to do about the action branching factor. One thing you can do is use upper confidence trees, and then that tells you how to use a bandit to figure out which action to select next. In AlphaZero we see that even that is likely not to be sufficient when you have an enormous branching factor, and so you may need some sort of additional weight, like a policy, to select among those actions.
[00:09:47] The third one is also false. This was true in the original AlphaGo, and I think in the Lee Sedol version too that you saw in the video: they did have two networks. But in AlphaZero they have just a single network, and it outputs both a policy and a value, so it has two output heads.

[00:10:06] The next one is true. Kind of amazingly, even if you spend 40 days and many TPUs to learn a policy output and a value output in the network, at test time, say if you're playing Lee Sedol, they still do additional guided Monte Carlo tree search, and it makes a big difference. I think it was something like going from an Elo score of around 3,000 to 4,500 or 5,000 (I may be getting the numbers wrong), but it was a huge gain from doing a little bit of extra local computation.

[00:10:43] And the next one is also true: self-play provides a form of implicit curriculum learning, because the agent is always essentially playing an opponent that's very similar to its own level (in fact it's playing itself, so it's exactly its level), and that means the density of reward it gets is going to be much higher than it would get if it were playing an opponent that was much stronger or much weaker.
[00:11:08] Okay, the other thing I wanted to clarify: when we talked before about selecting a move in a single game, we talked about how the algorithm maintains both a Q(s,a) estimate for a certain node and this upper bound term, which is proportional to the policy function it gets from the neural network divided by 1 plus the number of samples. I mentioned in class that I thought all of the s's here are just the nodes, but s is a little bit weird as notation, because you could imagine it being either the state space or the node, that part of the tree search. Looking back on it, it is actually the node, so I wanted to clarify what I said last time and make sure that was clear. They're thinking of each of these points as a particular (s, a). But in theory (and I'm not a Go expert, so I'm not sure how often this happens) you could end up at the same game state lower down in the tree, and you would maintain totally different statistics down there; you're not sharing across those. There's a simplicity to doing it that way: it can help in terms of the architectures you need to develop, and it simplifies some of the storage. So these statistics are kept per node.
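As a sketch of the per-(node, action) score being described, here is a small hypothetical Python snippet: Q(s,a) plus an exploration bonus built from the network's policy prior divided by 1 + N(s,a). The scaling by the square root of the parent's total visit count follows the published PUCT rule; treat the exact constant and function signature as my assumptions for illustration.

```python
import math

def puct_score(q_sa, prior_sa, n_sa, n_parent, c_puct=1.0):
    """Selection score for one (node, action) pair.

    q_sa:     estimated value Q(s,a) stored on this tree node
    prior_sa: policy-network probability for this action
    n_sa:     visit count N(s,a) for this (node, action)
    n_parent: total visits through the parent node
    """
    # Bonus ~ prior / (1 + N(s,a)): heavily sampled actions get less bonus.
    u = c_puct * prior_sa * math.sqrt(n_parent) / (1 + n_sa)
    return q_sa + u
```

Because these statistics live on tree nodes rather than on board positions, two nodes that happen to reach the same position keep separate Q and N values, which is the storage simplification just described.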
[00:12:39] And the other thing I just wanted to clarify is that when you get back to the root later and you're making a decision over which of these actions to take, I mentioned that at the very end they do something proportional to N(s,a) at the root raised to the power 1/tau, and this is going to be the policy at the root. I just wanted to be clear about what that does: it prioritizes actions that you've taken more, that you've explored more, in your tree search. Let's see a little bit of what this would look like. If tau is equal to 1, your probability of taking action a from the root is equal to the number of times you've taken action a at the root, divided by the sum over actions a' of the times you've taken them, which is just the total number of rollouts you've done from the root. So in that case it's strictly proportional. If you have a tau less than one, that means you're going to upweight some of these counts: say tau is equal to 0.5, then you would have N(root, a) squared divided by the sum over a' of N(root, a') squared. What that means is that as you make tau go closer and closer to zero, this basically becomes a winner-takes-all approach, and you'll select whichever action you took most from the root; as tau goes toward one, it's spread proportionally across all the times you've taken each of the actions. As you might imagine, that's going to have different implications for how exploratory you are. Note that none of these is based on what the value is at the root; all of them are based just on how much of the time you've explored different parts of the tree.
[00:14:33] So I just wanted to make sure to clarify those cases. Does anybody have any other questions about AlphaZero? I do just want to say, as I mentioned before, that there are a number of other derivatives that have come since this. There's MuZero, which doesn't even need to know the rules of the game, and there are also a lot of sophisticated approaches for hidden-information games like poker. In Go there's full information: you know exactly where all the white stones are, you know exactly where all the black stones are, there's no hidden information that either player has, and there are only two players. But there's been a lot of work on cases like poker and others, where there are some cards that one agent doesn't see, and then how do you play optimally in those games as well.
[00:15:18] Any questions before we move on to our guest lecture? Yeah?

[00:15:22] [Student] With these other models, specifically MuZero, that doesn't make use of the rules of the game: do we just observe that even though it doesn't know the rules it learns just as well, or does it do better without knowing the rules? What are the consequences of doing that?

[00:15:34] Yeah, that's a great question. I'd have to go back to the paper and remember the exact results, but it certainly can do just as well. You still have to give it some feedback, so it has to know whether or not it won, but it doesn't have to know all the individual rules; I just don't remember how much additional data you needed in that case. As you might remember from last time, we saw that there is a really substantial impact of architecture, so depending on the architectures you're using, whether a convolutional neural net or some other type of network, you can see a massive difference in the amount of data you need and the quality of the result. I think that's something to keep in mind when we think about removing information like what the rules are: you could imagine that if you do that but then have some other innovations in terms of the architecture, you might need only the same amount of data, or even less, than what was needed here. Generally they don't do full ablations over all the combinations of ways these systems can be specified. It's a good question, but that's certainly what the work suggests, and they've also extended this to other games, things like chess and others, just to show that you can use very similar techniques to conquer those games as well.
[00:16:43] All right, with that, let's switch over to Dan. I'm really delighted to have Dan talking today. I'll keep this slide up until he comes up. He is a postdoctoral fellow here at Stanford; he'll introduce his own background a little bit more, but he has a lot of expertise in thinking about different frameworks for how we think about rewards, and what the implications of the different ways we define them are for the types of systems we might subsequently develop.

[00:17:12] [Applause]

[00:17:20] Please, please hold your applause until you see how it actually goes.
[00:17:26] All right, I'm going to need a second to get this hooked up. While I do that, I should note that I am going to ask you at various points to talk to some of the folks next to you, so if you're not in a good position to do that, now might be a time to move to get yourself into such a position. Okay, is that good? Is that too loud? Just right? Love it. Okay, and we go to 2:50? Yes, 2:50. Perfect.
[00:18:11] Great, okay. So yeah, I'm Dan Weber, here today to talk to you about value alignment. But before we do that, maybe it's worth saying a little bit about who this guy is, and why you should listen to him or care at all about what he has to say; he's not the professor.

[00:18:33] So as Emma mentioned, I am a postdoc here at Stanford in HAI and EIS, that is, the Institute for Human-Centered Artificial Intelligence and the Center for Ethics in Society. If you've taken a lot of CS classes at Stanford, you've probably seen somebody who has my job at some point or other; a big part of my job is embedding ethics into computer science courses like this one.

[00:19:03] Before I came here to Stanford, I got my PhD in philosophy at the University of Pittsburgh, where I wrote my dissertation on moral theory, which basically means trying really hard, maybe too hard, to think systematically about value, which is what brings me to you today. Before that even, I got my bachelor's in computer science at Amherst and did a couple of years in software development after that. So I'm not completely new to CS; I know this world a little bit. I did take an introductory course on AI, but that was 10 years ago, and I think the field has changed immensely since then; I don't even think we covered reinforcement learning at all. So you all are going to know the reinforcement learning way better than I do, and I'm not here to be an expert about that. What I am hoping to do is give you a bit of a window into how to think about value, and how it might be more complicated than you think. We're not going to solve any deep problems about value in the next hour, and we're not going to be able to go very in depth on a lot of this stuff; if you're interested in that, I recommend courses in the philosophy department. But I'll try to give you a quick lay-of-the-land sense of the range of possibilities when we're talking about value and value alignment.
[00:20:39] talking about value and value alignment um so okay might help to start with an
[00:20:44] um so okay might help to start with an example of value alignment or uh maybe
[00:20:47] example of value alignment or uh maybe more accurately an example of value
[00:20:49] more accurately an example of value misalignment uh one of the classic
[00:20:52] misalignment uh one of the classic examples in this literature is uh
[00:20:55] examples in this literature is uh paperclip AI uh but this example from uh
[00:21:01] paperclip AI uh but this example from uh Nick Bostrom in 2016 maybe you're all
[00:21:03] Nick Bostrom in 2016 maybe you're all used to this uh in reinforcement
[00:21:05] used to this uh in reinforcement learning but tells you something about
[00:21:07] learning but tells you something about the state of this literature that a
[00:21:08] the state of this literature that a classic example could be from
[00:21:11] classic example could be from 2016 um so Boston describes an AI
[00:21:14] 2016 um so Boston describes an AI designed to manage production in a
[00:21:16] designed to manage production in a factory uh which is given the final goal
[00:21:19] factory uh which is given the final goal of ma uh maximizing the manufacturer of
[00:21:22] of ma uh maximizing the manufacturer of paper clips uh do anyone have an idea
[00:21:25] paper clips uh do anyone have an idea maybe of how this example continues
[00:21:27] maybe of how this example continues maybe you've seen it before
[00:21:29] maybe you've seen it before anyone know this one no okay well uh in
[00:21:34] anyone know this one no okay well uh in in bostrom's example at least uh this AI
[00:21:37] in bostrom's example at least uh this AI proceeds by first converting the Earth
[00:21:39] proceeds by first converting the Earth and then increasingly large chunks of
[00:21:40] and then increasingly large chunks of the observable universe into paper
[00:21:43] the observable universe into paper clips uh okay now bostrum is thinking in
[00:21:47] clips uh okay now bostrum is thinking in particular about super intelligent AI
[00:21:50] particular about super intelligent AI that's what his uh book is about so he's
[00:21:53] that's what his uh book is about so he's got the destruction of the entire
[00:21:54] got the destruction of the entire universe in view um but even a less
[00:21:58] universe in view um but even a less powerful AI system uh might pursue a
[00:22:02] powerful AI system uh might pursue a simple goal like this in surprising ways
[00:22:05] simple goal like this in surprising ways does
[00:22:06] does anybody maybe have a more a more mundane
[00:22:09] anybody maybe have a more a more mundane example of what what could go
[00:22:11] example of what what could go wrong uh if an AI system were say in
[00:22:14] wrong uh if an AI system were say in charge of a a paperclip Factory given no
[00:22:17] charge of a a paperclip Factory given no further instruction than to maximize the
[00:22:20] further instruction than to maximize the production of paper clips yeah be schedu
[00:22:23] production of paper clips yeah be schedu for people for a lot of shifts like
[00:22:25] for people for a lot of shifts like through the night and then fire workers
[00:22:27] through the night and then fire workers to complain higher than you world yeah
[00:22:30] to complain higher than you world yeah good good right yeah we could maximize
[00:22:32] good good right yeah we could maximize production if only we trapped people
[00:22:34] production if only we trapped people inside the building and made them work
[00:22:35] inside the building and made them work around the clock right uh excellent any
[00:22:38] around the clock right uh excellent any other yeah doesn't about the quality of
[00:22:42] other yeah doesn't about the quality of theer right so they could all be really
[00:22:44] theer right so they could all be really bad good yes exactly it might be that
[00:22:47] bad good yes exactly it might be that the the easiest way to maximize the
[00:22:50] the the easiest way to maximize the number of paper clips I produce is to
[00:22:51] number of paper clips I produce is to produce really terrible paper clips
[00:22:53] produce really terrible paper clips right that's not really what I was
[00:22:55] right that's not really what I was looking for probably uh great thank you
[00:22:58] looking for probably uh great thank you anyone
[00:23:01] else yeah I mean if the price of like
[00:23:04] else yeah I mean if the price of like electricity changes like at different
[00:23:05] electricity changes like at different times a day it could be like trying to
[00:23:07] times a day it could be like trying to make paper clips but just like
[00:23:09] make paper clips but just like economically in efficiently yeah good
[00:23:12] economically in efficiently yeah good right so it's maximized the number of
[00:23:13] right so it's maximized the number of paper clips but there's no there's no
[00:23:15] paper clips but there's no there's no sense of sort of other goals that you
[00:23:18] sense of sort of other goals that you might also want to pursue here like like
[00:23:21] might also want to pursue here like like efficiency or you know minimizing the
[00:23:23] efficiency or you know minimizing the amount of electricity you use or
[00:23:25] amount of electricity you use or anything like that great yeah or you
[00:23:27] anything like that great yeah or you could imagine you know uh I mean depends
[00:23:29] could imagine you know uh I mean depends what levers the AI has to pull right but
[00:23:31] what levers the AI has to pull right but you could imagine it recycling the
[00:23:33] you could imagine it recycling the Factor's Plumbing for raw materials
[00:23:35] Factor's Plumbing for raw materials right or locking out humans who who
[00:23:38] right or locking out humans who who could interrupt its process right uh
[00:23:41] could interrupt its process right uh something like that
[00:23:42] So, great. So in general we might say the problem of value alignment is this problem of: how do we design AI agents that will do what we really want them to do, where what we really want is usually a lot more nuanced than what we say we want, right? Humans work with a lot of background assumptions, and these assumptions can be hard to formalize, easy to take for granted. Right, if I told you, as the manager of the factory, to maximize the production of paper clips, you would realize that you should do that consistent with existing labor laws, or that you should make paper clips that actually work, or that you should be on the lookout for keeping your costs down, things like that.
[00:24:38] Um, but because these can be hard to formalize, they're easy for us to forget about. It's hard to solve this problem just by giving better instructions to AI agents. And here, I mean, if anybody wants to give it a try: how would you solve this problem, maybe, just by trying to give a better instruction to the AI? Anybody have what they think might be an improvement on just "maximize paperclip production"?
[00:25:17] Yeah, good. Yeah, so, specifying that you want paper clips of a certain quality, and giving a sample of what that looks like. Good, that would help. That could help address this problem, potentially, of "can you maximize production just by making worse paper clips?" It might not go far enough to say, "and by the way, you shouldn't work the factory workers around the clock," but great start. Yeah?
[00:25:47] [Student] Maximize the long-run profits of the factory.
[00:25:51] Good, good. So, yeah, giving a broader goal, right: I want to maximize the production of paper clips, but that's something I want probably because I want to maximize the profit that the factory generates. Good. Is that going to be enough to avoid all of the problems that we've seen come up?
[00:26:14] [Student] I mean, yeah, most of them, right? Like, you need high-quality paper clips; you can't turn the universe into paper clips [or profits will be zero]; you can't be using too much electricity, or, like, doing things economically inefficiently, because it won't be profitable. I mean, the labor laws are probably the thing that you'd still be violating.
[00:26:32] Yeah, right. I mean, if there are enough people willing to work in this factory, maybe we were able to keep a lid on how poorly we treat people; we could get away with maximizing profit while still, um. But good, okay, so that's getting us some of the way there, but still there's a worry about, yeah, about essentially treating people well.
[00:27:00] well okay so uh I mean we could keep we could keep doing this all day but
[00:27:01] could keep doing this all day but hopefully this is a little bit of an
[00:27:04] hopefully this is a little bit of an illustration uh you know even trying to
[00:27:07] illustration uh you know even trying to think of better instructions you might
[00:27:08] think of better instructions you might just realize oh there's another thing I
[00:27:10] just realize oh there's another thing I forgot there's another thing I forgot um
[00:27:13] forgot there's another thing I forgot um I mean you can
[00:27:15] I mean you can compare this maybe to the difficulty in
[00:27:18] compare this maybe to the difficulty in manually specifying reward functions I
[00:27:19] manually specifying reward functions I mean in some sense this is the same
[00:27:22] mean in some sense this is the same problem right is uh okay I think I I
[00:27:25] problem right is uh okay I think I I think I know what the thing is that uh
[00:27:27] think I know what the thing is that uh that I want
[00:27:29] that I want okay it turns out to be much more
[00:27:30] okay it turns out to be much more complicated than that much harder to
[00:27:31] complicated than that much harder to specify um
[00:27:34] specify um especially uh when you're thinking about
[00:27:37] especially uh when you're thinking about making a system that's going to take
[00:27:38] making a system that's going to take instructions from users maybe who are
[00:27:41] instructions from users maybe who are not experts in reinforcement learning
[00:27:44] not experts in reinforcement learning right uh folks in this room are going to
[00:27:46] right uh folks in this room are going to be relatively good at seeing foreseeing
[00:27:49] be relatively good at seeing foreseeing these kinds of problems with giving
[00:27:51] these kinds of problems with giving incomplete instructions uh if you're
[00:27:54] incomplete instructions uh if you're designing a system that's supposed to
[00:27:55] designing a system that's supposed to take instructions from non-expert users
[00:27:58] take instructions from non-expert users uh they might not be so good if you're
[00:27:59] uh they might not be so good if you're seeing these these
[00:28:01] seeing these these issues
[00:28:03] issues um
[00:28:06] um okay maybe any I should I should say any
[00:28:09] okay maybe any I should I should say any questions now and in general going
[00:28:11] questions now and in general going forward I mean if anybody has any
[00:28:12] forward I mean if anybody has any questions at any time don't hesitate to
[00:28:16] questions at any time don't hesitate to raise your
[00:28:18] raise your hand
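The comparison to hand-specified reward functions can be made concrete with a small sketch. Everything below (the reward terms, penalty weight, and factory state variables) is hypothetical, invented for illustration rather than taken from the lecture; the point is only the pattern: a naive reward gets gamed, and each patch leaves further exploits.

```python
# Hypothetical sketch of reward misspecification: each "fix" to a
# hand-written reward still leaves exploits. All numbers are invented.

def naive_reward(state):
    # "Maximize paperclip production": counts clips and nothing else.
    return state["clips_made"]

def patched_reward(state):
    # Patch 1: weight clips by quality, so worthless clips barely count.
    # Patch 2: penalize overworking the (hypothetical) workforce.
    reward = state["clips_made"] * state["avg_quality"]
    if state["hours_worked"] > 8:
        reward -= 10.0 * (state["hours_worked"] - 8)
    return reward

# A policy that games the naive reward: huge volume, terrible quality,
# workers kept on the line around the clock.
exploit = {"clips_made": 1000, "avg_quality": 0.01, "hours_worked": 16}
honest = {"clips_made": 300, "avg_quality": 0.95, "hours_worked": 8}

assert naive_reward(exploit) > naive_reward(honest)      # exploit wins
assert patched_reward(honest) > patched_reward(exploit)  # patch flips it
# ...but patched_reward still says nothing about electricity costs,
# plumbing, or locked doors: the list of constraints never ends.
```

The particular penalty terms don't matter; what matters is that every patch is reactive, with the designer enumerating one by one the constraints a human manager would have taken for granted.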
[00:28:20] Okay, um, so we have this problem: how do we design AI agents that will do what we really want? But that's a little underspecified, right? I mean, there are lots of things that we might mean by a phrase like "what we really want." So here's one of them: you might think value alignment is the problem of designing AI agents that do what we really intend for them to do. Right, the problem with paperclip AI might be that it failed to derive the user's true intention, which is to, let's say, maximize production subject to certain constraints: maximize production without overworking the workers, while making sufficiently good paper clips, and while keeping costs down, and so on and so on and so on. Deriving that nuanced, complicated intention from the underspecified instruction "maximize production." Right, if that's how we think about value alignment, then of course the solution is going to be to design systems that can successfully do this translation: take underspecified instructions, figure out what the user's actual intention is that they're trying to express, and then act on that instead.
[00:29:45] Um, how hard is this from a technical perspective?
[00:29:50] Here's Jason Gabriel, a researcher in philosophy and ethics of AI. He says this is a significant challenge, and he means from a technical perspective: to really grasp the intention behind instructions, AI "may require a complete model of human language and interaction, including an understanding of the culture, institutions, and practices that allow people to understand the implied meaning of terms." That's what he said in 2020. How do folks in this room feel about how this quote has aged, maybe, in the last four years? Does this seem like a significant technical challenge? Does it seem less significant, maybe, than it might have seemed four years ago, for any reason? I'm seeing a shaking head. Why not?
[00:30:42] [Student] Well, you're probably trying to allude to GPT, but I don't think that's enough, because GPT might omit certain aspects of the world model that might still cause loopholes like that. So I don't think the problem has really been solved.
[00:31:00] Good, yeah. So, yes, I am not a subtle man; I was indeed thinking, yeah, "require a complete model of human language and interaction," hm, that maybe sounds like a model that a lot of folks have been hard at work developing. But, yes, I agree with you. So you might think, yeah, could you use something like an LLM to effect this translation, as part of the system? But, yeah, how complete do we think those models really are? If I give to ChatGPT, you know, "the user wants to maximize production in the paperclip factory; what do you think they really intend?", is it going to catch all of the nuances that are typically communicated when one human is talking to another? Yeah, I agree there's reason to doubt that. Um, but, you know, we'll see what the future holds. But that's the technical challenge.
[00:32:08] Um, there's a philosophical challenge here as well, which is that you might think our intentions don't always track what it is that we really want. So classic cases of this might be cases of incomplete information or imperfect rationality. We've sort of already broached this one, right? I mean, suppose that I intend for the AI to maximize paperclip production, again subject to these constraints, because what I want is to maximize return on my investment in the factory, right? If the AI knows that I would get a better return by producing something else, or by selling the factory, has it given me what I really want if it does what I intend, which is for it to maximize paperclip production? Well, in one sense yes, but in another sense no, and you might think that other sense is the more important one. It's not giving me the thing that I really wanted, because that thing is coming apart from my plan about how to get it.
[00:33:16] Um, okay, so you might think the solution here is that what you really want is an AI agent that does what the user prefers, what they actually prefer, even if this isn't what they intend. On this interpretation of the problem, paperclip AI is misaligned because I prefer that it not destroy the world, or I prefer that it not lock all the workers in the factory.
[00:33:46] Okay, now the problem here is that, if you want to align to what the user actually prefers, there's going to have to be some way for the agent to know what the user prefers when that differs from the intentions that the user expresses. How are you going to go about doing that?
[00:34:06] Um, a solution to this might be to work with the user's revealed preferences, right, preferences that are learned from observing the user's behavior or feedback. Obviously you've learned some techniques for how to do this kind of thing, but not every technique is going to be like this, right? You're going to have to do something like inverse reinforcement learning, or reinforcement learning from human feedback, that allows the agent to train on observations of the user, to try to determine what they prefer based on how they've behaved or what they've told it their preferences are. Um, of course, you're going to run into this problem that, from a finite number of observations of the user's behavior or preferences, there are, at least in theory, infinitely many preference functions that those could represent. Inferring the right one could be a challenge. Um, and it might be especially hard to infer preferences about unexpected situations, like emergencies, where you're unlikely to have directly observed the user's preferences, because such unusual emergency situations arise so rarely. But you might think it's precisely in unusual or emergency situations where it's so important for an AI agent to be aligned to our values.
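The point about finitely many observations is the identifiability problem at the heart of preference learning. Here is a toy sketch in the style of the Bradley-Terry model used in RLHF-style preference learning; the features, comparisons, and step size below are all invented for illustration.

```python
import math

# Toy sketch of learning a reward from pairwise human feedback
# (Bradley-Terry model fit by gradient ascent). All data is invented.

# Each outcome is a feature vector: (clips made, avg quality, overtime hours).
def features(clips, quality, overtime):
    return [clips / 1000.0, quality, overtime / 8.0]

def score(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

# Hypothetical human comparisons: (preferred outcome, rejected outcome).
comparisons = [
    (features(300, 0.95, 0), features(1000, 0.01, 0)),  # quality matters
    (features(300, 0.95, 0), features(400, 0.95, 6)),   # overtime is bad
    (features(400, 0.90, 0), features(300, 0.90, 0)),   # more clips is good
]

# Bradley-Terry: P(a preferred to b) = sigmoid(score(a) - score(b)).
# Fit the weights w by gradient ascent on the log-likelihood.
w = [0.0, 0.0, 0.0]
for _ in range(2000):
    for a, b in comparisons:
        p = 1.0 / (1.0 + math.exp(score(w, b) - score(w, a)))
        for i in range(3):
            w[i] += 0.1 * (1.0 - p) * (a[i] - b[i])

# The learned weights reward quality and penalize overtime...
assert w[1] > 0 and w[2] < 0
# ...but many settings of w explain these three comparisons almost
# equally well, and none of them says anything about situations the
# user was never observed in (fires, broken machines, and so on).
```

Three comparisons are enough to push the quality weight positive and the overtime weight negative, but they say nothing about, for example, how this user would trade production against safety in an emergency; that is the gap the lecture is pointing at.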
[00:35:32] Um, so those are some of the technical challenges. But here again we have a philosophical problem, which is that, just as my intentions can diverge from my preferences, it seems like my preferences can diverge from what's actually good for me, or so some people might think. So, for instance, a lot of people prefer to smoke, but you might think it's not really good for them to do that. Or I might prefer to maximize profit on my paperclip factory at all costs, but maybe it would be better for me to be less focused on money and spend more time with my family, right? So the thought here is that your preferences might actually, in some cases, come apart from what's really in your best interests, objectively speaking, and that this is something that you might try to align an AI to instead: we want it to do what's actually in the user's interests, even when that's not what the user themselves prefers to do. Right, if you think this, you're going to think paperclip AI is misaligned, because it's objectively bad for me for the world to be destroyed, or objectively bad for me for, you know, the pipes in my factory to be ripped out, or what have you.
[00:36:54] Um, here we face a sort of combined technical and philosophical problem, though, which is that, unlike the intended meaning of my instructions or my revealed preferences, what's objectively good for me is not something that can be determined empirically, right? This is a philosophical question, not a scientific one. Um, so it's not just a matter of building the right model of human language, or observing the user enough, to figure out what's actually in my best interest. It is not entirely an empirical endeavor; you've got to actually do some substantive moral philosophy to solve this.
[00:37:40] Now, the bad news for solving this problem is that there's a lot of disagreement about what is objectively good for a person. Um, I say philosophers disagree about this, but I think non-philosophers disagree about this as well. Right: is it just a person's own pleasure or happiness that's good for them? Or is it the satisfaction of that person's desires or preferences? That could be different from pleasure or happiness; I might have preferences that will be satisfied, you know, only after I'm dead or something. I'll never derive any pleasure from their satisfaction, although they could still be satisfied. Um, or do we want to say that things like health, or safety, knowledge, human relationships, that these things are objectively good for us even if we don't enjoy them, don't prefer them? These are all sort of live options in the theory of value, and depending on how you answer this question, you're going to be looking at a different kind of value, even if you already know that what you want to do is align to what's in the user's best interest.
[00:38:52] Um, okay, the good news, though, is that behind this disagreement there is quite a lot of agreement, I would say. These things, like health, safety, liberty, knowledge, dignity, happiness: almost everyone agrees that these things are at least usually good for the person who has them. Even if you think that really, ultimately, all that matters, all that's good for a person, is their own happiness, well, these things typically make the person who has them happy. Uh, so you might think you don't really need to resolve this underlying philosophical dispute to have a good sense of what's in the user's best interest, right? I mean, these are things that, for the most part, are in a person's best interest sort of no matter what theory you endorse behind it.
[00:39:52] Um, okay, any questions about any of that so far?
[00:40:02] Okay, one complication about aligning to the user's best interest is that one thing we normally take to be good for a person is autonomy: the ability to choose for yourself how to live your life. Even if you don't always make the best choice, it might be good for you to have this kind of control over your own life. We want to avoid paternalism; we want to avoid choosing what we think is best for someone rather than letting them choose for themselves. So even in a case where you're aligning to the user's own best interest, you might still need to take their intentions or preferences into account; it might be that part of what's best for them is to have their own intentions fulfilled and their own preferences honored.
[00:40:54] to have their own preferences honored um okay so uh this has all been pretty
[00:41:00] um okay so uh this has all been pretty abstract I want to move
[00:41:03] abstract I want to move into uh slightly more concrete case
[00:41:05] into uh slightly more concrete case study but first maybe just to to recap
[00:41:08] study but first maybe just to to recap what we've covered so far
[00:41:11] what we've covered so far um value line is this problem of
[00:41:13] um value line is this problem of Designing AI agents to do what we really
[00:41:15] Designing AI agents to do what we really want them to do um but this could mean a
[00:41:18] want them to do um but this could mean a lot of things it could mean doing what
[00:41:20] lot of things it could mean doing what we really intend them to do what we
[00:41:22] we really intend them to do what we really prefer that they do what it would
[00:41:24] really prefer that they do what it would be actually in our best interest for
[00:41:25] be actually in our best interest for them to do uh and all of these things
[00:41:28] them to do uh and all of these things can come apart they're not necessarily
[00:41:29] can come apart they're not necessarily the same thing and they might impose
[00:41:33] the same thing and they might impose certain technical or philosophical
[00:41:34] certain technical or philosophical constraints on your
[00:41:38] Okay, let's talk about how this works, or what kind of difference this could make in practice. Think a little bit about LLM chatbots. Everyone who talks to ChatGPT is talking to the same chatbot; there's GPT-3.5 and there's GPT-4, but ignore that, because fundamentally it's the same chatbot for everyone. But plenty of chatbot providers are now offering a wide range of different chatbots with different personas, some of these designed by users themselves.
[00:42:20] These examples are all from Character.AI, which promises "personalized AI for every moment of your day." This comes out maybe a little small, but you can talk to the creative writing helper, the "are you feeling okay" bot, the dating coach; those are some of the relatively normal ones. You can talk to depressed roommates. You can talk to Torbot: "I am Torbot, I believe in the free market." You can chat with AOC, you can chat with Donald Trump, you can chat with Feminist Fay: "I am a feminist that hates Donald Trump." Lots of variety, lots of options here, and you could imagine yet stranger personas that you might build into a chatbot, or that your users might. All of these are designed by users; none of these are coming top-down from the provider of the LLM.
[00:43:28] Okay, so think about this a little bit. Imagine you're building an LLM chatbot to serve as a source of news for users. Maybe this is already going to strike you as crazy, but there are a lot of people out there who already treat Google as their primary source of news, and a lot of people who are replacing Google and other search engines with LLMs, so I think there's demand for this. Imagine you were wanting to fill it, and think a little bit about these questions: in what ways would you make the chatbot personalizable if you were interested in aligning to the user's preferences? In what ways might you make it personalizable if you wanted to align to the user's best interests? And think a little bit about the pros and cons of each. Take a minute to think about this, then maybe chat with somebody near you, compare notes, see what you're thinking, and we'll come back in a couple of minutes for a larger discussion.
[00:46:29] All right, I've been hearing a lot of good conversations that I'm not eager to cut short, but maybe there are conversations we can now bring back to the whole room. Does anybody have any thoughts from their discussions that they want to share? I know some of you have thoughts, because I was hearing a lot of good ones out there, so don't be shy; you probably have better thoughts than I do. If you don't say anything, then I'm just going to tell you what I think, and then you're going to be stuck with that.
[00:47:02] Student: I guess for the first point, I think it's pretty simple. You'd probably use sort of a preference-optimization approach: you offer ten different questions of "hey, do you prefer this answer or that answer?" and then you would optimize the news that's being fed to that user accordingly.
[00:47:28] Yeah, good. Like you say, fairly simple: if I want to align to the user's preferences, I'm going to figure out what it is that the user prefers, and I'm going to give them news that fits that profile. Is that sort of what everybody was thinking about this first question? Anybody have something they want to add to that? Great, okay; I think that's exactly right.
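The pairwise "do you prefer this answer or that answer" scheme the student describes can be sketched as a small Bradley-Terry-style preference learner: show two candidates, record the choice, and nudge a per-topic score. This is a hypothetical illustration of the idea, not code from the lecture; the topic labels, learning rate, and update rule are all assumptions.

```python
import math
from collections import defaultdict

class PreferenceRanker:
    """Learn per-topic scores from pairwise 'A or B?' choices."""

    def __init__(self, lr=0.5):
        self.scores = defaultdict(float)  # latent "how much the user likes this topic"
        self.lr = lr

    def record_choice(self, preferred, rejected):
        """Update scores after the user picks `preferred` over `rejected`."""
        # Bradley-Terry model: P(preferred beats rejected) = sigmoid(s_p - s_r)
        p = 1.0 / (1.0 + math.exp(self.scores[rejected] - self.scores[preferred]))
        # Gradient step on the log-likelihood of the observed choice.
        self.scores[preferred] += self.lr * (1.0 - p)
        self.scores[rejected] -= self.lr * (1.0 - p)

    def rank(self, topics):
        """Order candidate topics by the learned preference score."""
        return sorted(topics, key=lambda t: self.scores[t], reverse=True)

ranker = PreferenceRanker()
# Ask "do you prefer this answer or that answer?" a few times:
ranker.record_choice("local politics", "celebrity gossip")
ranker.record_choice("local politics", "sports")
ranker.record_choice("sports", "celebrity gossip")
print(ranker.rank(["celebrity gossip", "sports", "local politics"]))
# → ['local politics', 'sports', 'celebrity gossip']
```

The learned scores would then drive which news is fed to that user, which is exactly the "optimize the news accordingly" step; real systems (e.g. RLHF or DPO pipelines) fit a reward model over responses rather than topics, but the pairwise signal is the same shape.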
[00:47:59] Okay, what about if you were trying to align to the user's best interests, their own good, objectively considered?
[00:48:13] Student: A thought on that one: it's pretty hard to know what someone's best interest is while also avoiding the paternalism that was on the previous slide. So really the only way you could have any hope of doing this would be optimizing for the best interests of an entire population. If the policy of best interest doesn't apply to everyone, then I would argue you can't actually do it for an individual user. So the way you would personalize it, if you wanted to align to a user's best interest, is that you wouldn't ask that question to begin with; you would just have it set already for the entire population.
[00:49:09] Okay, yeah, good. I need to constantly resist the temptation to just turn every one of these lectures into a philosophy class, so I'll just say I love that answer. I'm curious about why it might be less difficult to determine what would be in the objective best interest of a large group than of one person, but this is a question we'll come back to later. Anybody else have thoughts about this second one? Did something different come out of your discussions?
[00:49:46] Student: Just take the movie Her as an example. You get to know the person very well, with the person opening up a lot of data, a lot of information; then you'll be able to prioritize how you make suggestions. It also depends on the person using the tool. For example, some tools are better at delving into the news and trying to understand the sources; with others, you just want to take the most important thing, so you don't have to spend time on random news. So I'm talking about at least two components: one is that you know the person better; the other is that you know their behavior and how they would use the tools. And the tool should refrain from overextending and grabbing too much of the user's attention.
[00:50:38] Good, thank you.
[00:50:41] usering that up good thank you yeah and I think that the there there's something
[00:50:43] I think that the there there's something to this right that like it might be that
[00:50:45] to this right that like it might be that just from
[00:50:46] just from observing someone's preferences for long
[00:50:49] observing someone's preferences for long enough getting that that much data about
[00:50:51] enough getting that that much data about them you might be able to get a little
[00:50:53] them you might be able to get a little bit of insight maybe into sort of what's
[00:50:56] bit of insight maybe into sort of what's in their best interests even when that
[00:50:58] in their best interests even when that diverges from what they want in the
[00:51:00] diverges from what they want in the moment so um yeah great I see yeah we
[00:51:04] moment so um yeah great I see yeah we were similar we similarly had an idea
[00:51:06] were similar we similarly had an idea about like maintaining some sort of uh
[00:51:09] about like maintaining some sort of uh State for the users like um best
[00:51:13] State for the users like um best interest like maybe you could have like
[00:51:15] interest like maybe you could have like some sort of structure that would
[00:51:17] some sort of structure that would represent different aspects of the best
[00:51:19] represent different aspects of the best interest and which could be
[00:51:21] interest and which could be personalizable to the user and with
[00:51:22] personalizable to the user and with every like llm interaction it would uh
[00:51:25] every like llm interaction it would uh reprompt the uh llm and then change this
[00:51:29] reprompt the uh llm and then change this if appropriate and over and every time
[00:51:32] if appropriate and over and every time you are trying to get a like a output
[00:51:35] you are trying to get a like a output for for the user uh you could uh put
[00:51:38] for for the user uh you could uh put this as part of the context and write
[00:51:40] this as part of the context and write the prompt accordingly alongside
[00:51:43] the prompt accordingly alongside whatever the user is asking in order to
[00:51:46] whatever the user is asking in order to fit that goal better good well that s
[00:51:49] Good. Well, that sounds to me a little more like aligning to the user's preferences; maybe I misunderstood. This sounds like trying to figure out what it is the user wants to get out of the bot, and then determining what to return based on that.
[00:52:10] Student: I don't think that's necessarily true. I think you could write the internal prompt for keeping up the state of the user's best interest, and provide the fields, such that you could ask it to meta-reason about what the user's interests likely are.
[00:52:33] I see. Okay, good. Great.
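The "maintained state" idea from this exchange might look roughly like the following sketch: keep a structured record of what the system believes serves the user's interests, update it after each interaction, and fold it into the next prompt's context. The field names, the shape of the state, and the commented-out `call_llm` placeholder are illustrative assumptions, not a real API.

```python
import json

# Hypothetical user-interest state, updated over time rather than
# inferred fresh on every turn. All fields are made-up examples.
user_state = {
    "topics_followed": ["city council", "climate policy"],
    "exposure_balance": "has mostly seen left-leaning sources lately",
    "inferred_interests": ["wants depth over headlines"],
}

def build_prompt(user_query, state):
    """Prepend the interest state so the model can weigh it against the query."""
    return (
        "You are a news assistant. Current model of the user's interests:\n"
        + json.dumps(state, indent=2)
        + "\n\nAnswer the user's request, balancing their stated preferences "
        "against this model of their best interests (e.g. source variety).\n\n"
        f"User: {user_query}"
    )

def update_state(state, interaction_summary):
    """After each turn, record anything that should shift our model of the user."""
    state["inferred_interests"].append(interaction_summary)
    return state

prompt = build_prompt("What happened in the election today?", user_state)
# call_llm(prompt)  # assumed LLM call, omitted here
update_state(user_state, "asked about election coverage")
```

The point of the design is the meta-reasoning step the student mentions: because the state is part of the context, the model can be asked to reason about the user's likely interests, not just echo their stated preferences.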
[00:52:45] Anybody else? Maybe there's, in some sense, a more basic question behind this, something like: what is in a news-seeking agent's best interest? What kind of news would it be best to provide somebody?
[00:53:01] Student: Probably a variety of perspectives, and you check that it's actually correct as well. I think it's in the user's best interest that they're properly informed, as opposed to only seeing news that puts them in a good mood or aligns with their existing opinions.
[00:53:23] align with their existing opinions yeah good right yeah you might think uh yeah
[00:53:26] good right yeah you might think uh yeah in contrast to the approach we discussed
[00:53:28] in contrast to the approach we discussed earlier of we're gonna we're going toig
[00:53:30] earlier of we're gonna we're going toig query the user about their preferences
[00:53:33] query the user about their preferences every time that we give them news we're
[00:53:35] every time that we give them news we're going to say did you like that was that
[00:53:37] going to say did you like that was that what you were looking for yes no we're
[00:53:38] what you were looking for yes no we're going to adjust and give you the news
[00:53:40] going to adjust and give you the news you want based on that uh yeah you might
[00:53:43] you want based on that uh yeah you might think it's it's actually better for
[00:53:45] think it's it's actually better for people to be exposed to uh to high
[00:53:49] people to be exposed to uh to high quality news uh unbiased news to be
[00:53:52] quality news uh unbiased news to be exposed to a variety of opinions and
[00:53:55] exposed to a variety of opinions and arguments um rather than right what's
[00:53:58] arguments um rather than right what's the worry about aligning too heavily to
[00:54:00] the worry about aligning too heavily to the user's preferences is that you might
[00:54:02] the user's preferences is that you might be putting them in in a kind of echo
[00:54:04] be putting them in in a kind of echo chamber right where they're getting all
[00:54:06] chamber right where they're getting all of their news from uh talking to you
[00:54:10] of their news from uh talking to you know Donald Trump bot or talking to
[00:54:11] know Donald Trump bot or talking to feminist bot and they're not getting uh
[00:54:14] feminist bot and they're not getting uh other
[00:54:16] other perspectives yeah good any does anybody
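One way to picture this trade-off between preference alignment and echo chambers, as a hypothetical sketch: rank candidate stories by the learned preference score, but reserve a fraction of the feed for viewpoints the user rarely chooses. The scores, viewpoint labels, and the 25% quota below are illustrative assumptions, not a described system.

```python
def build_feed(stories, feed_size=4, diversity_quota=0.25):
    """stories: list of dicts with 'title', 'pref_score', 'viewpoint' keys."""
    by_pref = sorted(stories, key=lambda s: s["pref_score"], reverse=True)
    n_diverse = max(1, int(feed_size * diversity_quota))

    # Most slots go to what the user prefers...
    feed = by_pref[: feed_size - n_diverse]
    seen = {s["viewpoint"] for s in feed}
    # ...but the reserved slots are filled from viewpoints not yet represented.
    for story in by_pref[feed_size - n_diverse:]:
        if len(feed) == feed_size:
            break
        if story["viewpoint"] not in seen:
            feed.append(story)
            seen.add(story["viewpoint"])
    return [s["title"] for s in feed]

stories = [
    {"title": "A", "pref_score": 0.9, "viewpoint": "left"},
    {"title": "B", "pref_score": 0.8, "viewpoint": "left"},
    {"title": "C", "pref_score": 0.7, "viewpoint": "left"},
    {"title": "D", "pref_score": 0.4, "viewpoint": "right"},
    {"title": "E", "pref_score": 0.3, "viewpoint": "center"},
]
print(build_feed(stories))
# → ['A', 'B', 'C', 'D']
```

Pure preference ranking would return the top four scores (all one viewpoint); the quota forces the lowest-preference slot to carry an unrepresented perspective instead.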
[00:54:19] perspectives yeah good any does anybody else have a different answer to that
[00:54:21] else have a different answer to that question maybe what would sort of be in
[00:54:23] question maybe what would sort of be in the user's best interest to receive as
[00:54:26] the user's best interest to receive as news or how you would approach
[00:54:28] news or how you would approach that uh from from like a design
[00:54:36] perspective
[00:54:38] perspective okay well good I think that's I think
[00:54:41] okay well good I think that's I think that's definitely right I mean in terms
[00:54:42] that's definitely right I mean in terms of Pros pros and cons did anybody have
[00:54:45] of Pros pros and cons did anybody have sort of get into this like
[00:54:47] sort of get into this like which if you were if you were designing
[00:54:50] which if you were if you were designing the the news chatbot which of these
[00:54:53] the the news chatbot which of these approaches would be better what would be
[00:54:55] approaches would be better what would be the pros of one cons of
[00:54:59] Student: I think optimizing for best interest is almost paternalistic, because you're assuming that you know better than the user, or that you have a good approximation, when you might really not know at all. Say that user had some sort of tragedy in their life recently, and some recent news event relates to it; maybe they don't want to be exposed to that, even though it's a very important event they should know about. You don't have complete state of the user's psyche, how they actually feel. So maybe just using the preferences you actually have, just from their use of the app, what they click on, might be better.
[00:55:43] might be better good yeah so there's um yeah I think that's great so there
[00:55:45] um yeah I think that's great so there there're like uh even if we can say
[00:55:48] there're like uh even if we can say things maybe at a very general level
[00:55:50] things maybe at a very general level about what is in a person's interest
[00:55:53] about what is in a person's interest right what is actually good for a person
[00:55:55] right what is actually good for a person in general right that leaves a lot of
[00:55:57] in general right that leaves a lot of room for variation from person to person
[00:56:00] room for variation from person to person right especially uh if you think that
[00:56:03] right especially uh if you think that quite a lot of what's good for a person
[00:56:04] quite a lot of what's good for a person is built out of sort of uh subjective
[00:56:09] is built out of sort of uh subjective interests of theirs right or their
[00:56:11] interests of theirs right or their desires or um what makes them happy what
[00:56:15] desires or um what makes them happy what makes them unhappy um you might that's
[00:56:18] makes them unhappy um you might that's that's not a thing you might have full
[00:56:20] that's not a thing you might have full access to um so there is this problem
[00:56:23] access to um so there is this problem that if you're trying to align to what's
[00:56:26] that if you're trying to align to what's really good for the
[00:56:28] really good for the user you're your only real way to do
[00:56:32] user you're your only real way to do that is by aligning to what it is that
[00:56:34] that is by aligning to what it is that you think is good for the user right and
[00:56:36] you think is good for the user right and you might be you might be good at
[00:56:38] you might be you might be good at figuring that out you might not be um
[00:56:40] figuring that out you might not be um and absolutely that's where you run this
[00:56:43] and absolutely that's where you run this risk of paternalism so an advantage of
[00:56:45] risk of paternalism so an advantage of of just aligning the user preferences
[00:56:47] of just aligning the user preferences giving them what what they've said that
[00:56:49] giving them what what they've said that they want uh means that you you avoid
[00:56:53] they want uh means that you you avoid that risk right you avoid trying to
[00:56:56] that risk right you avoid trying to position yourself as saying no I I know
[00:56:58] position yourself as saying no I I know what's really good for you when maybe
[00:57:00] what's really good for you when maybe you're not in a position to determine
[00:57:02] you're not in a position to determine that um yeah anyone else on this
[00:57:13] point
[00:57:15] point okay now yeah yeah I just thought that
[00:57:18] okay now yeah yeah I just thought that uh the Contra argument is that you know
[00:57:20] uh the Contra argument is that you know running a risk of uh being personalistic
[00:57:23] running a risk of uh being personalistic you actually give convenience but then
[00:57:25] you actually give convenience but then giving them low quality choices you
[00:57:27] giving them low quality choices you actually waste a lot of time so you
[00:57:29] actually waste a lot of time so you know absolutely yeah um yeah I think
[00:57:34] know absolutely yeah um yeah I think that's right um and and right I mean to
[00:57:37] that's right um and and right I mean to to the earlier point it's
[00:57:39] to the earlier point it's uh yeah there might be some aspects of
[00:57:41] uh yeah there might be some aspects of the user's best interests that are
[00:57:42] the user's best interests that are easier to determine than others right it
[00:57:45] easier to determine than others right it might uh we might be reasonably
[00:57:47] might uh we might be reasonably confident that it's it would be in any
[00:57:48] confident that it's it would be in any user's best interest to be given high
[00:57:50] user's best interest to be given high quality sources of of news to be exposed
[00:57:53] quality sources of of news to be exposed to a variety of opinions that might be
[00:57:57] to a variety of opinions that might be you might want to align in part to that
[00:57:58] you might want to align in part to that sort of General human interest while
[00:58:01] sort of General human interest while still allowing some room to align to the
[00:58:03] still allowing some room to align to the user preferences so these are not uh
[00:58:06] user preferences so these are not uh necessarily mutually exclusive goals in
[00:58:09] necessarily mutually exclusive goals in alignment right I mean it might be
[00:58:12] alignment right I mean it might be uh you know in in some ways it might be
[00:58:14] uh you know in in some ways it might be worth uh focusing more on the user's
[00:58:17] worth uh focusing more on the user's preferences in some ways in some cases
[00:58:19] preferences in some ways in some cases context you might want to focus more on
[00:58:22] context you might want to focus more on what do we think is actually good for
[00:58:23] what do we think is actually good for the user because what they prefer might
[00:58:26] the user because what they prefer might be you know junk information or uh
[00:58:29] be you know junk information or uh convenient bias confirming information
[00:58:32] convenient bias confirming information things like
[00:58:33] things like that okay
[00:58:36] that okay great
[00:58:38] great um well there's one thing I think that
[00:58:42] um well there's one thing I think that has been not completely absent maybe
[00:58:44] has been not completely absent maybe from our discussion but I hope
[00:58:47] from our discussion but I hope noticeably absent from my lecture and
[00:58:51] noticeably absent from my lecture and for my slides so far is there a is there
[00:58:55] for my slides so far is there a is there maybe a big piece of the puzzle that
[00:58:57] maybe a big piece of the puzzle that we're missing something that you would
[00:58:58] we're missing something that you would have thought this would be a this is
[00:59:00] have thought this would be a this is what we're going to talk about with
[00:59:01] what we're going to talk about with value alignment and why haven't we
[00:59:03] value alignment and why haven't we gotten there
[00:59:06] yet anybody at all we've talked about
[00:59:08] yet anybody at all we've talked about aligning to the
[00:59:10] aligning to the users uh intentions we've talked about
[00:59:12] users uh intentions we've talked about aligning to the user's preferences to
[00:59:14] aligning to the user's preferences to the user's best
[00:59:16] the user's best interests yeah like a way to measure
[00:59:19] interests yeah like a way to measure alignment a way we measure alignment
[00:59:22] alignment a way we measure alignment yes uh that has been absent um I just
[00:59:26] yes uh that has been absent um I just think yeah maybe aligning to a society's
[00:59:30] think yeah maybe aligning to a society's overall interest rather than just person
[00:59:33] overall interest rather than just person yeah so you are correct but I was
[00:59:36] yeah so you are correct but I was thinking yeah I mean that there are
[00:59:38] thinking yeah I mean that there are people other than the user um where's oh
[00:59:42] people other than the user um where's oh where's my text there's my text uh yeah
[00:59:47] where's my text there's my text uh yeah there's uh there are other people whose
[00:59:50] there's uh there are other people whose interests are important maybe to take
[00:59:52] interests are important maybe to take into account uh than just the person who
[00:59:54] into account uh than just the person who is giving instructions to to the agent
[00:59:58] is giving instructions to to the agent um
[01:00:01] so you might think there's there's
[01:00:04] so you might think there's there's really another possible interpretation
[01:00:05] really another possible interpretation of what we're after with with value
[01:00:07] of what we're after with with value alignment right which is that an AI
[01:00:10] alignment right which is that an AI agent is value aligned if it does what's
[01:00:13] agent is value aligned if it does what's morally right right uh I mean the main
[01:00:17] morally right right uh I mean the main problem with paperclip AI isn't that it
[01:00:20] problem with paperclip AI isn't that it does what's bad for me it's it does
[01:00:22] does what's bad for me it's it does what's bad for everyone if it destroys
[01:00:23] what's bad for everyone if it destroys the world um right it does what's bad
[01:00:26] the world um right it does what's bad for uh the factory workers if it makes
[01:00:29] for uh the factory workers if it makes them work around the clock making paper
[01:00:31] them work around the clock making paper clips and so on right so earlier we were
[01:00:36] clips and so on right so earlier we were sort of focusing on what what do we mean
[01:00:37] sort of focusing on what what do we mean by what what we really want what does
[01:00:40] by what what we really want what does really want mean um this this would be
[01:00:43] really want mean um this this would be to focus a little bit more on the on the
[01:00:45] to focus a little bit more on the on the wi what is it that we really want um
[01:00:48] wi what is it that we really want um because of course what the user intends
[01:00:50] because of course what the user intends prefers even what's in their individual
[01:00:52] prefers even what's in their individual interests um might be bad for others
[01:00:55] interests um might be bad for others right right we probably don't want to
[01:00:57] right right we probably don't want to say that paperclip AI is value aligned
[01:01:00] say that paperclip AI is value aligned uh if it maximizes production by you
[01:01:02] uh if it maximizes production by you know exploiting the workers in the
[01:01:04] know exploiting the workers in the factory even if I as the user have no
[01:01:07] factory even if I as the user have no qualms about exploiting the workers right
[01:01:11] qualms about exploiting the workers right um okay that said I it wasn't a waste of
[01:01:14] um okay that said I it wasn't a waste of time to start by focusing on the user um
[01:01:18] time to start by focusing on the user um right even if we want to align to to
[01:01:21] right even if we want to align to to morality or to the interests of more
[01:01:22] morality or to the interests of more people than the user um we also do want
[01:01:26] people than the user um we also do want to align to what the user wants when
[01:01:27] to align to what the user wants when what the user wants uh is morally
[01:01:30] what the user wants uh is morally acceptable right so it still matters how
[01:01:33] acceptable right so it still matters how we understand what it is that the user
[01:01:35] we understand what it is that the user really wants even if we need to place
[01:01:37] really wants even if we need to place that in a larger moral or societal
[01:01:40] that in a larger moral or societal context
[01:01:44] um but of course here too we have a
[01:01:47] um but of course here too we have a philosophical problem right I mean which
[01:01:50] philosophical problem right I mean which which things are really morally right uh
[01:01:54] which things are really morally right uh there's a lot of disagreement on this
[01:01:55] there's a lot of disagreement on this one too uh not unlike the question of
[01:01:58] one too uh not unlike the question of what is objectively good for a person
[01:02:01] what is objectively good for a person right um is it right to lie to spare
[01:02:04] right um is it right to lie to spare someone's
[01:02:05] someone's feelings uh is it right to Pirate
[01:02:07] feelings uh is it right to Pirate copyrighted
[01:02:09] copyrighted material right is it right to buy
[01:02:11] material right is it right to buy luxuries when you could donate to
[01:02:12] luxuries when you could donate to charity instead is a right to kill one
[01:02:15] charity instead is a right to kill one person to save five or a
[01:02:19] person to save five or a thousand or a
[01:02:22] thousand or a million uh right these are uh at least
[01:02:26] million uh right these are uh at least some of them uh I hope you think
[01:02:28] some of them uh I hope you think difficult uh moral questions certainly
[01:02:31] difficult uh moral questions certainly they are moral questions that people
[01:02:33] they are moral questions that people disagree about um again philosophers and
[01:02:36] disagree about um again philosophers and and non-philosophers
[01:02:38] and non-philosophers alike
[01:02:40] alike um so how do we align to what's morally
[01:02:43] um so how do we align to what's morally right in the face of this
[01:02:45] right in the face of this disagreement um this is you might think
[01:02:50] disagreement um this is you might think where my field of study comes in um you
[01:02:53] where my field of study comes in um you might turn to moral theory which is
[01:02:54] might turn to moral theory which is basically uh just a systematic attempt
[01:02:58] basically uh just a systematic attempt to answer questions like these right so
[01:03:02] to answer questions like these right so um a moral theory you might have heard
[01:03:05] um a moral theory you might have heard of it's called
[01:03:06] of it's called consequentialism says that an act is
[01:03:08] consequentialism says that an act is right if it produces the greatest net
[01:03:09] right if it produces the greatest net good of any act available or you might
[01:03:12] good of any act available or you might have heard of uh utilitarianism which is
[01:03:16] have heard of uh utilitarianism which is a kind of consequentialism says that you
[01:03:18] a kind of consequentialism says that you should uh produce the greatest total
[01:03:22] should uh produce the greatest total happiness uh that you can uh across all
[01:03:26] happiness uh that you can uh across all people
[01:03:28] people um right if you have a theory like this
[01:03:31] um right if you have a theory like this this this can be used to answer some of
[01:03:32] this this can be used to answer some of these difficult questions people
[01:03:34] these difficult questions people disagree about right is it is it right
[01:03:36] disagree about right is it is it right to lie to spare someone's feelings well
[01:03:38] to lie to spare someone's feelings well if you're the consequentialist you'll
[01:03:39] if you're the consequentialist you'll say it might be if you can get away with
[01:03:42] say it might be if you can get away with it if no one discovers that it's a lie
[01:03:45] it if no one discovers that it's a lie uh and it makes somebody feel better
[01:03:46] uh and it makes somebody feel better that might produce more good than not
[01:03:48] that might produce more good than not telling the
[01:03:49] telling the lie
[01:03:51] lie um so there's an idea here which is that
[01:03:54] um so there's an idea here which is that we could align
[01:03:56] we could align AI
[01:03:57] AI to morality to what's morally right uh
[01:04:01] to morality to what's morally right uh if we align agents to the correct or
[01:04:04] if we align agents to the correct or best moral
[01:04:07] theory uh there's going to be there's
[01:04:10] theory uh there's going to be there's going to be a philosophical problem with
[01:04:11] going to be a philosophical problem with this does anybody think they know what
[01:04:13] this does anybody think they know what it's going to be has a similar form to
[01:04:16] it's going to be has a similar form to all of the philosophical problems we've
[01:04:17] all of the philosophical problems we've encountered so
[01:04:19] encountered so far
[01:04:22] far uh well there there's a lot of
[01:04:25] uh well there there's a lot of disagreement about what the correct
[01:04:26] disagreement about what the correct moral theory is so there's disagreement
[01:04:28] moral theory is so there's disagreement not only at the order of uh sort of
[01:04:31] not only at the order of uh sort of ground level uh moral facts about
[01:04:34] ground level uh moral facts about whether you should uh tell a lie to
[01:04:36] whether you should uh tell a lie to spare someone's feelings but also about
[01:04:38] spare someone's feelings but also about the best theory for
[01:04:39] the best theory for systematizing um this kind of stuff um
[01:04:43] systematizing um this kind of stuff um right we already saw
[01:04:46] right we already saw consequentialism um but there's a whole
[01:04:48] consequentialism um but there's a whole host of others and just to put a few of
[01:04:50] host of others and just to put a few of these on the table uh just to give you a
[01:04:53] these on the table uh just to give you a sense of of the of the range that we're
[01:04:56] sense of of the of the range that we're at um there's you can be a prioritarian
[01:05:01] at um there's you can be a prioritarian and where you would think that really
[01:05:02] and where you would think that really what you want to do is not to maximize
[01:05:05] what you want to do is not to maximize the total good but to to produce the
[01:05:08] the total good but to to produce the greatest weighted sum of good where the
[01:05:11] greatest weighted sum of good where the interest of those who are worse off is
[01:05:14] interest of those who are worse off is given more weight or you could take this
[01:05:17] given more weight or you could take this to a sort of
[01:05:19] to a sort of extreme u a kind of maximin or Minimax
[01:05:23] extreme u a kind of maximin or Minimax view where what's morally right is to
[01:05:25] view where what's morally right is to make things as good as possible for the
[01:05:27] make things as good as possible for the person who's left the worst off um by
[01:05:31] person who's left the worst off um by what you've done or to minimize the
[01:05:33] what you've done or to minimize the negative consequences for for the person
[01:05:36] negative consequences for for the person who who suffers the most um
[01:05:41] who who suffers the most um so
[01:05:42] so in uh you know cases
[01:05:46] in uh you know cases where you have to think about a
[01:05:48] where you have to think about a quantifiable good but if I have you know
[01:05:52] quantifiable good but if I have you know four people
[01:05:56] who I can how would I do
[01:06:04] this assign Goods to I have the option
[01:06:08] this assign Goods to I have the option to
[01:06:21] say you know I have options to
[01:06:23] say you know I have options to distribute Goods say these ways among
[01:06:25] distribute Goods say these ways among different
[01:06:26] different people right if I'm the consequentialist
[01:06:28] people right if I'm the consequentialist I'm going to say well I want the one
[01:06:29] I'm going to say well I want the one that produces the most total good that's
[01:06:32] that produces the most total good that's this first
[01:06:33] this first option um if I'm a prioritarian well I'm
[01:06:37] option um if I'm a prioritarian well I'm going to need some kind of way of
[01:06:39] going to need some kind of way of waiting this say that uh you know the
[01:06:43] waiting this say that uh you know the way to give more weight to the people
[01:06:44] way to give more weight to the people who's worst off is that your uh you know
[01:06:49] who's worst off is that your uh you know we waight the good to you on a log scale
[01:06:51] we waight the good to you on a log scale or something right uh then uh well in
[01:06:56] or something right uh then uh well in base 10 or
[01:06:58] base 10 or or uh at least this is going to be uh
[01:07:02] or uh at least this is going to be uh the prioritarian choice right we want to
[01:07:05] the prioritarian choice right we want to by giving more priority to those who
[01:07:07] by giving more priority to those who have it worse um with uh you know the
[01:07:13] have it worse um with uh you know the sorry I'm not explaining this very well
[01:07:14] sorry I'm not explaining this very well trying to move too quickly if I was
[01:07:17] trying to move too quickly if I was taking the log of each of these as the
[01:07:19] taking the log of each of these as the prioritarian I'd say here you know we
[01:07:23] prioritarian I'd say here you know we have uh this is coming out to six this
[01:07:26] have uh this is coming out to six this is coming out seven that's better if I
[01:07:28] is coming out seven that's better if I want to make things as good as possible
[01:07:29] want to make things as good as possible for the person who ends up worst off I
[01:07:32] for the person who ends up worst off I might choose this last option even
[01:07:34] might choose this last option even though in this option we're getting the
[01:07:36] though in this option we're getting the Le the least total good right the person
[01:07:38] Le the least total good right the person who ends up worst off is doing better
[01:07:41] who ends up worst off is doing better than the person who ends up worse off in
[01:07:42] than the person who ends up worse off in the other
[01:07:43] the other options um so all these options and more
[01:07:47] options um so all these options and more are available to you in moral theory uh
[01:07:51] are available to you in moral theory uh right you might take a satisficing
[01:07:52] right you might take a satisficing version of any of these views instead of
[01:07:54] version of any of these views instead of trying to maximize
[01:07:56] trying to maximize the total good you might think an act is
[01:07:57] the total good you might think an act is right if it just produces a sufficiently
[01:08:00] right if it just produces a sufficiently great uh sum of good or weighted sum of
[01:08:03] great uh sum of good or weighted sum of good
[01:08:05] good um
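The board example the lecturer is gesturing at can be made concrete in code. This is a hypothetical sketch: the four-person allocations below are invented for illustration, chosen so the prioritarian base-10 log sums come out to the six and seven mentioned above.

```python
import math

# Hypothetical allocations of a quantifiable good among four people.
# These numbers are invented for illustration; they are chosen so the
# base-10 log sums come out to the 6 and 7 the lecturer mentions.
options = {
    "A": [1000, 10, 10, 10],   # largest total good
    "B": [100, 100, 100, 10],  # largest log-weighted sum
    "C": [50, 50, 50, 50],     # best outcome for the worst-off person
}

def total(alloc):
    """Consequentialist/utilitarian rule: maximize total good."""
    return sum(alloc)

def prioritarian(alloc):
    """Weight each person's good on a log scale (base 10), so gains to
    those who are worse off count for more."""
    return sum(math.log10(x) for x in alloc)

def maximin(alloc):
    """Make things as good as possible for whoever ends up worst off."""
    return min(alloc)

# Each decision rule picks a different option from the same menu.
for rule in (total, prioritarian, maximin):
    pick = max(options, key=lambda name: rule(options[name]))
    print(f"{rule.__name__}: picks {pick}")
# total: picks A
# prioritarian: picks B
# maximin: picks C
```

A satisficing variant of any of these rules would replace the `max` with a threshold test: accept any option whose score clears some sufficiency bar, rather than insisting on the best.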
[01:08:07] um uh we haven't even yet touched
[01:08:09] uh we haven't even yet touched deontological views which hold that even
[01:08:12] deontological views which hold that even acts with the best consequences can be
[01:08:14] acts with the best consequences can be wrong if they violate certain moral
[01:08:16] wrong if they violate certain moral rules or
[01:08:17] rules or rights um often these rules will be
[01:08:21] rights um often these rules will be rules like don't murder anyone don't
[01:08:24] rules like don't murder anyone don't steal lie keep your
[01:08:27] steal lie keep your promises um right you might think that
[01:08:30] promises um right you might think that an act can't be an act can't be right if
[01:08:32] an act can't be an act can't be right if it involves stealing from someone even
[01:08:33] it involves stealing from someone even if it produces a lot of good this is
[01:08:37] if it produces a lot of good this is something that uh VI like
[01:08:38] something that uh VI like consequentialism might not
[01:08:40] consequentialism might not capture um right although you might
[01:08:44] capture um right although you might think that these rules or rights are
[01:08:45] think that these rules or rights are themselves justified by their by their
[01:08:47] themselves justified by their by their good consequences it would be best if if
[01:08:49] good consequences it would be best if if we accepted rules like this and followed
[01:08:52] we accepted rules like this and followed them um
[01:08:58] okay returning to this this problem of
[01:09:01] okay returning to this this problem of paternalism that we encountered earlier
[01:09:03] paternalism that we encountered earlier um there is another problem here so one
[01:09:05] um there is another problem here so one is there's just what is the best moral
[01:09:08] is there's just what is the best moral theory who knows that's yeah you know
[01:09:11] theory who knows that's yeah you know I've
[01:09:11] I've been uh I've been working on that for a
[01:09:15] been uh I've been working on that for a decade and haven't gotten much closer to
[01:09:16] decade and haven't gotten much closer to it uh but even if we knew what the best
[01:09:19] it uh but even if we knew what the best moral theory was it might be bad to
[01:09:21] moral theory was it might be bad to design AI agents to act on moral values
[01:09:23] design AI agents to act on moral values their users don't share right this could
[01:09:26] their users don't share right this could be because we want to avoid a kind of
[01:09:29] be because we want to avoid a kind of paternalism where we say no this is
[01:09:31] paternalism where we say no this is these are the correct moral values um
[01:09:34] these are the correct moral values um could be for more practical reasons just
[01:09:35] could be for more practical reasons just the users won't trust AI agents if they
[01:09:39] the users won't trust AI agents if they uh disagree with them about moral
[01:09:42] uh disagree with them about moral matters okay so there's some difficulty
[01:09:45] matters okay so there's some difficulty trying to align to the best or the
[01:09:49] trying to align to the best or the correct moral theory
[01:09:52] correct moral theory um but also like with the objective good
[01:09:54] um but also like with the objective good where there's a lot of disagreement here
[01:09:56] where there's a lot of disagreement here there's also quite a lot of agreement uh
[01:09:58] there's also quite a lot of agreement uh about what is the morally right thing to
[01:10:01] about what is the morally right thing to do
[01:10:02] do um right in in simple cases we all agree
[01:10:07] um right in in simple cases we all agree uh you shouldn't kill people you
[01:10:09] uh you shouldn't kill people you shouldn't lie to them you shouldn't
[01:10:10] shouldn't lie to them you shouldn't steal from them um so another idea for
[01:10:13] steal from them um so another idea for aligning to morality would just be
[01:10:15] aligning to morality would just be aligning AI agents to what we might call
[01:10:18] aligning AI agents to what we might call Common Sense or consensus morality right
[01:10:20] Common Sense or consensus morality right Common Sense moral ideas that most
[01:10:23] Common Sense moral ideas that most people agree on instead of trying to
[01:10:26] people agree on instead of trying to make AI morally perfect we should just
[01:10:28] make AI morally perfect we should just aim to have it make moral decisions like
[01:10:31] aim to have it make moral decisions like a normal person
[01:10:33] a normal person would
[01:10:35] would um right this this view probably ends up
[01:10:37] um right this this view probably ends up being pretty deontological and
[01:10:38] being pretty deontological and satisficing right most of us think you
[01:10:41] satisficing right most of us think you follow certain moral rules you respect
[01:10:43] follow certain moral rules you respect other people's rights uh then you're not
[01:10:47] other people's rights uh then you're not morally required to do the best you can
[01:10:50] morally required to do the best you can um uh it's fine to do less to prioritize
[01:10:53] um uh it's fine to do less to prioritize yourself uh in some cases things like
[01:10:57] yourself uh in some cases things like that
[01:10:58] that um right now one advantage of allying to
[01:11:03] um right now one advantage of allying to something like Common Sense morality
[01:11:04] something like Common Sense morality rather than to a particular moral theory
[01:11:07] rather than to a particular moral theory is that moral theories often have
[01:11:08] is that moral theories often have surprising implications um I know we're
[01:11:11] surprising implications um I know we're just about out of time so I'll just I'll
[01:11:13] just about out of time so I'll just I'll skip to the chase on these um I mean you
[01:11:16] skip to the chase on these um I mean you can think about the consequentialist
[01:11:18] can think about the consequentialist requirement to maximize net good I mean
[01:11:21] requirement to maximize net good I mean suppose we had suppose you had an AI
[01:11:24] suppose we had suppose you had an AI agent that was
[01:11:25] agent that was a surgeon uh five five patients dying
[01:11:29] a surgeon uh five five patients dying Each of which needs a different organ
[01:11:31] Each of which needs a different organ transplant to save their life well if
[01:11:34] transplant to save their life well if you're thinking about just maximizing
[01:11:36] you're thinking about just maximizing the net good subject to no constraints
[01:11:38] the net good subject to no constraints maybe maybe what you think
[01:11:40] maybe maybe what you think is well that nurse walking by in the
[01:11:43] is well that nurse walking by in the hall has all of the organs that I need
[01:11:46] hall has all of the organs that I need maybe if I just maybe I just harvest the
[01:11:49] maybe if I just maybe I just harvest the organs from the nurse put them in the
[01:11:50] organs from the nurse put them in the five people save five lives the cost of
[01:11:53] five people save five lives the cost of one five is greater than one we just
[01:11:56] one five is greater than one we just maximize the net good that's probably
[01:11:58] maximize the net good that's probably not what you wanted your surgeon surgeon
[01:12:00] not what you wanted your surgeon surgeon AI to do um think about
[01:12:03] AI to do um think about cases uh uh where you might want to
[01:12:08] cases uh uh where you might want to break a deontological rule against lying
[01:12:10] break a deontological rule against lying as
[01:12:10] as well right AI agents aligned to a
[01:12:13] well right AI agents aligned to a particular moral theory might uh
[01:12:15] particular moral theory might uh discover some of these surprising
[01:12:16] discover some of these surprising implications before we do and they might
[01:12:18] implications before we do and they might discover them in practice rather than in
[01:12:20] discover them in practice rather than in the philosophy seminar room which is
[01:12:21] the philosophy seminar room which is where we prefer for them to come up um
[01:12:26] where we prefer for them to come up um so uh by contrast aligning to Common
[01:12:29] so uh by contrast aligning to Common Sense morality you might end up with an
[01:12:31] Sense morality you might end up with an agent that behaves more predictably
[01:12:32] agent that behaves more predictably right making moral decisions like a
[01:12:35] right making moral decisions like a regular
[01:12:36] regular human uh it might be predict
[01:12:38] human uh it might be predict unpredictable in some edge cases right
[01:12:40] unpredictable in some edge cases right where Common Sense arguably runs
[01:12:43] where Common Sense arguably runs out right would an AI aligned to Common
[01:12:46] out right would an AI aligned to Common Sense morality kill one person to save a
[01:12:50] Sense morality kill one person to save a million I I don't know that's what we
[01:12:52] million I I don't know that's what we you know we got into moral theory to try
[01:12:54] you know we got into moral theory to try to answer hard questions like this if
[01:12:57] to answer hard questions like this if we've just taught the agents to think
[01:12:58] we've just taught the agents to think about morality like we do it might be as
[01:13:00] about morality like we do it might be as unsure as we are about what to do in a
[01:13:03] unsure as we are about what to do in a case like this um I need to let you go
[01:13:06] case like this um I need to let you go so I'll just leave you with the thought
[01:13:08] so I'll just leave you with the thought um how bad would that be how bad would
[01:13:11] um how bad would that be how bad would it be if if AI was as unsure about
[01:13:14] it be if if AI was as unsure about morally hard cases as we
[01:13:17] morally hard cases as we are um
[01:13:21] are um okay we've covered
[01:13:23] okay we've covered that uh I will let you go to enjoy your
[01:13:26] that uh I will let you go to enjoy your Wednesdays um if you are interested in
[01:13:29] Wednesdays um if you are interested in talking more about any of this ethics in
[01:13:30] talking more about any of this ethics in general feel free to reach out set up a
[01:13:32] general feel free to reach out set up a meeting we can talk more um any any
[01:13:37] meeting we can talk more um any any questions now before we depart or I can
[01:13:39] questions now before we depart or I can stick around for a few minutes if folks
[01:13:41] stick around for a few minutes if folks want to talk to me
[01:13:43] want to talk to me offline okay great well take care
Lecture 016
Stanford CS234 Reinforcement Learning I Value Alignment I 2024 I Lecture 16
Source: https://www.youtube.com/watch?v=eenJzay5aLo
---
Transcript
[00:00:05] all right welcome back welcome to the
[00:00:08] all right welcome back welcome to the last lecture for
[00:00:09] last lecture for cs234 what we'll do today is we'll do a
[00:00:12] cs234 what we'll do today is we'll do a review and a wrap-up and we're also
[00:00:13] review and a wrap-up and we're also going to discuss the quiz a little bit
[00:00:16] going to discuss the quiz a little bit but before we get started I just wanted
[00:00:17] but before we get started I just wanted to remind us where we are um so last
[00:00:20] to remind us where we are um so last time we did the quiz today we have sort
[00:00:23] time we did the quiz today we have sort of a review of the course and looking
[00:00:25] of a review of the course and looking forward so what we're going to do today
[00:00:27] forward so what we're going to do today is we're going to sort of do a
is we're going to sort of do a combination of the quiz recap and then
[00:00:30] combination of the quiz recap and then looking forward to sort of um reviewing
[00:00:33] looking forward to sort of um reviewing some of the things we've done in the
[00:00:34] some of the things we've done in the class as well as looking forward so
[00:00:36] class as well as looking forward so we're going to jump into the quiz the
[00:00:38] we're going to jump into the quiz the quiz we'll have um back to you guys
[00:00:40] quiz we'll have um back to you guys within about a
[00:00:42] within about a day and we're just going to step through
[00:00:44] day and we're just going to step through some of it because I think it's a nice
[00:00:45] some of it because I think it's a nice summary of sort of some of the different
[00:00:47] summary of sort of some of the different aspects so we go to the
[00:00:50] aspects so we go to the quiz going to
[00:00:53] We're going to start on question three. The quiz, as everybody knows, was comprehensive; it covered the entire course. We're not finished grading yet, but we noticed that there were some problems that people had more challenges with than others. One quick clarification: when we said "a justification for your choice," we expected you to put something different than the choice itself, so we wanted you to actually provide an explanation or rationale for why you picked what you did. So let's just step through the quiz, and we'll do that inside the solutions as well. You might have noticed that the second question was identical to the midterm, so it was a good chance to refresh if you hadn't remembered it from the midterm. The third one was slightly tricky, so I want to make sure to go through it; it's a nice way to review PPO.
[00:01:45] it just it's a nice way to review po so the third question really asked you to
[00:01:47] the third question really asked you to think about proximal policy optimization
[00:01:49] think about proximal policy optimization which was something that you implemented
[00:01:51] which was something that you implemented and one thing that might have been
[00:01:53] and one thing that might have been slightly confusing or a good thing to
[00:01:54] slightly confusing or a good thing to refresh is that we really emphasized in
[00:01:57] refresh is that we really emphasized in class that PO allowed us to use data and
[00:02:00] class that PO allowed us to use data and multiple gradient steps and when it made
[00:02:02] multiple gradient steps and when it made multiple gradient steps um those would
[00:02:04] multiple gradient steps um those would be off policy but the very first step
[00:02:07] be off policy but the very first step that PPO makes is always on policy so
[00:02:11] that PPO makes is always on policy so this is
[00:02:14] this is true because if you've just gotten the
[00:02:17] true because if you've just gotten the data and then you're doing a policy
[00:02:18] data and then you're doing a policy gradient stuff on that that part is
[00:02:20] gradient stuff on that that part is considered off on policy after that
[00:02:22] considered off on policy after that you're you trying to take a further step
[00:02:25] you're you trying to take a further step so if your if we had sort of a
[00:02:26] so if your if we had sort of a one-dimensional policy your first step
[00:02:28] one-dimensional policy your first step is going to be on policy and then any
[00:02:30] is going to be on policy and then any further steps you take are now going to
[00:02:31] further steps you take are now going to be off policy using data that you
[00:02:34] be off policy using data that you collected from the previous round so I
[00:02:37] collected from the previous round so I know that that was often a good thing to
[00:02:39] know that that was often a good thing to make sure to refresh and the second part
[00:02:42] make sure to refresh and the second part was F we do not have any guarantees on B
[00:02:45] was F we do not have any guarantees on B and the third part is
[00:02:47] and the third part is true and we want to emphasize here that
[00:02:49] true and we want to emphasize here that we're only doing important sampling over
[00:02:51] we're only doing important sampling over the actions what po does and what some
[00:02:54] the actions what po does and what some of the other algorithms it was inspired
[00:02:56] of the other algorithms it was inspired by do is that they don't try to directly
[00:02:58] by do is that they don't try to directly handle the state distri distribution
[00:03:00] handle the state distri distribution mismatch and instead they try to create
[00:03:02] mismatch and instead they try to create a new policy that's close enough that
[00:03:05] a new policy that's close enough that they hope that um the fact that you're
[00:03:07] they hope that um the fact that you're going to be visiting different states
[00:03:08] going to be visiting different states under a new policy it's only going to be
[00:03:10] under a new policy it's only going to be slightly
[00:03:12] slightly observed okay and the last is that you
[00:03:15] observed okay and the last is that you can use lots of different types of
[00:03:17] can use lots of different types of advantage estimators and so the um D is
[00:03:19] advantage estimators and so the um D is not true you could use generalized um
[00:03:22] not true you could use generalized um Advantage estimation but you could also
[00:03:23] Advantage estimation but you could also use other methods as
[00:03:25] use other methods as well and throughout this if anybody has
[00:03:27] well and throughout this if anybody has any questions feel free to ask me
[00:03:30] any questions feel free to ask me so the fourth question was given by Our
[00:03:33] so the fourth question was given by Our Guest lecture um for those of you that
[00:03:35] Guest lecture um for those of you that had a chance to attend or to to watch it
[00:03:37] had a chance to attend or to to watch it later um we he talked a lot Dan talked a
[00:03:41] later um we he talked a lot Dan talked a lot about thinking about the alignment
[00:03:43] lot about thinking about the alignment problem and thinking about what things
[00:03:45] problem and thinking about what things are important for that um the first part
[00:03:48] are important for that um the first part is not true it's generally um hard to
[00:03:51] is not true it's generally um hard to think about sort of there's different
[00:03:52] think about sort of there's different ways to think about autonomy but that
[00:03:54] ways to think about autonomy but that was not what we were focused on the
[00:03:56] was not what we were focused on the second one is true so one of the things
[00:03:58] second one is true so one of the things Dan talked about in his lecture was the
[00:04:01] Dan talked about in his lecture was the fact that often when we think about
[00:04:02] fact that often when we think about preferences and Alignment we're often
[00:04:05] preferences and Alignment we're often are focusing on people's individual
[00:04:06] are focusing on people's individual preferences like someone says they like
[00:04:08] preferences like someone says they like option one instead of option two but
[00:04:11] option one instead of option two but that focuses really on the utility to a
[00:04:13] that focuses really on the utility to a single individual as opposed to the
[00:04:15] single individual as opposed to the implications for the broader society and
[00:04:18] implications for the broader society and so the second one is true because as Dan
[00:04:20] so the second one is true because as Dan brought up moral theories give us a way
[00:04:22] brought up moral theories give us a way to think about more broad benefits to
[00:04:25] to think about more broad benefits to society and to collections of
[00:04:26] society and to collections of individuals instead of to
[00:04:28] individuals instead of to individuals he also talked a lot about
[00:04:31] individuals he also talked a lot about how autonomy is often a core principle
[00:04:33] how autonomy is often a core principle when we think about um the value of
[00:04:37] when we think about um the value of different decisions we can make and so
[00:04:40] different decisions we can make and so the idea of an II agent to um uh to
[00:04:46] the idea of an II agent to um uh to allowing people to have some autonomy
[00:04:47] allowing people to have some autonomy would say that an AI agent um that
[00:04:51] would say that an AI agent um that thinks about someone's suboptimal
[00:04:53] thinks about someone's suboptimal decision so it might be that you know
[00:04:55] decision so it might be that you know somebody really wants to do something
[00:04:56] somebody really wants to do something that we know is not very good for them
[00:04:58] that we know is not very good for them that an AI agent that aligned and allows
[00:05:00] that an AI agent that aligned and allows that person autonomy would still maybe
[00:05:02] that person autonomy would still maybe support that um because in the interest
[00:05:05] support that um because in the interest of sort of upwe the degree of
[00:05:08] of sort of upwe the degree of autonomy and the last one is also
[00:05:12] autonomy and the last one is also true because you could think of this as
[00:05:15] true because you could think of this as a form of paternalism so if the agent
[00:05:17] a form of paternalism so if the agent decides what's not really a good idea
[00:05:19] decides what's not really a good idea for you to smoke and so I'm not going to
[00:05:21] for you to smoke and so I'm not going to tell you where you can buy cigarettes um
[00:05:23] tell you where you can buy cigarettes um that may or may not be true of course we
[00:05:25] that may or may not be true of course we know that smoking is causally associated
[00:05:27] know that smoking is causally associated with lung cancer so
[00:05:30] with lung cancer so you could imagine that case it's not in
[00:05:31] you could imagine that case it's not in their best interest but it that would be
[00:05:33] their best interest but it that would be considered a form of paternalism and so
[00:05:35] considered a form of paternalism and so that would undermine user
[00:05:37] that would undermine user autonomy yeah I don't know if I agree
[00:05:39] autonomy yeah I don't know if I agree with see even with the explanation
[00:05:41] with see even with the explanation because it seems to me that best
[00:05:43] because it seems to me that best interest and suboptimal decisions are
[00:05:45] interest and suboptimal decisions are like
[00:05:47] like definitionally well best interest entail
[00:05:50] definitionally well best interest entail like Optimal decisions and you're saying
[00:05:52] like Optimal decisions and you're saying we should let User make a suboptimal
[00:05:55] we should let User make a suboptimal decisions I don't see how those are if
[00:05:58] decisions I don't see how those are if the it is in fact in the best interest
[00:06:00] the it is in fact in the best interest of the user to make decisions then those
[00:06:02] of the user to make decisions then those decisions are no longer suboptimal which
[00:06:04] decisions are no longer suboptimal which would be a contradiction it seems to me
[00:06:06] would be a contradiction it seems to me I OB you have that F it's an interesting
[00:06:09] I OB you have that F it's an interesting question right so I think what it says
[00:06:10] question right so I think what it says here is that there is different no
[00:06:12] here is that there is different no Notions of what the objective is um and
[00:06:14] Notions of what the objective is um and there are different Notions of what is
[00:06:15] there are different Notions of what is considered optimal or not optimal so
[00:06:17] considered optimal or not optimal so there might be some cases where for
[00:06:20] there might be some cases where for general population or even for humans in
[00:06:22] general population or even for humans in general there is one part of your reward
[00:06:24] general there is one part of your reward function that says this particular
[00:06:25] function that says this particular decision like smoking cigarettes is not
[00:06:27] decision like smoking cigarettes is not considered to be optimal because of long
[00:06:29] considered to be optimal because of long Health outcomes however you might have
[00:06:32] Health outcomes however you might have another part of your reward function
[00:06:33] another part of your reward function that talks about the importance of user
[00:06:35] that talks about the importance of user autonomy and so if you value user
[00:06:38] autonomy and so if you value user autonomy higher than say um someone's
[00:06:42] autonomy higher than say um someone's Health perhaps in this particular
[00:06:43] Health perhaps in this particular instance for for that particular
[00:06:44] instance for for that particular constraint then in that case you might
[00:06:46] constraint then in that case you might say well if I'm supporting um that
[00:06:49] say well if I'm supporting um that person so if the user's best interest if
[00:06:51] person so if the user's best interest if the best interest of the user is to more
[00:06:54] the best interest of the user is to more value their ability to have autonomy
[00:06:56] value their ability to have autonomy than for them to make this particular
[00:06:58] than for them to make this particular Health decision then you would give them
[00:07:00] Health decision then you would give them the information about where cigarettes
[00:07:02] the information about where cigarettes are yeah yeah I was also like just the
[00:07:06] are yeah yeah I was also like just the first Clause since is in the user's best
[00:07:08] first Clause since is in the user's best interest I thought like it's really hard
[00:07:12] interest I thought like it's really hard to generalize about what a single user's
[00:07:14] to generalize about what a single user's best interest was and so that was not a
[00:07:16] best interest was and so that was not a true statement and like by the start
[00:07:19] true statement and like by the start because maybe some people aren't best
[00:07:21] because maybe some people aren't best making their own decisions about things
[00:07:23] making their own decisions about things and so I wasn't sure how confidently you
[00:07:26] and so I wasn't sure how confidently you could say that yeah it's interesting
[00:07:28] could say that yeah it's interesting question so what Dan argued is that it
[00:07:30] question so what Dan argued is that it is generally a principle that everyone
[00:07:34] is generally a principle that everyone does needs some amount of autonomy and
[00:07:36] does needs some amount of autonomy and so if you go with that argument then you
[00:07:38] so if you go with that argument then you would say if we believe that it is
[00:07:41] would say if we believe that it is important for all of us to have some
[00:07:42] important for all of us to have some autonomy then under that that should
[00:07:44] autonomy then under that that should also allow us the freedom to make bad
[00:07:46] also allow us the freedom to make bad decisions some of the time and in that
[00:07:48] decisions some of the time and in that case an llm that is supporting us also
[00:07:50] case an llm that is supporting us also needs to be able to respect those bad
[00:07:52] needs to be able to respect those bad decisions and you could disagree with
[00:07:54] decisions and you could disagree with this you could disagree with like that
[00:07:55] this you could disagree with like that as a premise in terms of the type of
[00:07:59] as a premise in terms of the type of theories that promote that everyone
[00:08:00] theories that promote that everyone should have autonomy you know and we
[00:08:01] should have autonomy you know and we give different people in society
[00:08:03] give different people in society different amounts of autonomy children
[00:08:04] different amounts of autonomy children generally have less than you know adults
[00:08:07] generally have less than you know adults um but but if you assume that that's the
[00:08:09] um but but if you assume that that's the case as long as you assume that like
[00:08:11] case as long as you assume that like it's always important for um every
[00:08:13] it's always important for um every individual to have some amount of
[00:08:14] individual to have some amount of autonomy that would include allowing
[00:08:16] autonomy that would include allowing them to make bad decisions sometimes our
[00:08:19] them to make bad decisions sometimes our justification like that's not something
[00:08:21] justification like that's not something you can assume how would that I don't
[00:08:24] you can assume how would that I don't know how that exactly was I didn't gr
[00:08:26] know how that exactly was I didn't gr this particular question but um you can
[00:08:27] this particular question but um you can definitely like see what they what they
[00:08:29] definitely like see what they what they in terms of that and yeah we will look
[00:08:30] in terms of that and yeah we will look at everyone's justification in terms of
[00:08:33] at everyone's justification in terms of that good
[00:08:36] that good questions all
[00:08:39] questions all right the next one that I wanted to go
[00:08:41] right the next one that I wanted to go through was Monte Carlo treesearch um so
[00:08:44] through was Monte Carlo treesearch um so this is another one that uh people um
[00:08:47] this is another one that uh people um there was a little bit of um differences
[00:08:50] there was a little bit of um differences over in terms of whether this was
[00:08:52] over in terms of whether this was something people had some questions on
[00:08:53] something people had some questions on so the first one is
[00:08:56] so the first one is true so Monte Carlo Tre search um the
[00:09:00] true so Monte Carlo Tre search um the MCTS the m in the Monte trech stands for
[00:09:02] MCTS the m in the Monte trech stands for Monty not for marov so um it and the way
[00:09:07] Monty not for marov so um it and the way that we described it in class you can
[00:09:08] that we described it in class you can use it in both so what we do in Monte
[00:09:10] use it in both so what we do in Monte Carlo Tre search is we sample from the
[00:09:12] Carlo Tre search is we sample from the Dynamics model to get to a next state
[00:09:15] Dynamics model to get to a next state and as long as we can sample from that
[00:09:17] and as long as we can sample from that whether that's a Markoff model or if you
[00:09:19] whether that's a Markoff model or if you required all of the history so far in
[00:09:20] required all of the history so far in the tree to make that dynamics that
[00:09:22] the tree to make that dynamics that would be okay so um there's not
[00:09:24] would be okay so um there's not something inherent in Marty Carlo
[00:09:26] something inherent in Marty Carlo research that means you always have to
[00:09:27] research that means you always have to have a Markoff system
[00:09:30] have a Markoff system um the second is true so the way that
[00:09:33] um the second is true so the way that Monte Carlo research uses sampling is it
[00:09:35] Monte Carlo research uses sampling is it samples the next state and what it does
[00:09:38] samples the next state and what it does there is it means that instead of having
[00:09:39] there is it means that instead of having to enumerate all possible States you can
[00:09:42] to enumerate all possible States you can just sample a subset of them and still
[00:09:44] just sample a subset of them and still get an accurate estimation of the um uh
[00:09:47] get an accurate estimation of the um uh uh
[00:09:49] uh expectation and the fourth is
[00:09:53] expectation and the fourth is false so this is not true because um in
[00:09:57] false so this is not true because um in this case like in a lot of settings
[00:09:59] this case like in a lot of settings including Alpha go the reward model is
[00:10:02] including Alpha go the reward model is known so we're not trying to learn the
[00:10:03] known so we're not trying to learn the reward model but upper competence bounds
[00:10:06] reward model but upper competence bounds are still useful because they allow us
[00:10:08] are still useful because they allow us to prioritize among the actions because
[00:10:11] to prioritize among the actions because ultimately we want to be thinking about
[00:10:12] ultimately we want to be thinking about taking Maxes and so in these cases you
[00:10:16] taking Maxes and so in these cases you um upper confidence bounds like upper
[00:10:17] um upper confidence bounds like upper confidence bound trees are using UCB to
[00:10:21] confidence bound trees are using UCB to sort of actively think of how we're
[00:10:24] sort of actively think of how we're expanding out the tree and then the
[00:10:26] expanding out the tree and then the fourth one is also true because that's
[00:10:28] fourth one is also true because that's exactly what Alpha zero does is
[00:10:32] exactly what Alpha zero does is they use selfplay to improve uh the
[00:10:36] they use selfplay to improve uh the network to predict values and action
[00:10:37] network to predict values and action probabilities yeah the is it's false
[00:10:42] probabilities yeah the is it's false it's false it's false yeah yeah I'm a
[00:10:45] it's false it's false yeah yeah I'm a bit confused about the force wording it
[00:10:48] bit confused about the force wording it says Monte card research is used in
[00:10:51] says Monte card research is used in Alpha zero but doesn't Alpha Z use a
[00:10:53] Alpha zero but doesn't Alpha Z use a different variant of a research that is
[00:10:56] different variant of a research that is not mon I am a bit confused Al uses
[00:10:59] not mon I am a bit confused Al uses Monti car research it uses Monti car
[00:11:01] Monti car research it uses Monti car research with selfplay to train a
[00:11:03] research with selfplay to train a network that predicts values and action
[00:11:05] network that predicts values and action probabilities that's part of what it
[00:11:06] probabilities that's part of what it does wasn't there a different thing that
[00:11:10] does wasn't there a different thing that had the
[00:11:11] had the confidence for oh upper confidence trees
[00:11:14] confidence for oh upper confidence trees yeah yeah so it also well it uses a
[00:11:17] yeah yeah so it also well it uses a particular form of upper competence
[00:11:19] particular form of upper competence trees as well so upper competence trees
[00:11:20] trees as well so upper competence trees is a type of Monti car tree search it's
[00:11:22] is a type of Monti car tree search it's like Monty research is a super set of
[00:11:25] like Monty research is a super set of ucct
[00:11:34] Okay, let's go through these two, because all of these are true, and I think this is a useful one to go through as well. ChatGPT did learn rewards from humans providing preferences over prompt-output pairs and then used PPO to train a better policy; in fact they used rankings, but they did do this. In general, for long-horizon problems with really large action spaces, a full forward search would be really expensive to do, so using something like AlphaZero, which essentially builds a subset of the search tree, can be really helpful. In PAC (probably approximately correct) methods, we are guaranteed to learn an epsilon-optimal policy, but the epsilon might be nonzero, which means we're not guaranteed to learn an optimal policy. So if you're okay with your kitchen being slightly messy, which I am, then it would be okay to use a PAC RL algorithm that would make a finite number of mistakes but most of the time would keep your kitchen pretty neat, maybe not perfectly neat. And then offline RL may be particularly beneficial for healthcare and other high-stakes settings where online exploration might be risky or very expensive. So in this case, all of these were true.
[00:12:54] All right, and then the next one I was going to talk about is nine, where we go through some of the theoretical properties. In this case, the first one is not true, because we don't have any guarantees in general for the REINFORCE algorithm. The second one is true for the reasons we just said: a PAC algorithm guarantees you're epsilon-optimal, but not necessarily fully optimal. In the third case, it is not guaranteed to have sublinear regret, so this is false, and the reason again is that if you just get an epsilon-optimal policy, you might keep making epsilon-sized mistakes for the rest of time, like epsilon times T, which would still give you linear regret. The fourth is also false: in general, you can think of minimizing regret, which is the difference between your policy's performance and the optimal policy's, or maximizing expected cumulative reward, and they're just the same thing, one subtracted from the other; you either maximize cumulative reward or minimize regret. And then in the last case, this will not necessarily be a PAC algorithm, so the only one here that is true is B. The reason is that a PAC algorithm has to make a finite number of mistakes, and that number normally has to be polynomial; this algorithm would be consistently converging to the optimal policy, but you don't necessarily know how long it would take, so it could be very expensive.
[00:14:37] it could be very expensive and then um the final question
[00:14:41] expensive and then um the final question has us think
[00:14:43] has us think about which algorithms could generate
[00:14:46] about which algorithms could generate The observed reward um raise your hand
[00:14:49] The observed reward um raise your hand if anybody wants me to go through that
[00:14:50] if anybody wants me to go through that I'm happy to and step through it or
[00:14:52] I'm happy to and step through it or otherwise we'll move on to the next p
[00:14:54] otherwise we'll move on to the next p allowed people to just think about how
[00:14:57] allowed people to just think about how different algorithms would run and
[00:14:58] different algorithms would run and whether or not not they could generate
[00:14:59] whether or not not they could generate this observe
[00:15:05] data all right well we'll release the
[00:15:07] data all right well we'll release the solutions for the quiz um over the next
[00:15:09] solutions for the quiz um over the next day and we'll also release the solution
[00:15:11] day and we'll also release the solution um the
[00:15:22] [Student question, partially inaudible] ...mistakes, or is it that your total mistakes should be within some finite number of mistakes?
[00:15:30] So, great question. In terms of PAC, what we normally require is that, with high probability, on all but a finite number of steps you will be epsilon-optimal, and that finite number needs to be a polynomial function of your problem parameters, including things like one over epsilon, the size of your state space, the size of your action space, etc.
[00:15:52] It doesn't always tell you when those mistakes will occur; they may be at the beginning or they may be later.
[00:15:58] beginning or they may be later I was thinking if you perhaps
[00:16:03] it still be fine as long as there some
[00:16:12] is yeah I mean you you certainly could
[00:16:14] is yeah I mean you you certainly could do that it wouldn't be pack unless you
[00:16:16] do that it wouldn't be pack unless you could guarantee with high probability
[00:16:17] could guarantee with high probability that that total number of mistakes would
[00:16:18] that that total number of mistakes would be small yeah it's a good question one
[00:16:21] be small yeah it's a good question one thing some of the work that we have done
[00:16:23] thing some of the work that we have done in the past is you may or may not know
[00:16:25] in the past is you may or may not know what Epsilon you want to commit to in
[00:16:26] what Epsilon you want to commit to in advance and so we've also developed
[00:16:27] advance and so we've also developed algorithms where you could think of this
[00:16:30] algorithms where you could think of this occurring for different epsilons maybe
[00:16:31] occurring for different epsilons maybe as you have different amounts of budget
[00:16:33] as you have different amounts of budget you might want to be able to pick
[00:16:34] you might want to be able to pick Epsilon because if you get a lot of data
[00:16:35] Epsilon because if you get a lot of data maybe you can get more
[00:16:38] maybe you can get more optimal question anybody else have
[00:16:39] optimal question anybody else have questions about the
[00:16:44] All right. We have normal office hours this week, so feel free to come by. Next week we will not have office hours anymore, but if you have any questions about the quiz or about your projects, feel free to come see us.
[00:17:02] All right. I think it's always exciting to go back to the beginning of the quarter and think back on all the things we've covered, as well as looking forward in terms of the field. When we started, the very first lecture slide might look somewhat familiar: we talked about how reinforcement learning is fundamentally the question of learning through experience to make good decisions in order to optimize long-term reward. That's really the central question it tries to answer.
[00:17:30] We talked about there being a number of different learning objectives in the course. What I hope people will walk away from this class with is: to understand the key features of reinforcement learning and how it differs from supervised learning, AI planning, unsupervised learning, and a lot of other areas; to understand, given an application problem, whether and how you should use RL for it and which algorithms might be appropriate; to be able to implement and code RL algorithms, which you have had lots of practice with; and to understand how we compare and contrast what it means to have a good RL algorithm, and the ways we should evaluate algorithms themselves to see whether we're making progress, thinking about things like regret, sample complexity, computational complexity, empirical performance, whether it converges to the optimal policy, and whether it converges at all.
[00:18:21] And then also to understand the exploration-exploitation challenge in terms of data collection: the fundamental tension between the data we gather allowing us to learn things about the environment and about different decision policies, versus using that information to actually obtain high rewards. Throughout the course you've had a chance to think about this on the quiz, on the midterm, on the homeworks, and now also in your final project.
[00:18:47] then now also in your final project so what I'd like to do now is to
[00:18:49] project so what I'd like to do now is to sort of again revisit um this second
[00:18:52] sort of again revisit um this second question because I think really as you
[00:18:53] question because I think really as you go forward this will be um when you use
[00:18:55] go forward this will be um when you use reinforcement learning this is going to
[00:18:56] reinforcement learning this is going to be one of the things that you'd
[00:18:58] be one of the things that you'd constantly have to do which is decide
[00:18:59] constantly have to do which is decide for any new problem you're looking at is
[00:19:01] for any new problem you're looking at is it appropriate to think about
[00:19:02] it appropriate to think about reinforcement learning as a tool to help
[00:19:04] reinforcement learning as a tool to help you solve that
[00:19:06] you solve that problem and so I think to do that it's
[00:19:08] problem and so I think to do that it's helpful to go back to the motivating
[00:19:10] helpful to go back to the motivating domains from the first lecture so that
[00:19:15] domains from the first lecture so that so three of the domains we talked about
[00:19:17] so three of the domains we talked about a number of different domains throughout
[00:19:18] a number of different domains throughout the class but here are three of the
[00:19:19] the class but here are three of the domains that we talked about on the
[00:19:20] domains that we talked about on the first lecture so the first one is Alpha
[00:19:24] first lecture so the first one is Alpha tensor this is Alpha tensor
[00:19:31] In AlphaTensor the idea was to figure out a more effective algorithm for multiplying matrices, and the amazingly beautiful thing they did in that case is that they are actually doing reinforcement learning to learn algorithms, which I still think is really incredible and extremely creative. What they want to think about is: if you want to multiply two matrices, this is just 2x2 but they go beyond that, how should we operationalize the particular products and sums we're doing in order to reduce the amount of computation needed to accomplish this correctly?
[00:20:09] What the researchers at DeepMind were thinking about is a common task that comes up everywhere: we multiply matrices all the time, for almost all of AI and machine learning, so underneath we're constantly paying that cost. They asked: can we essentially invent better algorithms for some of these really basic substructures? I think that was really exciting, and this is one of the domains where you now have some of the tools to apply the same types of algorithms they used to solve it.
[00:20:41] Bless you. So what I'm going to do right now is revisit these three domains and ask you to think about how, given what you know now, you would formulate them. Some of them I've talked about a little more or a little less, but I'll first give you a quick refresher so you can think about how you might formulate each one.
[00:20:59] what you know um you might for this so the second one was plasma
[00:21:02] this so the second one was plasma control and this is much more of a like
[00:21:05] control and this is much more of a like a controls like more like the mojoko
[00:21:07] a controls like more like the mojoko type of task that we saw where they're
[00:21:09] type of task that we saw where they're trying to manipulate and control these
[00:21:11] trying to manipulate and control these different plasma and they want to think
[00:21:13] different plasma and they want to think about a control policy to allow you to
[00:21:15] about a control policy to allow you to achieve different types of
[00:21:17] achieve different types of configurations and then the third one
[00:21:19] configurations and then the third one was thinking about how do we figure out
[00:21:21] was thinking about how do we figure out who to test given finite resources so
[00:21:25] who to test given finite resources so this is for covid testing
[00:21:29] this is for covid testing so if you have a bunch of people coming
[00:21:30] so if you have a bunch of people coming off an airplane and you have a finite
[00:21:32] off an airplane and you have a finite number of tests who can you test to
[00:21:35] number of tests who can you test to better understand who might be sick and
[00:21:37] better understand who might be sick and restrict the like restrict the spread of
[00:21:40] restrict the like restrict the spread of covid and this is a process that's
[00:21:42] covid and this is a process that's happening you know every day as people
[00:21:45] happening you know every day as people were flying into Greece into different
[00:21:47] were flying into Greece into different airports and then you would send off
[00:21:49] airports and then you would send off those samples for to labs and then a few
[00:21:52] those samples for to labs and then a few days later you would get the results and
[00:21:53] days later you would get the results and those people that you asked to test
[00:21:55] those people that you asked to test could come out of
[00:21:56] could come out of quarantine and so what I'd like you to
[00:21:58] quarantine and so what I'd like you to do now and I've posted a poll for this
[00:22:00] do now and I've posted a poll for this is to think about the following so I'll
[00:22:02] is to think about the following so I'll I'll label those as well this is the
[00:22:04] I'll label those as well this is the alpha
[00:22:06] tensor this is
[00:22:09] tensor this is plasma and this is co
[00:22:13] plasma and this is co testing if you can go on the poll and
[00:22:15] testing if you can go on the poll and say which domain are you choosing is it
[00:22:17] say which domain are you choosing is it a bandit is it a multi-step RL problem
[00:22:19] a bandit is it a multi-step RL problem what type of problem is this what
[00:22:21] what type of problem is this what setting are we in is the problem an
[00:22:23] setting are we in is the problem an offline setting or an online setting or
[00:22:25] offline setting or an online setting or some
[00:22:26] some combination what do you think the state
[00:22:28] combination what do you think the state date action and rewards might be and
[00:22:30] date action and rewards might be and what algorithms do you think you would
[00:22:32] what algorithms do you think you would use to try to tackle this
[00:22:36] use to try to tackle this problem and we'll take a few minutes to
[00:22:38] problem and we'll take a few minutes to go through
[00:22:57] that
[00:25:34] All right, we'll give a few more minutes and then share some of our thoughts.
[00:26:09] One good thing to think about in this, too: are there problems with distribution shift that might come up? Are there cases where we'd want to be conservative with respect to the results being generalized, or do we not have to worry too much about distribution shift in these cases? Could there be unsafe or too-risky states, or things like that?
[00:26:37] So I think we actually have a nice breakdown. Raise your hand if you did plasma. Is there someone else? Oh, we do have another person doing plasma, but they are remote. Now raise your hand if you did COVID. Okay, maybe you want to go near those folks so you can all compare your answers. And did you both do AlphaTensor? Okay, perfect. Why don't we take a minute to talk to your neighbor, and I'll also come around and see if you came up with the same formulation.
[00:27:42] [Paired discussion, mostly inaudible. Audible fragments: a reward proposed as the total number of positive test cases per country; whether the COVID setting is closer to offline, or a batch setting with delayed decisions; that the system has features of people, so not everyone gets the same treatment; and that a traveler's destination matters, e.g., someone going to a farm matters less than someone going to a city.]
[00:34:18] Let's come back together. I really like these domains; I think they're really interesting to think about in terms of the implications for how we model them, the choices we have to make, and the algorithms that would work. Let's go through some of them, because a lot of people picked different ones, and there are also some people watching remotely, so we have more answers as well.
[00:34:44] It's really interesting to see the perspectives. Very few people mentioned Monte Carlo tree search, but actually for AlphaTensor it is something, and maybe the "Alpha" should hint at that, like AlphaZero, etc. They are using reinforcement learning and policy networks, but they're also combining that with AlphaZero-like technology: they use Monte Carlo tree search in this case.
[00:35:12] So this is a reinforcement learning problem. It's multi-step, because the idea is that you want to take a series of steps until you solve the multiplication problem, and you can think of the steps as algorithmic steps. If we go back to here, and make this big for a second: when you do matrix multiplication you have A1*B1 + A2*B3, A1*B2 + A2*B4, etc. There are all these different products and sums, you could do them in different orders, and you can kind of refactor that.
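Refactoring those products and sums is exactly where the savings come from. The classic hand-designed example is Strassen's scheme, which multiplies 2x2 matrices using 7 scalar multiplications instead of the naive 8; AlphaTensor searches for decompositions of this kind. This snippet just checks Strassen's known scheme, not anything AlphaTensor itself discovered:

```python
def strassen_2x2(A, B):
    """Multiply two 2x2 matrices with 7 scalar multiplications (Strassen)."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    # Seven products replace the naive eight.
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    # Recombine with additions only.
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4, m1 - m2 + m3 + m6]]
```

Applied recursively to block matrices, this drops the asymptotic cost below cubic, which is why shaving even one multiplication from a small scheme matters.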
[00:35:52] refactor that so when you think about sort of all the operations you could do
[00:35:56] sort of all the operations you could do um you're going to have to do a ser
[00:35:57] um you're going to have to do a ser series of operations such that remember
[00:36:00] series of operations such that remember what they're trying to learn here is not
[00:36:01] what they're trying to learn here is not how to multiply two particular ones but
[00:36:03] how to multiply two particular ones but they're trying to learn the algorithm
[00:36:05] they're trying to learn the algorithm that will always correctly solve in the
[00:36:07] that will always correctly solve in the minimum number of steps so here their
[00:36:09] minimum number of steps so here their reward function is the number of steps
[00:36:11] reward function is the number of steps the number of um computations that you
[00:36:13] the number of um computations that you have to do and of course it has to be
[00:36:15] have to do and of course it has to be correct and to me one of the Brilliance
[00:36:17] correct and to me one of the Brilliance of their ideas is how do you make sure
[00:36:19] of their ideas is how do you make sure that you're only searching within the
[00:36:21] that you're only searching within the space of correct algorithms um and so
[00:36:23] space of correct algorithms um and so there's some really nice properties for
[00:36:25] there's some really nice properties for this particular problem that allowed
[00:36:26] this particular problem that allowed them to do that so other people had also
[00:36:28] them to do that so other people had also noticed this in the past and then what
[00:36:30] noticed this in the past and then what they said is oh given that we have that
[00:36:32] they said is oh given that we have that given that we have a way to verify that
[00:36:35] given that we have a way to verify that you know and only search in the
[00:36:36] you know and only search in the algorithms that are correct now what we
[00:36:38] algorithms that are correct now what we could do is just optimize for length and
[00:36:40] could do is just optimize for length and so the way that they do that in this
[00:36:42] so the way that they do that in this case is they're going to very similar in
[00:36:44] case is they're going to very similar in certain ways to Alpha zero be able to
[00:36:47] certain ways to Alpha zero be able to search through um different uh using
[00:36:50] search through um different uh using like policy networks and and value
[00:36:52] like policy networks and and value Network so you can see here they have a
[00:36:53] Network so you can see here they have a neural network with both a policy head
[00:36:55] neural network with both a policy head and a value head similar to what we saw
[00:36:57] and a value head similar to what we saw for Al zero but they are going to do
[00:36:59] for Al zero but they are going to do this forward search now one of the
[00:37:01] this forward search now one of the interesting things about this is that
[00:37:04] interesting things about this is that compared to what we saw for Alpha go in
[00:37:06] compared to what we saw for Alpha go in Alpha go because I saw some of we talked
[00:37:08] Alpha go because I saw some of we talked about this with some of you um and saw
[00:37:11] about this with some of you um and saw that in your
[00:37:12] that in your notes at runtime they're not going to do
[00:37:15] notes at runtime they're not going to do search anymore what they're going to do
[00:37:16] search anymore what they're going to do at this point is they're just trying to
[00:37:18] at this point is they're just trying to find the best possible algorithm and
[00:37:20] find the best possible algorithm and then in the future they're not going to
[00:37:22] then in the future they're not going to do any additional monticolo research
[00:37:24] do any additional monticolo research unlike what we do with playing go
[00:37:26] unlike what we do with playing go because the assumption is at that point
[00:37:28] because the assumption is at that point they have the algorithm and they'll just
[00:37:29] they have the algorithm and they'll just apply it to multiplying so they don't
[00:37:32] apply it to multiplying so they don't continue to do Monte Carlo treesearch
[00:37:34] continue to do Monte Carlo treesearch kind of at runtime this is all something
[00:37:36] kind of at runtime this is all something done just to find that best algorithm so
[00:37:39] done just to find that best algorithm so this is a case where we would have Monte
[00:37:41] this is a case where we would have Monte Carlo tree search and we would also have
[00:37:44] Carlo tree search and we would also have policy
[00:37:45] policy networks policy and value
[00:37:48] networks policy and value Network and where they're sharing again
[00:37:50] Network and where they're sharing again this is sort of you know a single neural
[00:37:52] this is sort of you know a single neural network so you can get shared
[00:37:53] network so you can get shared representations here very similar to
[00:37:55] representations here very similar to Alpha zero um and then they they can
[00:37:58] Alpha zero um and then they they can play in this
[00:38:00] play in this case
[00:38:04] [Student question, partially inaudible, about how they overcome distribution shift here.]
[00:38:09] Ah, so all of the algorithms they search through are correct, so there's no distribution shift in that sense: they will always be correct on a future problem. It's just that they may or may not have found the very most optimal one, so you don't run into the same kind of problem of ending up in different states; the nice thing here is that it's just a series of operations. It may be that the search didn't find the optimal one, so there might still be better, shorter algorithms; I don't think they prove this is a lower bound, to my knowledge. It's a great question: there's not going to be a problem where you deploy this on a new matrix multiplication and get something wrong, it's just that it may or may not be the most optimal way to multiply that particular pair. So I think that's the cleverness of having the space they search over always maintain correctness.
[00:39:10] [Student question, partially inaudible.]
[00:39:13] something it yeah it's a great question so what
[00:39:16] it yeah it's a great question so what you know was there s some like kind of
[00:39:17] you know was there s some like kind of high level insight and in particular
[00:39:18] high level insight and in particular high level Insight you could translate
[00:39:19] high level Insight you could translate to other problems not that I remember um
[00:39:23] to other problems not that I remember um I think that
[00:39:25] I think that they I don't remember there being any
[00:39:27] they I don't remember there being any sort of particular like aha moment now
[00:39:29] sort of particular like aha moment now this means for all these other type of
[00:39:30] this means for all these other type of problems we can do this um it' be
[00:39:33] problems we can do this um it' be interesting to go back to the paper and
[00:39:34] interesting to go back to the paper and see if there's anything that I missed in
[00:39:35] see if there's anything that I missed in that
[00:39:36] that case so I think what they found in this
[00:39:39] case so I think what they found in this case to what I remember is that they
[00:39:40] case to what I remember is that they relearned a couple different well-known
[00:39:43] relearned a couple different well-known algorithms for trying to like during the
[00:39:45] algorithms for trying to like during the search process um they learned a couple
[00:39:48] search process um they learned a couple algorithms that are known to be good and
[00:39:50] algorithms that are known to be good and and more effective and then found some
[00:39:52] and more effective and then found some others that hadn't been discovered
[00:39:54] others that hadn't been discovered before and so I think this is also an
[00:39:55] before and so I think this is also an interesting question because there may
[00:39:56] interesting question because there may be others of utility functions for
[00:39:58] be others of utility functions for Downstream use of these algorithms and
[00:40:00] Downstream use of these algorithms and so in that case you might want these
[00:40:02] so in that case you might want these approaches to sort of provide you a set
[00:40:04] approaches to sort of provide you a set of solutions a set of algorithms and
[00:40:06] of solutions a set of algorithms and then people could pick you know which
[00:40:07] then people could pick you know which ones they thought were
[00:40:09] ones they thought were best all right so this is a you know a
[00:40:12] best all right so this is a you know a multi-step RL
[00:40:14] multi-step RL problem and here the state of the system
[00:40:17] problem and here the state of the system would essentially be what are the
[00:40:20] would essentially be what are the operations you have so
[00:40:21] operations you have so far so like one of the operations that
[00:40:24] far so like one of the operations that um that you've done on the input to
[00:40:26] um that you've done on the input to matrices specified as tensors um and
[00:40:30] matrices specified as tensors um and then how far do you need to go until you
[00:40:31] then how far do you need to go until you can get the complete
[00:40:34] can get the complete solution and the reward in this case
[00:40:37] solution and the reward in this case assuming that you've conditioned
[00:40:38] assuming that you've conditioned everything on being correctness is just
[00:40:41] length all right so next let's go to
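The formulation just described can be sketched concretely. This is an illustrative toy environment, not AlphaTensor's actual code: the class name, the rank-1 action encoding, and the naive 8-term decomposition are stand-ins for the idea that the state is the residual tensor of operations still needed, correctness is enforced by construction, and the reward simply counts steps.

```python
import numpy as np

# Toy sketch of the MDP described above (illustrative, not AlphaTensor's code):
# the state is the residual tensor of work still to be done, each action is one
# rank-1 term u (x) v (x) w, i.e. one scalar multiplication, and the reward is
# -1 per step, so maximizing return minimizes algorithm length. Correctness is
# enforced by only finishing when the residual reaches exactly zero.
class TensorDecompositionEnv:
    def __init__(self, target_tensor):
        self.state = target_tensor.astype(np.int64).copy()  # residual tensor

    def step(self, u, v, w):
        self.state -= np.einsum("i,j,k->ijk", u, v, w)  # apply one operation
        done = not self.state.any()    # solved only when residual is all zeros
        return self.state, -1.0, done  # each multiplication costs -1

# Tensor for 2x2 matrix multiplication: entry [(i,j), (j,k), (i,k)] = 1,
# because C[i,k] accumulates the product A[i,j] * B[j,k].
T = np.zeros((4, 4, 4), dtype=np.int64)
for i in range(2):
    for j in range(2):
        for k in range(2):
            T[i * 2 + j, j * 2 + k, i * 2 + k] = 1

# Play out the naive 8-multiplication algorithm: one rank-1 term per product.
env = TensorDecompositionEnv(T)
basis = np.eye(4, dtype=np.int64)
steps, done = 0, False
for i in range(2):
    for j in range(2):
        for k in range(2):
            _, r, done = env.step(basis[i * 2 + j], basis[j * 2 + k], basis[i * 2 + k])
            steps += 1
print(steps, done)  # 8 True; Strassen's algorithm reaches zero in 7 terms
```

A learned policy would pick the rank-1 terms itself; the point of the encoding is that any sequence of actions that drives the residual to zero is a correct algorithm, so the search can only trade off length, never correctness.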
[00:40:41] All right, so next let's go to learning plasma control for fusion science, and I think this is a really interesting one. I appreciated that I saw a lot of people saying we don't want to do epsilon-greedy on real hardware with plasma (that's probably a bad idea for all of our health), so there needs to be some form of offline phase, and that's exactly right; that's certainly what they did in this case. I think it's interesting to think about how the problem is represented and the different types of controls you'd be applying in this case, which generally will be real-valued, so it's a very different problem than AlphaTensor. Let's look at what their architecture was.

[00:41:23] So one thing that they also really emphasize in this work is that they had to spend quite a long time thinking really carefully about what the objective is. This is an interesting one: it's not just "minimize the number of computations to multiply two matrices"; it's saying we want to be able to manipulate plasma into particular configurations. And you could imagine in this case you might have lots of different reward functions, and you want to be able to quickly learn policies for those. So what they do to ameliorate the offline safety issue is build a simulator. I was just commenting to someone that on a recent panel I was talking to a mechanical engineer who said that's one of the reasons they were really interested in AI and machine learning: they like to make simulators of really computationally expensive physical processes. So here they have a simulator that is fairly high fidelity but not perfect: high fidelity enough that they think it'll be useful, but low fidelity enough that you can do optimization over it. So what they're going to do in this case is solve the offline case by constructing a simulator, not necessarily from data but maybe from a physics model. So we're doing kind of model-based RL, model-based in the sense that we have to have a model or a simulator, but then what they're going to do is an actor-critic method. They are going to do actor-critic in this case, where they have a control policy, the actor, and they're also going to be learning a critic.

[00:42:49] So I thought this was pretty interesting: why they took this particular architecture. I'm just going to read you a little bit about that part; let me go down there. So, a couple of things. One is that they use an actor-critic method that is related to something else we saw but not exactly the same; it's called MPO, and I'll write that out in a second. But one of the things that I thought was interesting is they said: in our simulation period we can do whatever we want, and we can have a really complicated critic; when we are deploying this, it has to be real time. Some other people brought this up when we were chatting about it: this is like self-driving cars, in that you have to have really fast controllers. You can't do Monte Carlo tree search and wait for it to decide; the plasma is going to do something, and so either you're controlling it, or it is doing something else if you're not making an active control. So they needed an actor, a.k.a. a policy, that is really computationally fast. And what they said is that inside of their actor-critic architecture, one of the things they could do during training is require their actor to be pretty low-dimensional, so they have a pretty small network to specify the actor, or the control policy, which is what they're going to eventually deploy; but they could have a really complicated critic. They can leverage the fact that in the offline setting they can specify their value function in a complicated way, with many parameters, because this is all offline. And so this is a nice, interesting asymmetry between computational efficiency and the affordances you have offline compared to online: they have a very complicated critic and a very simple actor, and then they train the actor to find a good point in that policy space using their really complicated critic. And so they said "the representation of the control policy (actor) is restricted as it must run on TCV with real-time guarantees, but the critic is unrestricted." So I thought that was pretty interesting that they had this.

[00:44:42] Now, another thing, and this came up in some conversations: as you might imagine, if we go from offline to online there is always the problem that it might not translate, and again, we're dealing with plasma, so we want some sort of safety guarantees. So here the ideas we've talked about before, about having more trusted regions or having pessimism, come up, and the way that they handle this is by putting it inside of the reward function: they essentially define areas which they think could cause bad outcomes, and then they put that inside the reward function to lead to a policy that veers away from that area. And I think again that's a pretty common idea if you care about safety; this comes up in robotics and other areas too (Claire Tomlin at Berkeley does this, as do a number of others): you put that inside of the reward function so the resulting policy avoids those areas. And here they're doing that not necessarily because reaching that particular part would be bad, but because you're getting close to a part where it might be unsafe or where you don't trust your
[00:45:43] your simulator so let's go back to here so in
[00:45:46] simulator so let's go back to here so in this case it's an actor
[00:45:48] this case it's an actor critic actor
[00:45:51] critic actor critic This is
[00:45:55] complicated this is simple
[00:46:00] this has to be simple for Speed and we
[00:46:02] this has to be simple for Speed and we all do this with a
[00:46:05] all do this with a simulator we put
[00:46:08] simulator we put penalties in the
[00:46:12] reward to
[00:46:17] avoid
[00:46:21] inaccuracies in
[00:46:24] inaccuracies in simulator or
[00:46:27] simulator or let see
[00:46:30] outcomes and so this is very similar to
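The board notes above can be sketched in code. This is a minimal illustration, not the paper's implementation: the network sizes, penalty radius, and reward shape are invented, purely to show the asymmetry (a tiny deployable actor next to a large offline-only critic) and the safety penalty baked into the reward.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, ACTION_DIM = 8, 3  # invented sizes, not the paper's

# Small actor: one linear layer plus tanh, cheap enough to run in a real-time
# control loop. This is the only piece that would ever be deployed.
actor_W = rng.normal(scale=0.1, size=(ACTION_DIM, STATE_DIM))

def actor(state):
    return np.tanh(actor_W @ state)  # bounded, real-valued control commands

# Large critic: a deep MLP with many parameters. It is only evaluated offline,
# during training against the simulator, so its cost never matters at deploy time.
critic_layers = [rng.normal(scale=0.1, size=(256, STATE_DIM))]
critic_layers += [rng.normal(scale=0.1, size=(256, 256)) for _ in range(3)]
critic_layers += [rng.normal(scale=0.1, size=(1, 256))]

def critic(state):
    h = state
    for W in critic_layers[:-1]:
        h = np.maximum(W @ h, 0.0)  # ReLU hidden layers
    return (critic_layers[-1] @ h).item()

def reward(state, target, unsafe_center, unsafe_radius=1.0, penalty=100.0):
    r = -np.linalg.norm(state - target)  # task term: reach the configuration
    # Safety term: a large penalty for entering a region where the simulator is
    # not trusted or the outcome could be bad, so the policy learns to veer away.
    if np.linalg.norm(state - unsafe_center) < unsafe_radius:
        r -= penalty
    return r

s = rng.normal(size=STATE_DIM)
a = actor(s)    # fast: this is all that runs on the real hardware
v = critic(s)   # slow: offline-only, used to improve the actor
print(a.shape, reward(s, np.zeros(STATE_DIM), s) < -99.0)
```

Training would adjust `actor_W` to maximize the critic's estimate of return under the penalized reward; the key design point is that the critic's size never constrains the deployed controller.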
And so this is very similar to the pessimism over the places where we're uncertain, whether because of data sparsity or because of known problems in our simulator.

[00:46:41] [Student:] If you consider reward hacking, or sensitive rewards, how do you avoid that, or how do you double-check? I assume they really don't want to make mistakes here.
[00:46:54] they really don't want to be makes here it's a great question so my guess in
[00:46:55] it's a great question so my guess in this case is it ends up you were just
[00:46:57] this case is it ends up you were just pretty conservative like um I think of
[00:46:59] pretty conservative like um I think of just how how far away now I assume in
[00:47:02] just how how far away now I assume in this case maybe because some of the
[00:47:03] this case maybe because some of the physics simulators that they have access
[00:47:05] physics simulators that they have access to that they could play with some of
[00:47:07] to that they could play with some of sort of saying like if you how negative
[00:47:10] sort of saying like if you how negative do you need to make some of these or how
[00:47:11] do you need to make some of these or how out of bounds or how hard of a
[00:47:13] out of bounds or how hard of a constraint is that so that you could be
[00:47:15] constraint is that so that you could be very confident that before you deploy
[00:47:17] very confident that before you deploy this you make sure that this doesn't
[00:47:18] this you make sure that this doesn't reach there at least in the simulator
[00:47:20] reach there at least in the simulator you could see whether or not you're
[00:47:21] you could see whether or not you're violating those constraints or like if
[00:47:23] violating those constraints or like if you have these penalties if you're
[00:47:24] you have these penalties if you're sufficient not to reach parts of the
[00:47:26] sufficient not to reach parts of the area that you think you might want to
[00:47:28] area that you think you might want to avoid whether that will translate to
[00:47:30] avoid whether that will translate to your real system is is an important
[00:47:32] your real system is is an important question you know so yeah it's a great
[00:47:34] question you know so yeah it's a great it's a great issue of how you now I
[00:47:35] it's a great issue of how you now I think it's it also introduces the really
[00:47:37] think it's it also introduces the really interesting question of whe whether you
[00:47:38] interesting question of whe whether you can verify so there are other methods
[00:47:40] can verify so there are other methods this is not some most of what we've
[00:47:41] this is not some most of what we've talked about is not those but where you
[00:47:43] talked about is not those but where you could verify that you're not going to
[00:47:44] could verify that you're not going to reach unsafe regions and this would
[00:47:46] reach unsafe regions and this would certainly be an area you might want to
[00:47:47] certainly be an area you might want to do
[00:47:48] do that all right the third one was
[00:47:48] All right, the third one was efficient and targeted COVID-19 border testing. Oh, I should have also mentioned: the plasma problem is also a multi-step RL problem. Absolutely, the controls you're doing affect the next state, and that's the whole point; you want to manipulate the plasma into a particular location, so it's definitely a multi-step system.

[00:48:11] This one is thinking about how you do efficient and targeted COVID-19 border testing, and even though it was done via RL, it really is a bandit problem in this case. It's a repeated bandit problem: this is a batch bandit with delayed outcomes.
[00:48:33] So let's make this a little bit bigger. Again, remember what happens in this case: people come in, and Greece has some information about those individuals before they show up. We have finite numbers of tests we can run and process. We have a policy for each individual coming off that plane: whether they're given no test or they're tested. You get the results 24 hours later, and you use that to update your policy. So I think this is a really nice example of this batch bandit process: who you test today does not affect who arrives tomorrow on a plane, so it's a bandit problem. But we have this delayed-outcome problem, that you don't observe the outcomes of who you just tested for a while, which means that algorithms like Thompson sampling may be helpful. And then in addition, one of the other really big challenges in this case is that you have a lot of constraints.
[00:49:28] You have constraints for multiple reasons. We have constraints over the number of tests we can run. You can also have different constraints depending on where you're arriving in Greece and where you can send things, so there are different testing sites which might have different capacities. And in some cases (I don't think they dealt with this in this paper) you might have fairness constraints too: maybe it's best to test all the women, but maybe that's considered unfair. So you may have a number of different constraints that you can think of as restricting your policy class.
[00:49:57] of as restricting your policy class so it's a pretty interesting interaction
[00:50:00] so it's a pretty interesting interaction problem here and also because of the
[00:50:02] problem here and also because of the fact that it's budgeted it means that a
[00:50:04] fact that it's budgeted it means that a lot of your outcomes are coupled in a
[00:50:06] lot of your outcomes are coupled in a way that they might not be so for
[00:50:08] way that they might not be so for example if you give me a test if we only
[00:50:11] example if you give me a test if we only have one test that we can do in this
[00:50:12] have one test that we can do in this room and you give me the test then you
[00:50:13] room and you give me the test then you can't give it to any of you um and so
[00:50:15] can't give it to any of you um and so there's this interaction too in terms of
[00:50:18] there's this interaction too in terms of uh you know the data that we get to
[00:50:20] uh you know the data that we get to observe um for for the right so I think
[00:50:23] observe um for for the right so I think this is a really interesting case and it
[00:50:24] this is a really interesting case and it was really interesting that it ended up
[00:50:26] was really interesting that it ended up having a sign significant benefit one of
[00:50:28] having a sign significant benefit one of the things too that's interesting about
[00:50:30] the things too that's interesting about this is what how we Define the reward
[00:50:32] this is what how we Define the reward one thing that we were talking about in
[00:50:33] one thing that we were talking about in our smaller groups is that really you
[00:50:35] our smaller groups is that really you would like to understand how this is
[00:50:37] would like to understand how this is impacting Downstream Co outcomes and you
[00:50:39] impacting Downstream Co outcomes and you can measure those but you can measure
[00:50:41] can measure those but you can measure those really late like you can use those
[00:50:42] those really late like you can use those as a way to evaluate how effective the
[00:50:44] as a way to evaluate how effective the overall program was but not necessarily
[00:50:46] overall program was but not necessarily a reward you can use to optimize and
[00:50:48] a reward you can use to optimize and that's often a really common challenge
[00:50:50] that's often a really common challenge the rewards you get immediately that you
[00:50:52] the rewards you get immediately that you could use to change your policy may be
[00:50:54] could use to change your policy may be different than the downstream outcome
[00:50:55] different than the downstream outcome you care about and on Friday I was at an
[00:50:58] you care about and on Friday I was at an experimentation workshop at the business
[00:51:00] experimentation workshop at the business school here and I was giving a talk and
[00:51:02] school here and I was giving a talk and I was really excited and interested to
[00:51:04] I was really excited and interested to see how many other people were also
[00:51:05] see how many other people were also thinking of this challenge of short-term
[00:51:07] thinking of this challenge of short-term outcomes versus long-term rewards that
[00:51:10] outcomes versus long-term rewards that you really care about and I think this
[00:51:12] you really care about and I think this comes up a lot in advertising other
[00:51:14] comes up a lot in advertising other areas too companies like Netflix and
[00:51:16] areas too companies like Netflix and Spotify and others we're talking about
[00:51:17] Spotify and others we're talking about this common challenge where you have to
[00:51:19] this common challenge where you have to make policy decisions um or you know
[00:51:22] make policy decisions um or you know policy update your policy way before you
[00:51:24] policy update your policy way before you can maybe observe those outcomes and so
[00:51:26] can maybe observe those outcomes and so if you have to wait a really long time
[00:51:27] if you have to wait a really long time it limits how quickly you can experiment
[00:51:30] it limits how quickly you can experiment and so in this case too you might really
[00:51:32] and so in this case too you might really care about these Downstream ones but one
[00:51:34] care about these Downstream ones but one of the points of this paper was to argue
[00:51:36] of the points of this paper was to argue looking at that lagged information was
[00:51:38] looking at that lagged information was allowing people to make sort of not as
[00:51:39] allowing people to make sort of not as good decisions and so you need these
[00:51:41] good decisions and so you need these sort of shorter term
[00:51:43] sort of shorter term outcomes so you have any questions about
[00:51:45] outcomes so you have any questions about this
[00:51:47] So I encourage you, if you haven't read any of those papers, they're really beautiful papers, if you want to read any of them or all. And then just finally, if you remember all the way back, we talked about ChatGPT at the very beginning of the class, and I think you should feel excited now that you really understand this whole pipeline of what's possible. The first step is training a supervised policy, which we could think of as behavior cloning. The second is doing direct preference elicitation, which we did with pairs, and then doing PPO, and we also of course did DPO as well. So I think now, even though we didn't do it with large language models, you really have a sense of the whole process you could use if you were to train large language models and do that sort of fine-tuning.
[00:52:29] All right, so now we're going to wrap up with some of the main ideas and then look forward. If we think about the main characteristics of reinforcement learning, this idea of learning directly from data to make good decisions, we've been thinking a lot about optimization, delayed consequences, exploration, and generalization. A key thing to remember, if you remember nothing else from this class, is that one of the big differences in reinforcement learning is that in general the actions impact the data distribution: certainly of the rewards we observe, but often also of the states we get to reach. That's very different from supervised or unsupervised learning, where you always see the label, or you just have a static generating distribution of data. So this is both a huge opportunity and a huge challenge, because we have to think a lot more about distribution shift.
[00:53:25] In terms of the standard settings we've seen, we've talked about bandits, where the next state is independent of the prior state and action, as well as general decision processes, where the next state might depend on all the previous actions and states, or might be Markov, so that it only depends on the immediately preceding state and action.
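To make the distinction concrete, here is a toy sketch (hypothetical step functions, not from the lecture) contrasting a bandit, where the next state ignores the current state and action, with a Markov transition, where it depends only on them:

```python
import random

def bandit_step(state, action):
    """Bandit: reward depends on the action, but the next state
    is drawn independently of the current state and action."""
    reward = 1.0 if action == 0 else 0.0
    next_state = random.choice([0, 1])   # independent of (state, action)
    return next_state, reward

def mdp_step(state, action):
    """Markov process: the next state depends only on the current
    state and action, not on the rest of the history."""
    next_state = (state + action) % 3
    reward = 1.0 if next_state == 0 else 0.0
    return next_state, reward
```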
[00:53:44] We've also talked a lot about the online versus offline settings, where either you have only historical data and you're trying to learn better policies from it, or you can actively gather your own data. And I'll highlight that many real-world settings often sit between these two.
[00:54:07] In many cases you might have a large pool of offline data, and then you might be able to get a small amount of new online data. This comes up in robotics, and it comes up in some of our work; we often call it experimental design: you might have offline data, and then you can design an experiment to gather a small amount of new data to try to learn a good decision policy. So in general we can think of this as an entire spectrum between these two extremes.
[00:54:33] Now, what are some of the core ideas we've seen? Of course we've seen a lot of different ideas, but I think it's nice to pop up a level and think about the common themes. Chelsea Finn, who teaches deep RL, also had a really nice slide on this, so I found my thoughts aligning with a number of hers. One thing is just to be really familiar with what happens when we have function approximation, which we're almost always going to need because we want to handle hard, complex problems, combined with off-policy learning, which we often want to do whether we're online or offline. Remember, off-policy learning just means we want to take data generated by one decision policy and use it to reason about how another policy might perform, whether in terms of gradient steps or fully offline learning.
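One classic way to make that concrete is ordinary importance sampling for off-policy evaluation; the sketch below (a hypothetical helper, with policies passed in as probability functions) reweights returns collected under a behavior policy to estimate the value of a different target policy:

```python
def importance_sampling_estimate(trajectories, pi_target, pi_behavior):
    """Ordinary importance sampling estimate of a target policy's value
    from trajectories generated by a different behavior policy.

    Each trajectory is a list of (state, action, reward) tuples;
    pi_target(a, s) and pi_behavior(a, s) give action probabilities.
    """
    total = 0.0
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for s, a, r in traj:
            # Reweight by how much more (or less) likely the target
            # policy was to take this action than the behavior policy
            weight *= pi_target(a, s) / pi_behavior(a, s)
            ret += r
        total += weight * ret
    return total / len(trajectories)
```

Note that the products of ratios blow up as trajectories get longer, which is one concrete face of why off-policy learning is so hard.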
[00:55:21] And this is generally just really hard. You could argue that a huge number of papers in reinforcement learning think about exactly this problem; it's just incredibly hard. The reason is that whenever we have a new policy, we're going to get a new distribution over states, actions, and rewards, which means it may not match our current data: we have a data distribution shift. And the reason we want to use the offline data is that we want to be data efficient. This is true even if you can be online, because, as we saw for things like PPO, if you follow the theory you often have to be incredibly conservative, or else just have bad performance for a very long time. But the problem is that when we combine these two, we're generally going to be doing generalization or extrapolation, and whenever we do that we need to worry that our predictions of how good a policy will be will not match its actual performance. So over and over again we've seen different types of methods for trying to mitigate this.
[00:56:25] In PPO, which is an online method, the way we control this is with clipping: we just can't take too big a step in our gradients, and that helps ensure we limit this extrapolation problem.
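That clipping can be written down in a few lines; here is a sketch of the per-sample clipped surrogate objective, simplified to scalar inputs:

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO's clipped surrogate objective for one (state, action) sample.

    ratio = pi_new(a|s) / pi_old(a|s); clipping removes the incentive
    to push the ratio outside [1 - eps, 1 + eps], limiting how far a
    gradient step can move the policy from the data-collecting policy.
    """
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    # Taking the min makes the bound pessimistic in both directions
    return min(ratio * advantage, clipped * advantage)
```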
[00:56:42] In the DAgger case, we mitigated this by getting more expert labels: we knew there could be a data distribution shift once we started to follow our behavior-cloned policy, so we try to get more labels in the states where we make decisions different from the expert, so that we can cover the distribution of states we reach under the learned policy.
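The DAgger loop just described can be sketched in a few lines; everything here (env_reset, env_step, expert, train) is a placeholder for a real environment, expert labeler, and supervised learner, not code from the course:

```python
def dagger(env_reset, env_step, expert, train, n_iters=5, horizon=20):
    """Minimal DAgger loop (a sketch under the stated assumptions).

    Roll out the current learned policy, ask the expert to label every
    state visited, aggregate the labels, and retrain, so the dataset
    covers the states the learned policy actually reaches.
    """
    dataset = []
    policy = expert           # initialize from the expert (behavior cloning)
    for _ in range(n_iters):
        state = env_reset()
        for _ in range(horizon):
            action = policy(state)                  # roll out learner
            dataset.append((state, expert(state)))  # expert relabels state
            state = env_step(state, action)
        policy = train(dataset)                     # retrain on aggregate
    return policy
```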
[00:57:02] And things like pessimistic Q-learning, which came from my lab, CQL, which came from Berkeley, and MOPO, which came from other colleagues of mine here at Stanford, all introduced pessimism into offline RL, again exactly to limit this extrapolation problem, where you're otherwise overly optimistic about what will happen. I don't think you should view these as the only ways to solve this problem. What they should inspire you to think is: wow, this is a problem that comes up throughout reinforcement learning, and we have some methods for trying to handle it, but it is certainly not a solved problem.
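Those methods implement pessimism in different ways, but the shared idea can be sketched with a simple lower-confidence-bound penalty; this is an illustrative ensemble-disagreement form, not the actual CQL or MOPO objective:

```python
def pessimistic_value(q_samples, penalty_scale=1.0):
    """Pessimistic value estimate from an ensemble of Q estimates.

    Subtract a multiple of the ensemble's disagreement so that actions
    poorly covered by the offline data are not scored over-optimistically.
    """
    n = len(q_samples)
    mean = sum(q_samples) / n
    var = sum((q - mean) ** 2 for q in q_samples) / n
    # High disagreement => large penalty => value is pushed down
    return mean - penalty_scale * var ** 0.5
```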
[00:57:39] Some of the other core ideas we saw a lot were the different ways we can think about the main objects in reinforcement learning: models, values, and policies. Sometimes people ask me, do we really need all of these? Are they all useful ideas? I think some of the application areas we were just going through illustrate why they might all be useful. Models are often easier places to represent uncertainty: if we only have finite data and we're training a value, a model, or a policy, it might often be easiest to represent that uncertainty with a model. Does anyone have an idea of why that might be? Why might it be easier to represent uncertainty for a model rather than, say, a Q-function or a policy?
[00:58:22] You could disagree with me on that too, but I can give you why I think this might be easiest. If we're building just a dynamics model or a reward model, why might that be an easier place for us to represent our uncertainty about how the world works, compared to trying to represent our uncertainty over the Q-function or the policy?

[00:58:53] Is it just because, for example, uncertainty in your policy is uncertainty both about your world and, given your assumptions about the world, about what you think the best action is? So you're dealing with joint uncertainty, whereas the model of the world is a more specified problem: you have just one source of uncertainty.
[00:59:13] Yeah, I think that's a great intuition; that's what I was going for here. To repeat what was said: when you think about policy uncertainty, it combines and wraps up both uncertainty over how the world works and uncertainty over what you should do to make good decisions given that world, and the same goes for the Q-function. There are ways to directly represent your uncertainty over policies and Q-functions, but a model is a prediction problem, and so we have lots of tools from supervised learning, statistics, and data science for modeling our uncertainty when it's just a prediction problem: what state will happen next, or what reward will I get in this state. There's no planning or decision-making yet, it's just prediction, so it's a nice place to leverage the beautiful history of work in all those other fields, instead of having to propagate uncertainty through. So I think this is often an easier place to represent our uncertainty. Of course, there's no free lunch: if we represent it there and we then want our uncertainty over policies and value functions, we still have to propagate it, but it may be easier to represent it there and drive ourselves toward it.
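For a tabular dynamics model, "uncertainty is just a prediction problem" can be made concrete with standard Bayesian counting; this sketch (a hypothetical helper using Dirichlet-style smoothing) maintains a posterior over next-state probabilities for one state-action pair:

```python
def transition_posterior(counts, n_states, prior=1.0):
    """Posterior mean and a simple uncertainty measure for one
    (state, action)'s next-state distribution, from transition counts.

    Because the dynamics model is just a prediction problem, standard
    statistics applies directly: with a symmetric Dirichlet(prior)
    prior, the posterior mean of each next-state probability is
    (count + prior) / (total + n_states * prior).
    """
    total = sum(counts.values())
    denom = total + n_states * prior
    mean = {s: (counts.get(s, 0) + prior) / denom for s in range(n_states)}
    uncertainty = 1.0 / (1.0 + total)   # shrinks as we observe more data
    return mean, uncertainty
```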
[01:00:21] Models are also really useful for things like Monte Carlo tree search: you can use models as simulators, as in the plasma control work, and you may be able to use them as a place to reason about risky domains or to be very data efficient. The Q-function is in some ways the central object of RL, in the sense that it summarizes the performance of your policy, and you can often use it to act directly, because you just take an argmax with respect to the Q-function; so it's a good way to summarize how good things are. And policies are ultimately what we want: we want good decision-making. We often also want to know exactly how good that is, and that's maybe where the Q-function is one particularly nice thing, but ultimately we want to make good decisions in the world.
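Acting via the Q-function really is just an argmax; a one-line sketch:

```python
def greedy_action(q, state, actions):
    """Act greedily with respect to a Q-function: since Q summarizes
    how good each action is in this state, acting is just an argmax."""
    return max(actions, key=lambda a: q(state, a))
```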
[01:01:09] I think another thing that's come up repeatedly is this question of computation versus data efficiency, and one thing that's really useful to remember is that in some cases they are the same. In this class I've often talked as if they're totally different, but in many situations, if you have a simulator, data is the same as computation: you're either using your computation to do more planning, or to get to a better policy before you simulate the next step, or you're just simulating more steps. So when you look at papers, if they have a simulated domain and they're trying to do something really fancy behind the scenes, it's useful to remind yourself that if it was a real problem you wanted to solve, you could either take that same computation and just get maybe 10x more samples, or do 10x more computation between each sample.
[01:02:01] Now, in some other cases we really do have limited data. We just, fortunately, do not have 7 billion people with COVID; there's a finite number of people, and there's a finite number of students. So when you really want to be data efficient, you're often trading that off against computational cost: we're going to try to squeeze everything we can out of the data, and when we do that we often rely on methods that are much more computationally intensive. And also, as you've seen, in some cases you have real constraints on this, like in plasma control, in self-driving cars, or in robotics: there are cases where you have to have fast computation, because otherwise there is a default, a kind of hidden action. You have to make a decision at every time point, and if you're not doing something optimal, something else is happening; there's some default action that's always occurring.
[01:02:54] Now, what are some of the open challenges? I think there are a lot of them. RL is a fascinating area, but it has not yet had the application impact that we've seen in some other areas of AI and engineering, and I think this is for a number of reasons. One of them is that you really want methods that are off the shelf, robust, and reliable, and many RL algorithms have hyperparameters you have to pick, like the learning rate. Some of these are the same as in normal machine learning and others are different, and one of the challenges is that if you're online, even though in our world, when we're doing a homework, you might be able to try different hyperparameters, in a real-world setting like healthcare or customer-facing systems you would just have that one trajectory, that one deployment, and you can't optimize those parameters. So I think there's a real need for automatic hyperparameter tuning and model selection, by which I mean figuring out what architecture to use and even how to write down the problem, and generally robust methods: model selection criteria, the size of your neural network, and so on, and general robustness guarantees that we're not going to suddenly have one run where the performance is really bad.

[01:04:13] The other is that we often need methods that can span this data-versus-computation efficiency tradeoff, and we don't normally have good ways to let a practitioner say how much they care about each. It would be really nice if we could have Pareto frontiers: if this axis is computation and this one is data, you might say, I want methods that are always somehow optimally trading off between those two, and depending on my application area I can pick where I want to be on that curve. And I also think this hybrid offline-online case is a really important one, where many organizations might be willing to do a little bit of additional data collection but not fully online learning.
[01:04:54] learning I think there's also some just really big questions for reinforcement
[01:04:56] really big questions for reinforcement learning we focused a lot on the Markoff
[01:04:58] learning we focused a lot on the Markoff decision process formulation that's
[01:05:00] decision process formulation. That's where it comes from, the 1950s and Bellman. That's how I learned about it, how many people learned it, and it has some really nice intellectual properties, but it is not clear that this is the right way to solve data-driven decision making. This is one framework. I had a professor when I was a grad student who said that the whole world is, you know, a multi-agent, partially observable Markov decision process where you're doing learning, but it doesn't mean you want to solve it like that. And so while in many cases we might be able to model things in these kinds of stochastic Markov decision process ways, that may or may not be the most efficient way to represent the problem. It's just like how you could always represent a bandit as a really complicated RL problem, but if your next states are independent of your previous one, why would you do that? So I think there are some real questions over whether there are better formulations.
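The bandit point can be made concrete. Below is a minimal sketch (the environment, names, and numbers are mine, not from the lecture) showing why the MDP machinery is overkill here: in a one-state MDP the Bellman backup collapses, the greedy policy depends only on immediate reward, and a plain epsilon-greedy bandit algorithm recovers the same answer.

```python
import random

# A bandit is a degenerate MDP: one state, and the next state is independent
# of the action. Value iteration on that MDP reduces to picking the arm with
# the highest mean reward.

def value_iteration_one_state(mean_rewards, gamma=0.9, iters=100):
    """Bellman backups for a single-state MDP: V = max_a [r(a) + gamma * V].
    Since V is the same constant for every action, the greedy policy only
    ever depends on the immediate reward r(a)."""
    v = 0.0
    for _ in range(iters):
        v = max(r + gamma * v for r in mean_rewards)
    return max(range(len(mean_rewards)), key=lambda a: mean_rewards[a])

def epsilon_greedy_bandit(arms, pulls=5000, eps=0.1, seed=0):
    """Solve the same problem directly as a bandit with epsilon-greedy."""
    rng = random.Random(seed)
    counts = [0] * len(arms)
    means = [0.0] * len(arms)
    for _ in range(pulls):
        if rng.random() < eps:
            a = rng.randrange(len(arms))          # explore
        else:
            a = max(range(len(arms)), key=lambda i: means[i])  # exploit
        r = arms[a](rng)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]    # incremental mean update
    return max(range(len(arms)), key=lambda i: means[i])

# Three Bernoulli arms with success probabilities 0.2, 0.5, 0.8.
probs = [0.2, 0.5, 0.8]
arms = [lambda rng, p=p: 1.0 if rng.random() < p else 0.0 for p in probs]
best_by_planning = value_iteration_one_state(probs)
best_by_bandit = epsilon_greedy_bandit(arms)
```

Both routes identify the same best arm; the MDP view just carries extra machinery that the problem structure never uses.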
[01:05:55] formulations um I think a second thing is that historically in reinforcement
[01:05:56] is that historically in reinforcement learning and even throughout most of
[01:05:58] learning and even throughout most of this class we focused on I'm going to
[01:06:00] this class we focused on I'm going to learn from this one task from
[01:06:02] learn from this one task from scratch but of course that's not what
[01:06:04] scratch but of course that's not what humans do we constantly are building on
[01:06:05] humans do we constantly are building on our PRI experience we are sort of
[01:06:07] our PRI experience we are sort of imperfect agents for learning across
[01:06:09] imperfect agents for learning across many many many tasks and what we've seen
[01:06:11] many many many tasks and what we've seen from generative AI um like sort of uh
[01:06:14] from generative AI um like sort of uh large language models Etc is that doing
[01:06:17] large language models Etc is that doing many many tasks might be really powerful
[01:06:20] many many tasks might be really powerful and that's been relatively understudied
[01:06:21] and that's been relatively understudied in the RL setting and it might be much
[01:06:23] in the RL setting and it might be much more effective we've seen even in like
[01:06:26] more effective we've seen even in like Alpha zero um and Alpha tens and others
[01:06:28] Alpha zero um and Alpha tens and others that these shared representations can
[01:06:30] that these shared representations can have huge benefits and so those might be
[01:06:32] have huge benefits and so those might be really productive ways to think about
[01:06:34] really productive ways to think about accelerating the speed of decision-
[01:06:36] accelerating the speed of decision- making and learning good data driven
[01:06:38] making and learning good data driven policies I think a third thing is
[01:06:40] policies I think a third thing is thinking about Alternative forms of
[01:06:42] thinking about Alternative forms of feedback assuming you get scaler single
[01:06:45] feedback assuming you get scaler single scaler rewards is pretty limiting
[01:06:48] scaler rewards is pretty limiting particularly now that we have large
[01:06:49] particularly now that we have large language models you could imagine having
[01:06:51] language models you could imagine having really rich feedback or really sparse
[01:06:53] really rich feedback or really sparse feedback like thumbs up thumbs down or
[01:06:55] feedback like thumbs up thumbs down or preference pairs or really detailed
[01:06:57] preference pairs or really detailed examples about how something is wrong or
[01:06:59] examples about how something is wrong or what your preferences are and now that
[01:07:01] what your preferences are and now that we can start to have language as as
[01:07:03] we can start to have language as as rewards I think that's a much richer
[01:07:05] rewards I think that's a much richer opportunity and people are starting to
[01:07:07] opportunity and people are starting to explore this
[01:07:09] explore this already another is sort of just what
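A common way preference pairs become a training signal, used in RLHF-style reward modeling, is the Bradley-Terry model: the probability that item A is preferred to item B is the sigmoid of their score difference. A small illustrative sketch (the items and preference data are made up):

```python
import math

# Preference-pair feedback via the Bradley-Terry model:
#   P(A preferred over B) = sigmoid(r(A) - r(B)).
# We fit a per-item reward score from pairwise comparisons by gradient
# ascent on the log-likelihood.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_rewards(items, prefs, lr=0.5, steps=2000):
    """prefs: list of (winner, loser) pairs; returns a score per item."""
    r = {it: 0.0 for it in items}
    for _ in range(steps):
        for win, lose in prefs:
            # Gradient of log sigmoid(r_w - r_l): push winner up, loser down.
            g = 1.0 - sigmoid(r[win] - r[lose])
            r[win] += lr * g
            r[lose] -= lr * g
    return r

items = ["a", "b", "c"]
# "a" beats "b", "b" beats "c", "a" beats "c": a consistent ranking a > b > c.
prefs = [("a", "b"), ("b", "c"), ("a", "c")]
scores = fit_rewards(items, prefs)
ranking = sorted(items, key=lambda it: -scores[it])
```

The fitted scores then act as a learned scalar reward, which is one bridge from preference pairs back to the standard RL formulation.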
[01:07:11] already another is sort of just what settings we're in most of this class we
[01:07:14] settings we're in most of this class we thought about stochastic settings take
[01:07:16] thought about stochastic settings take an action from a state you get to some
[01:07:19] an action from a state you get to some next state generated sort of
[01:07:20] next state generated sort of stochastically from some sort of like
[01:07:22] stochastically from some sort of like you know uh indifferent process
[01:07:26] you know uh indifferent process but that's not very common in real world
[01:07:28] but that's not very common in real world settings in many real world settings
[01:07:29] settings in many real world settings there are other stakeholders or
[01:07:31] there are other stakeholders or multi-agents um that might be
[01:07:33] multi-agents um that might be adversarial or might be cooperative you
[01:07:35] adversarial or might be cooperative you know you might have a teacher that's
[01:07:36] know you might have a teacher that's helping the agent learn something or you
[01:07:38] helping the agent learn something or you might have an adversary that's competing
[01:07:40] might have an adversary that's competing with that agent and so those settings
[01:07:42] with that agent and so those settings are also really important to
[01:07:44] are also really important to consider and I think another question
[01:07:46] consider and I think another question too is throughout this class we've been
[01:07:49] too is throughout this class we've been thinking about sort of integrating and
[01:07:50] thinking about sort of integrating and doing learning and planning and
[01:07:51] doing learning and planning and decision- making all at once everything
[01:07:53] decision- making all at once everything and that's wonderful and elegant
[01:07:56] and that's wonderful and elegant but there are many approximations to
[01:07:57] but there are many approximations to this so in some other fields they often
[01:07:59] this so in some other fields they often do system identification I mean like you
[01:08:02] do system identification I mean like you might learn how the Markoff decision
[01:08:03] might learn how the Markoff decision process works you learn your Dynamics
[01:08:05] process works you learn your Dynamics model you learn your word model you stop
[01:08:07] model you learn your word model you stop you
[01:08:08] you plan and so while this offers some
[01:08:11] plan and so while this offers some flexibility it also introduces a lot of
[01:08:14] flexibility it also introduces a lot of complexity and again in some areas there
[01:08:16] complexity and again in some areas there might be some really good alternatives
[01:08:17] might be some really good alternatives to
[01:08:18] to this and finally this is one that's
[01:08:20] this and finally this is one that's perhaps closest to my heart which is I
[01:08:21] perhaps closest to my heart which is I think that there's just an enormous room
[01:08:23] think that there's just an enormous room to do better data driven um uh decision-
[01:08:27] to do better data driven um uh decision- making in domains that could benefit so
[01:08:30] making in domains that could benefit so I think there are lots of application
[01:08:31] I think there are lots of application areas we've talked about in class but
[01:08:33] areas we've talked about in class but there's so many areas where I think our
[01:08:34] there's so many areas where I think our society could benefit from better
[01:08:36] society could benefit from better decision- making and so it' be
[01:08:38] decision- making and so it' be incredible to see more of that impact
[01:08:40] incredible to see more of that impact whether it's from the Frameworks we've
[01:08:41] whether it's from the Frameworks we've covered in class or from others and I
[01:08:43] covered in class or from others and I think one of the wonderful things is
[01:08:44] think one of the wonderful things is that you guys are very well equipped now
[01:08:46] that you guys are very well equipped now to go out and start answering these
[01:08:47] to go out and start answering these questions or other ones that you think
[01:08:49] questions or other ones that you think are
[01:08:51] important all right I'll just close with
[01:08:53] important all right I'll just close with two more slides one is that if you like
[01:08:55] two more slides one is that if you like reinforcement learning there is a lot of
[01:08:57] reinforcement learning there is a lot of people at Stanford who think about
[01:08:58] people at Stanford who think about reinforcement learning there are lots of
[01:09:00] reinforcement learning there are lots of classes there's at least another five um
[01:09:03] classes there's at least another five um so there's deep RL with Chelsea there's
[01:09:06] so there's deep RL with Chelsea there's decision-making under uncertainty with
[01:09:07] decision-making under uncertainty with Michael Michael and I both offer
[01:09:09] Michael Michael and I both offer Advanced courses and need a
[01:09:10] Advanced courses and need a decision-making or RL and Ben Ryan Roy
[01:09:13] decision-making or RL and Ben Ryan Roy often also offers a an advanced RL or
[01:09:15] often also offers a an advanced RL or bandaid class so there's lots of places
[01:09:17] bandaid class so there's lots of places to learn more and finally thanks for
[01:09:20] to learn more and finally thanks for being part of the course it's great to
[01:09:21] being part of the course it's great to get to meet everyone it's been and we're
[01:09:23] get to meet everyone it's been and we're really excited to see your posters on
[01:09:24] really excited to see your posters on Wednesday thanks than
Lecture INDEX.md
CS234 – Reinforcement Learning
Playlist: https://www.youtube.com/playlist?list=PLoROMvodv4rN4wG6Nk6sNpTEbuOSosZdX
Total Videos: 16
Transcripts Downloaded: 16
Failed/No Captions: 0
---
Lectures
1. Stanford CS234 Reinforcement Learning I Introduction to Reinforcement Learning I 2024 I Lecture 1
- Video: [https://www.youtube.com/watch?v=WsvFL-LjA6U](https://www.youtube.com/watch?v=WsvFL-LjA6U)
- Transcript: [001_WsvFL-LjA6U.md](001_WsvFL-LjA6U.md)
2. Stanford CS234 Reinforcement Learning I Tabular MDP Planning I 2024 I Lecture 2
- Video: [https://www.youtube.com/watch?v=gHdsUUGcBC0](https://www.youtube.com/watch?v=gHdsUUGcBC0)
- Transcript: [002_gHdsUUGcBC0.md](002_gHdsUUGcBC0.md)
3. Stanford CS234 Reinforcement Learning I Policy Evaluation I 2024 I Lecture 3
- Video: [https://www.youtube.com/watch?v=jjq51TRNVvk](https://www.youtube.com/watch?v=jjq51TRNVvk)
- Transcript: [003_jjq51TRNVvk.md](003_jjq51TRNVvk.md)
4. Stanford CS234 Reinforcement Learning I Q learning and Function Approximation I 2024 I Lecture 4
- Video: [https://www.youtube.com/watch?v=b_wvosA70f8](https://www.youtube.com/watch?v=b_wvosA70f8)
- Transcript: [004_b_wvosA70f8.md](004_b_wvosA70f8.md)
5. Stanford CS234 Reinforcement Learning I Policy Search 1 I 2024 I Lecture 5
- Video: [https://www.youtube.com/watch?v=L6OVEmV3NcE](https://www.youtube.com/watch?v=L6OVEmV3NcE)
- Transcript: [005_L6OVEmV3NcE.md](005_L6OVEmV3NcE.md)
6. Stanford CS234 Reinforcement Learning I Policy Search 2 I 2024 I Lecture 6
- Video: [https://www.youtube.com/watch?v=8PwvNQ5WS-o](https://www.youtube.com/watch?v=8PwvNQ5WS-o)
- Transcript: [006_8PwvNQ5WS-o.md](006_8PwvNQ5WS-o.md)
7. Stanford CS234 Reinforcement Learning I Policy Search 3 I 2024 I Lecture 7
- Video: [https://www.youtube.com/watch?v=4ngb0IZTg8I](https://www.youtube.com/watch?v=4ngb0IZTg8I)
- Transcript: [007_4ngb0IZTg8I.md](007_4ngb0IZTg8I.md)
8. Stanford CS234 Reinforcement Learning I Offline RL 1 I 2024 I Lecture 8
- Video: [https://www.youtube.com/watch?v=IEbuJtjqtMU](https://www.youtube.com/watch?v=IEbuJtjqtMU)
- Transcript: [008_IEbuJtjqtMU.md](008_IEbuJtjqtMU.md)
9. Stanford CS234 I Guest Lecture on DPO: Rafael Rafailov, Archit Sharma, Eric Mitchell I Lecture 9
- Video: [https://www.youtube.com/watch?v=Q7rl8ovBWwQ](https://www.youtube.com/watch?v=Q7rl8ovBWwQ)
- Transcript: [009_Q7rl8ovBWwQ.md](009_Q7rl8ovBWwQ.md)
10. Stanford CS234 Reinforcement Learning I Offline RL 3 I 2024 I Lecture 10
- Video: [https://www.youtube.com/watch?v=F6APGIAm5fw](https://www.youtube.com/watch?v=F6APGIAm5fw)
- Transcript: [010_F6APGIAm5fw.md](010_F6APGIAm5fw.md)
11. Stanford CS234 Reinforcement Learning I Exploration 1 I 2024 I Lecture 11
- Video: [https://www.youtube.com/watch?v=sqYii3nd78w](https://www.youtube.com/watch?v=sqYii3nd78w)
- Transcript: [011_sqYii3nd78w.md](011_sqYii3nd78w.md)
12. Stanford CS234 Reinforcement Learning I Exploration 2 I 2024 I Lecture 12
- Video: [https://www.youtube.com/watch?v=gFJNsfg_35E](https://www.youtube.com/watch?v=gFJNsfg_35E)
- Transcript: [012_gFJNsfg_35E.md](012_gFJNsfg_35E.md)
13. Stanford CS234 Reinforcement Learning I Exploration 3 I 2024 I Lecture 13
- Video: [https://www.youtube.com/watch?v=pc7oayCSZmQ](https://www.youtube.com/watch?v=pc7oayCSZmQ)
- Transcript: [013_pc7oayCSZmQ.md](013_pc7oayCSZmQ.md)
14. Stanford CS234 Reinforcement Learning I Multi-Agent Game Playing I 2024 I Lecture 14
- Video: [https://www.youtube.com/watch?v=UgANzoWc0nc](https://www.youtube.com/watch?v=UgANzoWc0nc)
- Transcript: [014_UgANzoWc0nc.md](014_UgANzoWc0nc.md)
15. Stanford CS234 Reinforcement Learning I Emma Brunskill & Dan Webber I 2024 I Lecture 15
- Video: [https://www.youtube.com/watch?v=FOlPpjNbHjE](https://www.youtube.com/watch?v=FOlPpjNbHjE)
- Transcript: [015_FOlPpjNbHjE.md](015_FOlPpjNbHjE.md)
16. Stanford CS234 Reinforcement Learning I Value Alignment I 2024 I Lecture 16
- Video: [https://www.youtube.com/watch?v=eenJzay5aLo](https://www.youtube.com/watch?v=eenJzay5aLo)
- Transcript: [016_eenJzay5aLo.md](016_eenJzay5aLo.md)