================================================================================ LECTURE 001 ================================================================================ Stanford CS229: Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018) Source: https://www.youtube.com/watch?v=jGwO_UgTS7I --- Transcript

[00:00:03] Welcome to CS229 Machine Learning. Some of you know that I've taught this class at Stanford for a long time, and this is often the class I most look forward to teaching each year, because this is where we've helped, I think, several generations of Stanford students become experts in machine learning, go build many of the products and services and startups that I'm sure many of you, maybe all of you, are using today. So what I want to do today is spend some time talking over logistics, and then spend some time giving you the beginning of an intro talk, a little bit about machine learning. So, about 229: you know, all of you have been reading about AI in the news, about machine learning in the news, and you've probably heard me and others say AI is the
[00:00:59] new electricity: much as the rise of electricity about 100 years ago transformed every major industry, I think AI (we call it machine learning, but the rest of the world seems to call it AI), machine learning and AI and deep learning, will change the world. And I hope that CS229 will give you the tools you need so that you can be one of these future titans of industry, whether you, you know, go help the large tech companies do the amazing things they do, or build your own startup, or go into some other industry: go transform healthcare, or go transform transportation, or go build a self-driving car, and do all of these things that, after this class, I think you'll be able to do. You know, the demand for AI skills, the demand for machine learning skills, is so vast, I think you all know that, and I think it's because machine
[00:01:52] learning has advanced so rapidly in the last few years that there are so many opportunities to apply learning algorithms, right, both in industry as well as in academia. I think today we have English department professors trying to apply learning algorithms to understand history better; we have lawyers trying to apply machine learning to process legal documents; and off campus, every company, both the tech companies as well as a lot of other companies that you wouldn't consider tech companies, everything from manufacturing companies to healthcare companies to logistics companies, is also trying to apply machine learning. So I think that, um, if you look at it on a factual basis, the number of people doing very valuable machine learning projects today is much greater than it was six months ago, and six months ago it was much greater than it was twelve
[00:02:44] months ago. And the amount of value, the amount of exciting, meaningful work being done in machine learning, is very strongly going up. And I think that, given the rise of, you know, the amount of data we have as well as the new machine learning tools that we have, it will be a long time before we run out of opportunities, before society as a whole has enough people with the machine learning skill set. So just as maybe, I don't know, 20 years ago was a good time to start working on this internet thing (a lot of people that started working on the internet like 20 years ago had fantastic careers), I think today is a wonderful time to jump into machine learning, and the odds of you being able to, say, go to a logistics company and find an exciting way to apply machine learning will
[00:03:42] be very high, because chances are that logistics company has no one else even working on this, because, you know, they may not be able to hire a fantastic Stanford student, a graduate of CS229, right? There just aren't a lot of CS229 graduates around. Um, so what I want to do today is do a quick intro talking about logistics, and then we'll spend the second half of the day, you know, giving an overview and talking a little bit more about machine learning. Okay. And, oh, I apologize: I think that this room, according to that sign there, seats what, 300-something students? I think we have not quite 800 people enrolled in this class. So if there are people outside: all of the classes are recorded and broadcast on SCPD, and the videos are usually made available the same day. So for those of you that can't get into the room, my apologies.
[00:04:42] You know, there are some years where even I had trouble getting into the room, but hopefully you can watch all of these things online shortly. [exchange with staff] Yes? Yeah, I don't know, it's a bit complicated. Yeah, thank you, I think it's okay. Yeah, okay. Yeah, for the next few classes you can squeeze in; using the NCC for now might be too complicated. So, quick announcements. Um, oh, I'm sorry, I should have introduced myself: my name is Andrew, and I want to introduce some of the rest of the teaching team as well. [name inaudible] is the class coordinator; she has been playing this role for many years now and helps keep the trains running on time and makes sure that everything in the course happens when it's supposed to. And then [names inaudible] will be the co-head TAs, respectively PhD students working with me, and so bringing a lot of their own technical experience in machine learning as well
[00:05:56] as practical know-how on how to make these things work. And with the large class that we have, we have a large TA team. Maybe I won't introduce all of the TAs here today, but you'll meet many of them throughout the school term. The TAs' expertise spans everything from computer vision and language processing technology to robotics, and so through this quarter, as you work on your class projects, I hope that you get a lot of help and advice and mentoring from the TAs, all of whom have deep expertise not just in machine learning but often in a specific vertical application area of machine learning. So depending on what your project is, we'll try to match you to a TA that can give you the advice that's most relevant to whatever project you end up working on. Um, so, you know, the goal of this class: I hope that after the next ten weeks you will be an expert in machine learning. It turns out
[00:06:55] that, you know, I hope that after this class you'll be able to go out and build very meaningful machine learning applications, either in an academic setting, where hopefully you can apply it to your problems in mechanical engineering, electrical engineering, English, law, education, and all of this wonderful work that happens on campus, or, after you graduate from Stanford, applied to whatever jobs you find. One of the things I find very exciting about machine learning is that it's no longer a sort of pure-tech-company-only kind of thing, right? I think that many years ago machine learning was like a thing that, you know, computer science departments would do, and that the elite AI companies like Google and Facebook and Baidu and Microsoft would do. But now it is so pervasive that even companies that are not traditional tech companies see a huge need to apply
[00:07:51] these tools, and I find a lot of the most exciting work these days there. And maybe some of you know my history: I led the Google Brain team, which helped Google transform from what was already a great company ten years ago to today, which is a great AI company. And then I also led the AI group at Baidu and, you know, led the company's technology strategy, to help Baidu also transform from what was already a great company many years ago to today, arguably China's greatest AI company. So having built the teams that led the AI transformations of two large tech companies, I feel like that's a great thing to do. But even beyond tech, I think that, um, there's a lot of exciting work to do as well, to help other industries, to help other sectors, embrace machine learning and use these tools effectively. But after this class I hope
[00:08:42] that each one of you will be well qualified to get a job at a shiny tech company and do machine learning there, or go into one of these other industries and do very valuable machine learning projects there. Um, and in addition, if any of you are taking this class with the primary goal of being able to do research in machine learning (you know, actually, some of you I know are PhD students), I hope that this class will also leave you well equipped to really read and understand research papers, as well as, you know, be qualified to start pushing forward the state of the art. So let's see. So today, just as machine learning is evolving rapidly, the whole teaching team, we've been constantly updating CS229 as well. So it's actually very interesting: I feel like the pace of progress in machine learning has accelerated, so it actually feels like the amount we change the course
[00:09:49] year over year has been increasing over time. So if you have friends that took the class last year, you know, things are a little bit different this year, because we're constantly updating the class to keep up with what feels like steadily accelerating progress in the whole field of machine learning. So there are some logistical changes; for example, we've gone from handing out paper copies of handouts to trying to make this class digital-only. But let me talk a little bit about prerequisites, as well as, in case your friends have taken this class before, some of the differences for this year. Um, so, prerequisites: we are going to assume that all of you have a knowledge of basic computer skills and principles, so, you know, big-O notation, queues, stacks, binary trees; hopefully you understand what all of those concepts are.
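[Editor's note] As a quick self-check on the computer-science prerequisites listed here (queues, stacks, binary trees, and their big-O costs), a few lines of Python cover them; this sketch is illustrative and not from any course material:

```python
from collections import deque

# Queue: first-in, first-out. deque gives O(1) appends and pops at both ends.
q = deque()
q.append(1); q.append(2)
assert q.popleft() == 1  # the oldest element leaves first

# Stack: last-in, first-out. A plain list gives amortized O(1) push/pop at the end.
s = []
s.append(1); s.append(2)
assert s.pop() == 2  # the newest element leaves first

# Binary tree node (illustrative class, not course code): an in-order traversal
# of a binary search tree visits keys in sorted order, in O(n) time for n nodes.
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def inorder(node):
    if node is None:
        return []
    return inorder(node.left) + [node.key] + inorder(node.right)

root = Node(2, Node(1), Node(3))
assert inorder(root) == [1, 2, 3]
```

If these three snippets look familiar, the systems-side prerequisites should pose no problem.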
[00:10:43] We'll also assume that all of you have a basic familiarity with probability, right: that hopefully you know what a random variable is, what the expected value of a random variable is, what the variance of a random variable is. And for some of you, maybe especially the SCPD students taking this remotely, if it's been, you know, some number of years since you last took a probability and statistics course, we will have review sessions on Fridays where we'll go over some of this prerequisite material as well. Hopefully you know what a random variable is and what the expected value is, but if you are a little bit fuzzy on those concepts, we'll go over them again at a discussion section on Friday. We'll also assume familiarity with basic linear algebra, so hopefully you know what a matrix is, what a vector is, how to multiply two matrices, and how to multiply a matrix and a vector. If you know what an eigenvector is, then that's even better; if you're not
then that's even better if you're not quite sure what an eigenvector is look [00:11:34] quite sure what an eigenvector is look over it uuugh but yeah we'll go over it [00:11:38] over it uuugh but yeah we'll go over it I guess um and then um a large part of [00:11:43] I guess um and then um a large part of this class is having you practice these [00:11:49] this class is having you practice these ideas through the homeworks as well as I [00:11:52] ideas through the homeworks as well as I mentioned later a open-ended project and [00:11:55] mentioned later a open-ended project and so one there we've actually until now we [00:12:00] so one there we've actually until now we used to use MATLAB in octave for their [00:12:03] used to use MATLAB in octave for their premiere assignments but this year we're [00:12:06] premiere assignments but this year we're trying to ship the permian Simon's to [00:12:08] trying to ship the permian Simon's to Python and so I think for a long time [00:12:11] Python and so I think for a long time even today you know I sometimes use [00:12:14] even today you know I sometimes use octaves their prototype because the [00:12:16] octaves their prototype because the syntax the octave is so nice and just [00:12:18] syntax the octave is so nice and just run very simple experiments very quickly [00:12:21] run very simple experiments very quickly but I think the Machine there in the [00:12:23] but I think the Machine there in the world [00:12:24] world you know really migrating I think from [00:12:27] you know really migrating I think from MATLAB Python world to increasing using [00:12:30] MATLAB Python world to increasing using MATLAB octave world to increasingly a [00:12:33] MATLAB octave world to increasingly a Python maybe and then eventually for [00:12:36] Python maybe and then eventually for production Java or C++ kind of world and [00:12:38] production Java or C++ kind of world and so we're rewriting a lot of the [00:12:40] so we're rewriting a lot 
[00:12:43] [The co-head TAs] have been driving that process, so that for this course you could do more of the assignments, maybe most, maybe all of the assignments, in Python and NumPy instead. Now, a note on the honor code as well: we actually encourage you to form study groups. So, you know, um, I'm fascinated by education; for a long time I've been studying education and pedagogy and how instructors like us can help support you to learn more efficiently, and one of the lessons I've learned from the educational research literature is that for highly technical classes like this, if you form study groups, you will probably have an easier time, right? So CS229, I would say, is highly technical material: there's a lot of math, the problem sets are hard, and if you have a group of friends to study with, you
[00:13:38] probably have an easier time, because you can ask each other questions and work together and help each other. Um, where we ask you to draw the line, what we ask you to do relative to the standards of the honor code, is: we ask that you do the homework problems by yourself, right? And more specifically, it's okay to discuss the homework problems with friends, but after discussing homework problems with friends we ask you to go back and write up the solutions by yourself, without referring to notes that, you know, you and your friends have developed together. Okay? The class's honor code is written clearly on the class handouts, posted digitally on the website. So if you ever have any questions about what kind of collaboration is allowed and what isn't allowed, please refer to that written document on the course website, where we describe this more clearly. But out of respect for the Stanford honor
well as the [00:14:30] code as well as the your students kind of doing their own [00:14:32] your students kind of doing their own work we asked you to basically do your [00:14:34] work we asked you to basically do your own work for the soca to discuss it but [00:14:38] own work for the soca to discuss it but after discussing home problems with [00:14:39] after discussing home problems with friends ultimately we asked you to write [00:14:41] friends ultimately we asked you to write up your problems by yourself so that the [00:14:43] up your problems by yourself so that the homework submissions reflect your own [00:14:46] homework submissions reflect your own work right and I care about this because [00:14:49] work right and I care about this because turns out that having CS 239 you know CS [00:14:52] turns out that having CS 239 you know CS 229 is one of those classes that [00:14:54] 229 is one of those classes that employers recognize I don't know if you [00:14:57] employers recognize I don't know if you guys know but they're been um [00:14:59] guys know but they're been um companies that have put up job ads that [00:15:01] companies that have put up job ads that say stuff like so long as you got solace [00:15:04] say stuff like so long as you got solace you complete the CST three now and we [00:15:06] you complete the CST three now and we guarantee you get an interview right [00:15:07] guarantee you get an interview right I've seen stuff like that and so I think [00:15:10] I've seen stuff like that and so I think you know in order to maintain that [00:15:13] you know in order to maintain that sanctity of what it means to be a CSU to [00:15:15] sanctity of what it means to be a CSU to nine computer I think and I all said all [00:15:17] nine computer I think and I all said all of you so the really do work or stay [00:15:21] of you so the really do work or stay within the bounds of accepted of [00:15:22] within the bounds of accepted of acceptable collaboration 
[00:15:25] relative to the honor code. Let's see. And I think that, um... and I think that one of the best parts of CS229, it turns out, is... excuse me. [trouble with the projector] Oh yeah, sorry, I'm going to try looking for the mouse cursor. All right, so that might display on... no, it's not mirrored, so this is a little bit awkward. Um, so one of the best parts of the class is... sorry about that. Right, never mind, I won't do this; you can do that yourself online later. Yeah, I started using Firefox recently in addition to Chrome; it was just a mix-up. Um, one of the best parts of the class is the class project. And so, you know, one of the goals of the course is to leave you well qualified to do a meaningful machine learning project, and so one of the best ways to make sure you have that skill set is through this class, and hopefully, with the help of some of the
TAs, we want to support you to work in a small group to complete a meaningful machine learning project. [00:17:20] And so one thing I hope you start doing, maybe later today, is to start brainstorming with your friends some of the class projects you might work on. The most common class project people do in CS 229 is: pick an area — pick an application that excites you — and apply machine learning to it, and see if you can build a good machine learning system for some application in that area. [00:17:46] And if you go to the course website, cs229.stanford.edu, and look at previous years' projects, you'll see machine learning projects applied to pretty much every imaginable application under the sun — everything from diagnosing cancer, to creating art, to lots of projects applied to other areas of engineering, applying
to application areas in EE or mechanical engineering or civil engineering or earthquake prediction and so on, to applying it to understanding literature. [00:18:18] And so if you look at previous years' projects, many of which are posted on the course website, you can use that as inspiration — to see the types of projects students completing this class are able to do, to get a sense of what you'll be able to do at the conclusion of this class, and to see if previous years' projects give you inspiration for what you might do yourself. [00:18:48] We also invite you to do class projects in small groups, and so after class today I'd also encourage you to start making friends in the class, both for the purpose of forming study groups as well
as for the purpose of maybe finding a small group to do a class project with. [00:19:03] We ask you to form project groups of up to size three; most project groups end up being size two or three. If you insist on doing it by yourself, without any partners, that's actually okay too — you're welcome to do that — but I think having one or two others to work with often gives you an easier time. [00:19:25] And for projects of exceptional scope — if you have a very large project that just cannot be done by three people — let us know, and we're open to working with some project groups of size four. But we do hold projects done by a group of four to a higher standard than projects done by groups of size one to three. So what that means is that if your project team is one, two, or three persons, the grading uses one criterion; if your project group is bigger than three persons, we use a
stricter criterion when it comes to grading class projects, okay? [00:20:03] Oh, and that reminds me — since this class starts at 9:30 a.m. on the first day of the quarter, for many of you this may be your very first class at Stanford. How many of you — is this your very first class at Stanford? Wow, cool, okay, awesome — great, welcome to Stanford. [00:20:23] And if someone next to you just raised their hand — actually, raise your hands again — I hope that maybe after class today, if someone near you raised their hand, you'll welcome them to Stanford, say hi, introduce yourself, and make friends. I'll do it too. Cool — nice to see so many of you. [00:20:53] All right, so just a bit more logistics. Let's see — in addition to the main lectures that we'll have here on Mondays and Wednesdays, CS 229 also has discussion
sections, held on Fridays. Everything we do — all the lectures and the discussion sections — is recorded and broadcast through SCPD, through the online website. [00:21:22] The discussion sections are usually taught by the TAs on Fridays, and attendance at discussion sections is optional. What I mean is: there won't be material on the midterm that sneaks in from the sections — it's a hundred percent optional, and you'll be able to do all the homework and complete the projects without attending the discussion sections. [00:21:43] But for the first three discussion sections — this week, next week, and the week after that — we'll use the discussion sections to go over prerequisite material: the TAs will go over linear algebra, or basic probability and
statistics, and teach a little about Python and NumPy, in case you're less familiar with those frameworks. We'll do that for the first few weeks, and then for the discussion sections held later this quarter, we'll usually use them to go over more advanced, optional material. [00:22:12] For example, a lot of the learning algorithms you'll hear about in this class rely on convex optimization algorithms, but we want to focus the class on the learning algorithms themselves and spend less time on convex optimization — so if you want to come and hear about more advanced concepts in convex optimization, we'll defer that to a discussion section. And there are a few other advanced topics — hidden Markov models, time series — that we're planning to defer to the Friday discussion sections. [00:22:44] Okay, so let's see — cool. [00:23:00] Oh, and a final bit of logistics: there are digital tools that some of you have seen, but for this class we'll drive a
lot of the discussion through the online website, Piazza. How many of you have used Piazza before? Okay, cool — mostly? Wow, all of you — that's pretty amazing, good. [00:23:18] It's an online discussion board, for those of you that haven't seen it before, and I definitely encourage you to participate actively on Piazza, and also to answer other students' questions. I think one of the best ways to learn, as well as to contribute back to the course as a whole, is, if you see someone else ask a question on Piazza, to jump in and help answer it — that often helps you and helps your classmates, so I strongly encourage you to do that. [00:23:45] For those of you that have a private question — sometimes we have students reaching out to us with a personal matter, or something that's not appropriate to share on a public forum — in which case you're welcome to email us at the class email
address as well. The class email address — the teaching staff's email address — is on the course website; you can find it there under "contact us". [00:24:07] But for anything technical, or anything reasonable to share with the class — which includes most technical questions and most logistical questions, questions like, "can you confirm what date the midterm is?" and so on — for questions that are not personal or private in nature, I strongly encourage you to post on Piazza rather than emailing us, because statistically you'll actually get a faster answer by posting on Piazza than if you wait for one of us to respond to you. [00:24:40] And we'll be using Gradescope as well for online grading — if you don't know what Gradescope is, don't worry about it; we'll
send you links and show you how to use it. [00:24:52] Oh, and again — one more change, which is a real thing to plan for, unlike previous years when we taught CS 229. We're constantly updating the syllabus — the technical content — to try to show you the latest machine learning algorithms, and the two big changes we're making this year are: one is Python instead of MATLAB, and the other is that instead of having a midterm exam — a timed midterm — we're planning to have a take-home midterm this quarter. [00:25:34] So — I know some people just breathed in sharply when I said that; I don't know what that means. Was that shock or happiness? Don't worry, midterms are fun — you'll love it. [00:25:51] All right, so that's it for the logistical aspects. Let me check — are there any questions? [00:25:57] Oh yeah, go ahead.
[00:26:16] [A student asks whether CS 229A is offered in other quarters.] Oh, let's see — I think it's offered in spring. Oh yes, and I was teaching it, so someone else is teaching it in spring quarter. I actually did not know it was going to be offered in winter — [00:26:47] I think it's being taught in spring, and I don't think it's offered in winter. [00:26:58] [A student asks: will the discussion sections be recorded?] Yes, they will be. Oh, and by the way, if you wonder why I'm repeating the question — I know it feels weird — I'm repeating it for the microphone, so that people watching at home can hear the question. But both the lectures and the discussion sections will be recorded and put on the website. Maybe the one thing we do that's not recorded and broadcast is the office hours. [00:27:25] Oh, but I think this year we have 60 — how many — 60 office hours per week, right? Yeah, so
hopefully — again, we're constantly trying to improve the course; in previous years, one piece of feedback we got was that the office hours were really crowded, so we have 60 office-hour slots per week this year, which seems like a lot. So hopefully, if you need to track down one of us — track down the TAs to get help — that'll make it easier for you to do so. [00:27:54] Okay. [A student asks about logistics — when the homeworks will be due.] Yes, so we have four planned homeworks, and if you go to the course website and click on the syllabus link, there's a calendar with when each homework assignment goes out and when it's due. So four homeworks, a project proposal due a few weeks from now, and then final projects due at the end of the quarter — all the other exact dates are listed on the course website. [00:28:43] Oh, sure. [A student asks:] yes — the
difference between this class and CS 229A? Let me think how to answer that. Yeah, I was debating earlier this morning how to answer that — I've gotten asked that a few times. [00:28:58] I think what has happened at Stanford is that the volume of demand for machine learning education is just skyrocketing, because everyone wants to learn this stuff, and so the computer science department has been trying to grow the number of machine learning offerings we have. [00:29:20] We've actually kept the enrollment of CS 229A at a relatively low number — at a hundred students — so I actually don't want to encourage too many of you to sign up, because I think we might be hitting the enrollment cap already. So please don't all sign up for CS 229A, because CS 229A does not have the capacity this quarter. But CS 229A is a much less mathematical and much more
applied — relatively more applied — version of machine learning. [00:29:52] I guess I'm teaching CS 229A, CS 230, and CS 229 this quarter. Of the three, CS 229 is the most mathematical; it's a little bit less applied than CS 229A, which is more applied machine learning, and CS 230, which is deep learning. My advice to students is — let me write this down. [00:30:21] So CS 229A is taught in a flipped-classroom format, which means that students taking it will mainly watch videos on the Coursera website and do a lot of programming exercises, and then meet for weekly discussion sections — but it's a smaller class with capped enrollment. I would advise you that if you feel ready for CS 229 and CS 230, do those; but CS 229, because of the math, is a very heavy-workload and pretty challenging class, and so if you're not sure you're ready for CS 229, CS 229A may be a good thing to take first, and
then CS 229. CS 229 covers a broader range of machine learning algorithms, and CS 230 is more focused on deep learning algorithms specifically — a much narrower set of algorithms, but one of the hottest areas there is. [00:31:26] There is not that much overlap in content between the three classes, so if you actually take all three, you learn relatively different things from each of them. In the past we've had students simultaneously take 229 and 229A, and there is a little bit of overlap — they do cover related algorithms, but from different points of view — so some people actually take multiple of these classes at the same time. But CS 229A is more applied — a bit more practical know-how, hands-on, and so on — and much less mathematical; and CS 230 is also less mathematical, more applied, more about
kind of getting things to work, whereas in CS 229 we do much more mathematical derivations. [00:32:22] [A student asks to run a quick poll of the class.] So — I would generally prefer students not do that, in the interest of time, but — what do you want? Oh, I see. Sure, go for it. [The student asks who is enrolled in CS 229A or CS 230.] Not that many of you — interesting. Oh, that's actually really interesting, cool. Yeah, thank you. [00:32:49] I just didn't want to set a precedent of students using this as a forum to run surveys, but that was an interesting question, so thank you. Cool, all right. [00:33:04] And by the way, one thing about Stanford: the AI world is bigger than machine learning, right, and machine learning is bigger than deep learning. One of the great things about being a Stanford student is you can — and I think should — take multiple classes. I think that CS 229
has for many years been the core of the machine learning world at Stanford, but even beyond CS 229, it's worth your while to take multiple classes covering multiple perspectives. [00:33:35] So if you want to be really effective after you graduate from Stanford: you do want to be an expert in machine learning, you do want to be an expert in deep learning, and you probably want to know some statistics; maybe you want to know a bit of convex optimization, maybe a bit more about reinforcement learning, a little bit about planning — a little bit about lots of things. So I actually encourage you to take multiple classes like this. [00:34:03] If there are no more questions, let's go on to talk a bit about machine learning. [00:34:15] All right, so in the remainder of this class, what I'd like to do is give a quick overview of the major areas of machine learning, and also
give you a quick overview of the things you'll learn in the next ten weeks. [00:34:38] So what is machine learning? It seems to be everywhere these days, and it's useful in so many places. And I feel — just to share my personal bias — you read the news about these people making so much money building learning algorithms; I think that's great, and I hope all of you go make a lot of money. But the thing I find even more exciting is the meaningful work we could do. [00:35:06] I think that every time there's a major technological disruption — which there is now, through machine learning — it gives us an opportunity to remake large parts of the world, and if we behave ethically and in a principled way, and use these superpowers of machine learning to do things that
helps people's lives right maybe we could maybe you can improve the [00:35:25] maybe we could maybe you can improve the health care system maybe you can improve [00:35:27] health care system maybe you can improve give every child a personalized tutor [00:35:30] give every child a personalized tutor maybe you can make a democracy run [00:35:32] maybe you can make a democracy run better rather than make it run worse but [00:35:34] better rather than make it run worse but I think that the meaning I find in [00:35:37] I think that the meaning I find in machine learning is that there's so many [00:35:38] machine learning is that there's so many people that are so eager for us to go in [00:35:41] people that are so eager for us to go in and help them with these tools that if [00:35:44] and help them with these tools that if you become good at these tools it gives [00:35:47] you become good at these tools it gives you an opportunity to really remake some [00:35:49] you an opportunity to really remake some peace some meaningful piece of the world [00:35:52] peace some meaningful piece of the world hopefully in a way that helps other [00:35:54] hopefully in a way that helps other people and makes the world kind of makes [00:35:56] people and makes the world kind of makes the world a better place is very cliche [00:35:58] the world a better place is very cliche in Silicon Valley but but I think you [00:36:00] in Silicon Valley but but I think you know with these tools you actually have [00:36:02] know with these tools you actually have the power to do that and they've got [00:36:04] the power to do that and they've got make a ton of money that's great too but [00:36:05] make a ton of money that's great too but I find a much greater meaning in the [00:36:07] I find a much greater meaning in the work we could do but um [00:36:14] work we could do but um despite all the excitement of machine [00:36:15] despite all the excitement of machine learning what is machine learning so 
[00:36:17] Let me give you a couple of definitions of machine learning. Arthur Samuel, whose claim to fame was building a checkers-playing program, defined it as the field of study that gives computers the ability to learn without being explicitly programmed. And it's interesting: when Arthur Samuel, many decades ago, wrote the checkers-playing program, the debate of the day was whether a computer could ever do something that it wasn't explicitly told to do. Arthur Samuel wrote a checkers-playing program that, through self-play, learned which patterns of the checkerboard were more likely to lead to a win versus more likely to lead to a loss, and it learned to be even better than Arthur Samuel, the author himself, at playing checkers. Back then this was viewed as a remarkable result: that a programmer could write a piece of software to do something that the programmer himself could not do, because this program became better than Arthur Samuel at the task of playing checkers. Today we are used to computers, or machine learning algorithms, outperforming humans on so many tasks, but it turns out that when you choose a narrow task, like speech recognition on a certain type of task, you can maybe surpass human-level performance; or if it's a narrow task like playing the game of Go, then by throwing tons of computation power at it, and self-play, you can have a computer become very good at these narrow tasks. But this was maybe one of the first such examples in the history of computing, and I think this is still one of the most widely cited definitions: gives computers the ability to learn without being explicitly programmed.

[00:38:15] My friend Tom Mitchell, in his textbook, defined it as a well-posed learning problem: a program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. And I asked Tom if he wrote this definition just because he wanted it to rhyme; he did not say yes, but I don't know. In this definition, for the case of playing checkers, the experience E would be the experience of having the checkers program play tons of games against itself; computers have lots of patience and will sit there for days playing games of checkers against themselves, so that's the experience E. The task T is the task of playing checkers, and the performance measure P maybe was the chance of this program winning the next game of checkers it plays against the next opponent. So we say that this is a well-posed learning problem of learning to play checkers.
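Mitchell's definition is easy to see in code. Here is a tiny illustrative sketch, not from the lecture — the coin-flip setup and all numbers are invented: the experience E is a set of observed coin flips, the task T is predicting the coin's bias, and the performance measure P is the squared error of the estimate, which should shrink as E grows.

```python
import random

random.seed(0)
TRUE_P = 0.7  # hidden bias of the coin: the thing the program must learn

def experience(n):
    """E: observe n coin flips (1 = heads, 0 = tails)."""
    return [1 if random.random() < TRUE_P else 0 for _ in range(n)]

def learn(flips):
    """The 'program': estimate P(heads) from its experience."""
    return sum(flips) / len(flips)

def performance(estimate):
    """P: squared error on the task T of predicting the coin's bias."""
    return (estimate - TRUE_P) ** 2

# As experience E grows, performance P (error) generally improves
for n in [10, 100, 10000]:
    est = learn(experience(n))
    print(f"E = {n:>5} flips -> error {performance(est):.5f}")
```

With more flips the estimate concentrates around the true bias, which is exactly the "improves with experience E" clause of the definition.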
[00:39:18] Now, within this set of ideas of machine learning, there are many different tools we use, and in the next ten weeks you'll learn about a variety of these different tools. The first of them, and the most widely used one, is supervised learning. Let's see, I want to switch to the whiteboard; do you guys know how I erase the screen?

[00:39:59] So what I want to do today is go over some of the major categories of machine learning tools, so that you know what you'll learn by the end of this quarter. The most widely used machine learning tool today is supervised learning. Actually, let me check: how many of you know what supervised learning is? Two-thirds, half of you maybe? Okay, cool, let me just briefly define it. Here's one example. Let's say you have a database of housing prices, and I'm going to plot your data set where on the horizontal axis I plot the size of the house in square feet, and on the vertical axis I plot the price of the house, and maybe the data set looks like that. The horizontal axis, I guess we call this X, and the vertical axis we'll call Y. The supervised learning problem is: given a data set like this, find the relationship mapping from X to Y. So for example, let's say you are fortunate enough to own a house in Colorado and you're trying to sell it, and you want to know how to price the house. Maybe your house has a size of that amount on the horizontal axis; this is 500 square feet, 1,000 square feet, 1,500 square feet, so your house is 1,250 square feet, and you want to know how to price this house. Given this data set, one thing you can do is fit a straight line to it, and then you could predict the price to be whatever value you read off on the vertical axis.

[00:41:59] So in supervised learning you are given a data set with inputs X and labels Y, and your goal is to learn a mapping from X to Y. Now, fitting a straight line to the data is maybe the simplest possible learning algorithm, one of the simplest learning algorithms. Given a data set like this, there are many possible ways to learn the mapping, the function mapping from the input size to the estimated price, and so maybe you want to fit a quadratic function instead; maybe that actually fits the data a little bit better. How you choose among different models, either automatically or manually, will be something we'll spend a lot of time talking about.
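The straight-line fit described here can be sketched in a few lines of Python. The housing numbers below are made up for illustration, and the closed-form least-squares solution used is standard, though the lecture hasn't derived it yet.

```python
# Toy housing data (invented numbers, for illustration only):
# xs = size in square feet, ys = price in thousands of dollars
xs = [500, 750, 1000, 1500, 2000, 2500]
ys = [100, 150, 190, 280, 370, 450]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form least-squares fit of the straight line y = w*x + b
w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
b = mean_y - w * mean_x

# "Read off" the predicted price on the fitted line for a 1,250 sq ft house
predicted = w * 1250 + b
print(f"predicted price for 1250 sq ft: about ${predicted:.0f}k")
```

Swapping the straight line for a quadratic is a one-line model change; deciding which fit is actually better is the model-selection question the lecture defers to later.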
[00:42:53] Now, to define a few more terms: this particular example is a problem called a regression problem, and the term regression refers to the fact that the value Y you're trying to predict is continuous. In contrast, here's a different type of problem, a problem that some friends of mine were working on, and I'll simplify it. It was a healthcare problem where they were looking at breast cancer, breast tumors, and trying to decide if a tumor is benign or malignant. So a tumor, sort of a lump in a woman's breast, can be malignant, or cancerous, or benign, meaning roughly that it's not that harmful. And so on the horizontal axis you plot the size of a tumor, and on the vertical axis you plot: is it malignant or not? Malignant means harmful, and some tumors are harmful and some are not, so whether it's malignant or not takes only two values, one or zero, and you may have a data set like that.

[00:44:17] Given this, can you learn a mapping from X to Y, so that if a new patient walks into the doctor's office and the tumor size is, say, this, can a learning algorithm figure out that, based on this data set, it looks like there's a high chance that that tumor is malignant? So this is an example of a classification problem, and the term classification refers to the fact that Y here takes on a discrete number of values. For a regression problem Y is a real number; I guess technically prices can be rounded off to the nearest dollar, so prices aren't really real numbers, because you'd probably not price a house at, like, pi times a million dollars or whatever; but for all practical purposes prices are continuous, so we call housing price prediction a regression problem. Whereas if the possible output takes two values, zero or one, we call that a classification problem. If you have K discrete outputs, so if the tumor can be malignant, or if there are five types of cancer, and so you have one of five possible outputs, then that's also a classification problem, where the output is discrete.

[00:45:40] Now, I want to find a different way to visualize this data set, which is: let me draw a line on top, and I'm just going to map all this data on the horizontal axis up onto a line. I hope what I did was clear: I took the two sets of examples, the positive and negative examples, where a positive example is a one and a negative example is a zero, and I pushed all of these examples up onto a straight line, and I used two symbols: I use O's to denote negative examples, and I use crosses to denote positive examples.
[00:46:31] Okay, so this is just a different way of visualizing the same data, but drawing it on a line and using two symbols to denote the two discrete values, zero and one. Now, it turns out that in both of these examples the input X was one-dimensional; it was a single real number. For most of the machine learning applications you'll work with, the input X will be multi-dimensional: you won't be given just one number and asked to predict another number; instead you'll often be given multiple features, multiple numbers, to predict another number. So for example, instead of just using tumor size to estimate malignancy, malignant versus benign tumors, you may instead have two features, where one is the tumor size and the second is the age of the patient, and be given a data set that looks like that. Now your task is, given these two input features, so X is tumor size and age, like a two-dimensional vector, to predict whether a given tumor is malignant or benign. So a new patient walks into the doctor's office, and the tumor size is here and the age is here, so at that point there, hopefully you conclude that this patient's tumor is probably benign; it corresponds to a negative example. One thing you'll learn next week is a learning algorithm that can fit a straight line to the data, kind of like that, to separate out the positive and negative examples, separate out the O's and the crosses; next week you'll learn about the logistic regression algorithm, which can do that.

[00:48:48] Okay, so one of the most interesting things you'll learn about is, let's see: in this example I drew a data set with two input features. I said I have friends who actually worked on the breast cancer prediction problem; in practice you usually have a lot more than one or two features, and usually you have so many features you can't plot them on the board. For an actual breast cancer prediction problem my friends were working on, they were using many other features, such as, and don't worry about what these mean: clump thickness, uniformity of cell size, uniformity of cell shape, adhesion, how well the cells stick together. Don't worry about what these mean, but if you're actually doing this in an actual medical application, there's a good chance you'll be using a lot more features than just two, and this means that you actually can't plot this data; it's too high-dimensional. You can't plot things in more than three, maybe four dimensions.
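The logistic regression idea just mentioned, fitting a straight line to separate the O's from the crosses, can be sketched as follows. The patient data is invented, ages are pre-scaled by 1/100 so the two features are on comparable scales, and plain batch gradient descent on the logistic loss is one standard way to fit it; the lecture derives the details in later sessions.

```python
import math

# Invented toy data: each patient is [tumor size in cm, age/100];
# label 1 = malignant, 0 = benign
X = [[1.0, 0.30], [1.5, 0.45], [2.0, 0.35], [2.5, 0.70],
     [3.0, 0.55], [3.5, 0.65], [4.0, 0.50], [4.5, 0.75]]
y = [0, 0, 0, 1, 1, 1, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# The decision boundary is the straight line w0*size + w1*age + b = 0
w, b = [0.0, 0.0], 0.0
lr = 0.1
for _ in range(20000):          # batch gradient descent on the logistic loss
    g0 = g1 = gb = 0.0
    for (x0, x1), yi in zip(X, y):
        err = sigmoid(w[0] * x0 + w[1] * x1 + b) - yi
        g0 += err * x0
        g1 += err * x1
        gb += err
    w[0] -= lr * g0 / len(X)
    w[1] -= lr * g1 / len(X)
    b -= lr * gb / len(X)

def p_malignant(size_cm, age_years):
    return sigmoid(w[0] * size_cm + w[1] * age_years / 100 + b)

print(p_malignant(1.2, 40))   # small tumor, younger patient: low probability
print(p_malignant(4.2, 68))   # large tumor, older patient: high probability
```

The learned line plays exactly the role described on the board: points on one side are predicted benign, points on the other side malignant.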
[00:49:51] And so with all these features it's difficult to plot this data; I'll come back to this in a second when we talk about learning theory. One of the things you'll learn, as we develop algorithms, is how to build regression algorithms or classification algorithms that can deal with these relatively larger numbers of features. One of the most fascinating results you'll learn about is an algorithm called the support vector machine, which uses not one or two or three or ten or a hundred or a million input features, but an infinite number of input features. Just to be clear: in the first example the state of a patient was represented as one number, the tumor size; in this example you get two features, so the state of a patient is represented using two numbers, tumor size and age; if you use this list of features, maybe a patient arrives represented with five or six numbers. But there's an algorithm called the support vector machine that allows you to use an infinite-dimensional vector to represent patients. And how do you deal with that? How can a computer even store an infinite-dimensional vector? In computer memory you can store one real number, two real numbers, but you can't store an infinite number of real numbers without running out of memory, or processor speed, or whatever. So how do you do that? When we talk about support vector machines, and specifically the technical method called kernels, you'll learn how to build learning algorithms that work with an infinitely long list of features, for which you can imagine that if you have an infinitely long list of numbers to represent a patient, that might give you a lot of information about that patient, and so that turns out to be one of the relatively effective learning algorithms.
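The kernel idea can be made concrete with a small sketch; this particular kernel and expansion are a textbook-standard example, not something derived in this lecture. For one-dimensional inputs, the Gaussian (RBF) kernel exp(-(x-z)^2/2) is exactly the inner product of two infinite-dimensional feature vectors phi(x) whose k-th entry is exp(-x^2/2) * x^k / sqrt(k!). The code compares a truncated explicit inner product with the kernel value, which is computed without ever storing the infinite list.

```python
import math

def rbf_kernel(x, z):
    """Gaussian kernel: an inner product of infinite feature vectors,
    computed in constant time."""
    return math.exp(-((x - z) ** 2) / 2.0)

def phi(x, n_terms):
    """First n_terms entries of the *infinite* feature vector of scalar x:
    phi_k(x) = exp(-x^2/2) * x^k / sqrt(k!)"""
    scale = math.exp(-x ** 2 / 2.0)
    feats, fact = [], 1.0
    for k in range(n_terms):
        if k > 0:
            fact *= k          # fact = k! after this update
        feats.append(scale * x ** k / math.sqrt(fact))
    return feats

x, z = 0.8, 1.3
# Truncated explicit inner product converges to the kernel value as we
# keep more of the infinitely many features...
approx = sum(a * b for a, b in zip(phi(x, 30), phi(z, 30)))
exact = rbf_kernel(x, z)       # ...but the kernel never materializes them
print(approx, exact)
```

This is the point of the kernel trick: the algorithm only ever needs inner products between feature vectors, so it can work in an infinite-dimensional space while storing nothing infinite.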
of information [00:51:41] might give you a lot of information about that patient and so that is one of [00:51:44] about that patient and so that is one of the relatively effective learning [00:51:46] the relatively effective learning algorithm problems um so that's [00:51:52] algorithm problems um so that's supervised learning and you know let me [00:51:54] supervised learning and you know let me just some play a video show you a fun [00:52:00] just some play a video show you a fun slightly older example of supervised [00:52:03] slightly older example of supervised there in the previous hands what this [00:52:04] there in the previous hands what this means but at the heart of supervised [00:52:08] means but at the heart of supervised learning is the idea that during [00:52:10] learning is the idea that during training you are given inputs X together [00:52:14] training you are given inputs X together with the labels Y and you're given both [00:52:16] with the labels Y and you're given both at the same time and the job of your [00:52:18] at the same time and the job of your learning algorithm is to find a mapping [00:52:21] learning algorithm is to find a mapping so that given a new X you can map it to [00:52:25] so that given a new X you can map it to the most appropriate output Y so this is [00:52:28] the most appropriate output Y so this is a very old video made by a Dean [00:52:31] a very old video made by a Dean Pomerleau known for a long time as well [00:52:32] Pomerleau known for a long time as well on using supervised learning for [00:52:35] on using supervised learning for autonomous driving this does not save [00:52:37] autonomous driving this does not save the art for Toms driving anymore but it [00:52:39] the art for Toms driving anymore but it actually does remarkably well oh and as [00:52:42] actually does remarkably well oh and as you you hear a few technical terms like [00:52:45] you you hear a few technical terms like back propagation you learn all 
those [00:52:47] back propagation you learn all those techniques in this cause and by the end [00:52:50] techniques in this cause and by the end of class you've really built a learning [00:52:51] of class you've really built a learning algorithm much more effective than what [00:52:52] algorithm much more effective than what you see here but let's let's see this [00:52:54] you see here but let's let's see this application [00:52:59] could you turn up the volume maybe have [00:53:01] could you turn up the volume maybe have that are you guys getting volleyball [00:53:04] that are you guys getting volleyball yeah I see [00:53:13] alright I'll narrate this Oh so I'll be [00:53:17] alright I'll narrate this Oh so I'll be using artificial neural network to drive [00:53:19] using artificial neural network to drive this vehicle that was built at carnegie [00:53:21] this vehicle that was built at carnegie mellon university many years ago and [00:53:24] mellon university many years ago and what happens is during training it [00:53:27] what happens is during training it watches the human drive the vehicle and [00:53:30] watches the human drive the vehicle and I think ten times a second it digitizes [00:53:34] I think ten times a second it digitizes the image in front of the vehicle and so [00:53:37] the image in front of the vehicle and so that's a picture taken by a front-facing [00:53:40] that's a picture taken by a front-facing camera and what it does is in order to [00:53:44] camera and what it does is in order to collect labelled data the car while the [00:53:46] collect labelled data the car while the human is driving it records both the [00:53:49] human is driving it records both the image such as the scene here as well as [00:53:52] image such as the scene here as well as the steering direction that was chosen [00:53:53] the steering direction that was chosen by human so at the bottom here is the [00:53:56] by human so at the bottom here is the image turned into 
grayscale and lower [00:53:58] resolution, and on top, let me pause this for a second, this is the driving direction. The font's kind of blurry, but this text says driving direction. So this is the Y label, [00:54:11] the label Y that the human driver chose, and the position of this white bar, of this white blob, shows how the human is choosing to steer the car. So in this image the white blob is a little bit to the left of center, so the human is, you know, steering just a little bit to the left. [00:54:33] This second line here is the output of the neural network, and initially the neural network doesn't know how to drive, so it's just outputting this white smear everywhere, as if to say, you know, I don't know, do I drive left, right, center? I don't know, so it puts this gray blur everywhere. [00:54:46] And as the algorithm learns, using the back propagation learning algorithm, or gradient descent, which you'll
learn [00:54:56] about this Wednesday, actually, you see that the neural network's output becomes less and less of this white smear, this white blur, and starts to become sharper and to mimic more accurately the human-selected driving direction. [00:55:16] So this, um, there's an example of supervised learning, because the human driver demonstrates inputs X and outputs Y: maybe, if you see this in front of the car, steer like that. So that's X and Y. [00:55:31] And after the learning algorithm has learned, you can then, well, he pushes a button and takes his hands off the steering wheel, and then it's using this neural network to drive itself, right: digitizing the image in front of the car, taking this image and passing it through the trained neural network, letting the neural network select the steering direction, and then using a little motor to turn the wheel.
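To make the training loop just described concrete, here is a minimal sketch of supervised learning with gradient descent. This is an illustration only, not ALVINN's actual network: a plain linear model fit by batch gradient descent, with the features and "steering" labels invented for the example.

```python
import numpy as np

# Toy supervised-learning setup: each row of X stands in for one input
# (in ALVINN's case, a digitized camera image), and y for the label
# (the steering direction the human chose). All values are invented.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # 100 examples, 3 features
true_w = np.array([0.5, -1.0, 2.0])            # the mapping to be learned
y = X @ true_w + 0.01 * rng.normal(size=100)   # labels with a little noise

# Batch gradient descent on mean squared error: repeatedly nudge the
# weights w downhill, so predictions sharpen toward the human's labels.
w = np.zeros(3)
lr = 0.1
for _ in range(500):
    grad = (2.0 / len(X)) * X.T @ (X @ w - y)  # gradient of the MSE
    w -= lr * grad

print(np.round(w, 2))  # close to true_w: the X -> y mapping was learned
```

The same recipe, with a neural network in place of the linear model and images in place of the toy features, is essentially what the video's training phase is doing.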
[00:56:02] This is a slightly more advanced version which has trained two separate models, one for, I think, two-lane roads and one for four-lane roads. So the second and third lines: this is for a two-lane road, this is for a four-lane road, [00:56:18] and the arbitrator is another algorithm that tries to decide whether the two-lane or the four-lane road model is the more appropriate one for a particular given situation. And so as ALVINN is driving, excuse me, on a one-lane road, or, it's driving from a one-lane road here toward an intersection, [00:56:51] the algorithm realizes it should just switch over from, I think, the one-lane road network to the two-lane road network, and on it goes. [00:57:18] Okay, oh, all right, fine, we'll just see the final dramatic moment as it switches from a one-lane road to a two-lane road. [00:57:40] All right, and I think, you know, so this is just using supervised learning to
take as input [00:57:44] what's in front of your car to decide on a steering direction. This is not how state-of-the-art self-driving cars are built today, but, you know, you can do some things in some limited contexts, and I think within several weeks you'll actually be able to build something that is more sophisticated than this. [00:58:05] Um, so after supervised learning, we will in this class spend a bit of time talking about machine learning strategy. You know, I think on the class syllabus we annotate this as learning theory, but what that means is, um, I want to give you the tools to go and apply learning algorithms effectively. And I think I've been fortunate, [00:58:30] over the years, to have constantly visited lots of great tech companies, more than the ones that I've been publicly associated with, right,
but often just as a friend. [00:58:45] I visit various tech companies, ones whose products I'm sure are installed on your cell phone, and, you know, talk to the machine learning teams and see what they're doing and see if I can help them out. And what I see is that there's a [00:58:59] huge difference in the effectiveness with which two different teams can apply the exact same learning algorithm. And I think what I've seen, sadly, is that sometimes there will be a team, even in [00:59:14] some of the best tech companies, right, the elite AI companies, in multiple of them, where you go talk to a team and they'll tell you about something they've been working on for six months, [00:59:25] and then you can quickly take a look at the data and see that the algorithm isn't quite working, and sometimes you can look at what they're doing and go, yeah,
you know, [00:59:36] I could have told you six months ago that this approach was never going to work, right? And what I find is that the most skilled machine learning practitioners are very strategic, by which I mean skilled at deciding what to do: when you work on a machine learning project, you know, you have a lot of decisions to make, right? [00:59:54] Do you collect more data? Do you try a different learning algorithm? Do you rent faster GPUs to train your learning algorithm for longer? Or, if you collect more data, what type of data do you collect? Or, among all of these architecture choices, neural networks, support vector machines, logistic regression, which one do you pick? [01:00:10] There are a lot of decisions you need to make when building these learning algorithms. So one thing that's quite unique to the way we teach this class is that we want to help you become more systematic in driving machine learning as a systematic [01:00:27] engineering discipline, so that when one
[01:00:29] day you work on a machine learning project, you can efficiently figure out what to do next. [01:00:34] And sometimes I make an analogy to how you do software engineering. You know, many years ago I had a friend who would debug code by compiling it, and then this friend would look at all these syntax errors, right, that the C++ compiler outputs, and they thought that the best way to eliminate the errors was to delete all the lines of code with syntax errors. [01:01:02] That was their first serious strategy, and that did not go well, right? It took me a while to persuade them to stop doing that. But, but, so, it turns out that when you run a learning algorithm, you know, it almost never works the first time; that's just life. And the way you go about debugging the learning algorithm will have a huge impact on your efficiency, on how quickly you can build effective learning systems.
And I think [01:01:27] until now, too much of this process of making your learning algorithms work well has been a black magic kind of process, where, you know, there's the decades-of-experience expert, so when you run something and don't know why it's not working, he looks at what you're doing and says, oh yeah, do that, and then, because he's so experienced, it works. [01:01:47] But I think, um, what we're trying to do with the discipline of machine learning is to evolve it from a black magic, tribal knowledge, experience-based thing to a systematic engineering process, right? [01:01:58] And so later this quarter, as we talk about machine learning strategy, or talk about learning theory, I'll try to give you tools on how to go about strategizing, so you can be very efficient in how you yourself, or a team you lead, can build an effective learning system, because I don't want you to be one of those people that, you know,
[01:02:20] to be one of those people that you know waste six months on some direction that [01:02:22] waste six months on some direction that maybe could have relatively quickly [01:02:25] maybe could have relatively quickly figured out what's not promising well [01:02:27] figured out what's not promising well maybe one loss analogy if you if you use [01:02:30] maybe one loss analogy if you if you use the optimizing code right making code [01:02:32] the optimizing code right making code run faster not tell me if you learn that [01:02:36] run faster not tell me if you learn that less experience software engineers will [01:02:39] less experience software engineers will just dive in and optimize the code to [01:02:41] just dive in and optimize the code to try to make it run faster right let's [01:02:42] try to make it run faster right let's take the C++ and code in the 70 or [01:02:44] take the C++ and code in the 70 or something but more experienced people [01:02:46] something but more experienced people will run the profiler to try to figure [01:02:48] will run the profiler to try to figure out what part of your code is actually [01:02:50] out what part of your code is actually the whole night and then just focus on [01:02:51] the whole night and then just focus on changing on that so one things hope to [01:02:54] changing on that so one things hope to do this quarter is convey to you some of [01:02:58] do this quarter is convey to you some of these more systemic engineering [01:02:59] these more systemic engineering principles [01:03:00] principles oh and actually this is a actually I've [01:03:05] oh and actually this is a actually I've been down I've been writing this up [01:03:08] been down I've been writing this up actually so how many of you have heard a [01:03:10] actually so how many of you have heard a machine there on in your name oh just a [01:03:12] machine there on in your name oh just a few of you interesting so actually - so [01:03:15] few of you 
Interesting! So actually, [01:03:15] if any of you are interested: just in my spare time I've been writing a book to try to codify systematic engineering principles for machine learning, and so if you want a, you know, free draft copy of the book, sign up for the mailing list here. I tend to just write stuff and put it on the internet for free, so if you want a free draft copy of the book, [01:03:43] you know, go to this website, enter your email address, and the website will send you a free copy of the book. We'll talk a little bit about these engineering principles as well, okay? All right. [01:03:55] So, first subject, supervised learning; second subject, learning theory; and the third [01:04:02] major subject we'll talk about is deep learning. And so, you know, there are a lot of tools in machine learning, and many of them are worth learning about, and I use many different tools in machine learning for many different applications. There's one
subset of [01:04:16] machine learning that's really hot right now, because it's just advancing very rapidly, which is deep learning, and so we'll spend a bit of time talking about deep learning so that you understand the basics of how to train a neural network as well. But I think that, um, whereas CS229 [01:04:32] covers a much broader set of algorithms, which are all useful, CS230 more narrowly covers just deep learning. [01:04:42] So other than deep learning, slash, after deep learning, that is, neural networks, the fourth of the five major topics we'll cover will be unsupervised learning. [01:05:06] So you saw me draw a picture like this just now, right? And this would be a classification problem, like the tumor malignant-or-benign problem. This is a classification problem, and that was a supervised learning problem, because you have to
learn the function mapping from X to Y. [01:05:25] Unsupervised learning would be if I give you a data set like this with no labels, so you're just given inputs X and no Y, and you're asked to find something interesting in this data, to figure out, you know, interesting structure in this data. [01:05:41] And in this data set it looks like there are two clusters, and an unsupervised learning algorithm which you'll learn about, called k-means clustering, will discover this structure in the data. [01:05:53] Other examples of unsupervised learning: you know, Google News is actually a very interesting website; sometimes I use it to look up, right, the latest news. This is an old example, but Google News every day [01:06:04] crawls, or reads, many, many thousands or tens of thousands of news articles on the Internet and groups them together. For example, there's a set of articles on the BP oil well spill, and it has taken
a lot of the articles written [01:06:22] by different reporters and grouped them together, so you can, you know, figure out what's going on with the BP Macondo oil well, right: this is a CNN article about the oil well spill, there's a Guardian article about the oil well spill. This is an example of a clustering algorithm, where it's taking these different news sources and figuring out that these are all stories, kind of, about the same thing. [01:06:50] Other examples of clustering, just taking data and figuring out what groups belong together: a lot of the work on genetic data. This is a visualization of genetic microarray data, where, given data like this, you [01:07:07] can group individuals into different types of individuals with different characteristics. Or clustering algorithms, grouping this type of data together, are used to organize computing clusters, you know, to figure out which machines' workloads are more related to each other, and to organize
communities probably so to take [01:07:26] organize communities probably so to take a social network like LinkedIn or [01:07:29] a social network like LinkedIn or Facebook or other social networks and [01:07:31] Facebook or other social networks and figure out which are the groups of [01:07:33] figure out which are the groups of friends on which are the cohesive [01:07:34] friends on which are the cohesive communities within a social network or [01:07:37] communities within a social network or market segmentation actually many [01:07:39] market segmentation actually many companies I've worked with look at the [01:07:41] companies I've worked with look at the customer database and cluster the users [01:07:43] customer database and cluster the users together so you can say that looks like [01:07:45] together so you can say that looks like where four types of users you know looks [01:07:47] where four types of users you know looks like that there are the young [01:07:50] like that there are the young professionals looking to develop [01:07:52] professionals looking to develop themselves they're the you know soccer [01:07:55] themselves they're the you know soccer moms and soccer dads that the discount [01:07:57] moms and soccer dads that the discount in this case who can then market to the [01:07:59] in this case who can then market to the different market segments separately and [01:08:02] different market segments separately and and actually many years ago my friend [01:08:04] and actually many years ago my friend Andrew Moore was using this type of data [01:08:08] Andrew Moore was using this type of data for astronomical data analysis group [01:08:10] for astronomical data analysis group together galaxies question Oh is almost [01:08:18] together galaxies question Oh is almost worse than the clustering knows not so [01:08:20] worse than the clustering knows not so as well as well as learning brought me [01:08:22] as well as well as learning brought me is the 
[01:08:23] So you have just X, and you find interesting things about it, right. So for example, actually, here, shoot, this won't work without audio; we'll do this later in the class, I guess, maybe I'll save it and do it later. [01:08:39] The cocktail party problem is another unsupervised learning problem. I'd need the audio for this to explain it, though; let me think how to explain this. [01:08:52] You know, for the cocktail party problem, I'll try to do the demo when we can get audio working on this laptop. It's a problem where, if you have a noisy room [01:09:01] and you stick multiple microphones in the room, they record overlapping voices, so there are no labels: there's just an array of multiple microphones in a room with lots of people talking, and how can you have the algorithm separate out the people's voices? So that's an unsupervised learning problem, because [01:09:18] there are no labels; you just stick
microphones in the room and have them [01:09:22] record different people's voices, the voices of multiple people talking at the same time, and then you have the algorithm try to separate out people's voices. And one of the problem set exercises you'll do later is: if we have, you [01:09:33] know, five people talking, so each microphone records five people's overlapping voices, right, because each microphone hears five people at the same time, how can you have an algorithm separate out these voices so you can get clean recordings of just one voice at a time? [01:09:50] So that's called the cocktail party problem, and the algorithm you use to do this is called ICA, independent components analysis, and that's something you'll implement in one of the later homework exercises. [01:10:02] And there are other examples of unsupervised learning as well: the Internet has tons of unlabeled text data that you can just pull down.
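As a hedged sketch of the source-separation idea (not necessarily the exact formulation used in the homework): a minimal deflationary FastICA-style iteration in NumPy, with two synthetic "voices" and a mixing matrix invented for the example.

```python
import numpy as np

# Two synthetic "voices" (non-Gaussian sources), stand-ins for speech.
t = np.linspace(0, 8, 2000)
s1 = np.sign(np.sin(3 * t))            # square-wave "voice"
s2 = np.sin(5 * t)                     # sine-wave "voice"
S = np.c_[s1, s2]

# Each microphone hears a different mixture of the two voices.
A = np.array([[1.0, 0.5],
              [0.4, 1.0]])             # mixing matrix, unknown to the algorithm
X = S @ A.T                            # the two microphone recordings

# Whiten the recordings: zero mean, identity covariance.
X = X - X.mean(axis=0)
eigval, eigvec = np.linalg.eigh(np.cov(X, rowvar=False))
Xw = X @ eigvec @ np.diag(eigval ** -0.5) @ eigvec.T

# FastICA-style fixed-point iteration with a tanh nonlinearity:
# each row of W converges to one independent component's direction.
W = np.eye(2)
for _ in range(200):
    for i in range(2):
        g = np.tanh(Xw @ W[i])
        w_new = (Xw * g[:, None]).mean(axis=0) - (1 - g ** 2).mean() * W[i]
        for j in range(i):             # deflation: stay orthogonal to earlier rows
            w_new -= (w_new @ W[j]) * W[j]
        W[i] = w_new / np.linalg.norm(w_new)

# Columns approximate the original voices, up to sign and ordering.
recovered = Xw @ W.T
```

No labels are used anywhere: the algorithm recovers the individual voices purely from the statistical structure (independence, non-Gaussianity) of the mixed recordings.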
[01:10:10] down data from the internet there no labels necessarily but can you learn [01:10:13] labels necessarily but can you learn interesting things about language figure [01:10:15] interesting things about language figure out what figure out I don't know one of [01:10:17] out what figure out I don't know one of the best cited results recently was [01:10:20] the best cited results recently was learning analogies like you know man is [01:10:22] learning analogies like you know man is the woman as king of the Queen right or [01:10:26] the woman as king of the Queen right or what Tokyo mister Japan as Washington [01:10:30] what Tokyo mister Japan as Washington DC's the United States right to learn [01:10:32] DC's the United States right to learn and energies like that some say you can [01:10:34] and energies like that some say you can learn analogies like that from unlabeled [01:10:36] learn analogies like that from unlabeled data just from text on the internet so [01:10:37] data just from text on the internet so there's also unsupervised learning okay [01:10:40] there's also unsupervised learning okay um so after on sooo eyes learning oh and [01:10:46] um so after on sooo eyes learning oh and I'm surprised earning so you know [01:10:47] I'm surprised earning so you know machine learning is very useful today [01:10:49] machine learning is very useful today turns out that most of the recent wave [01:10:52] turns out that most of the recent wave of economic value created by machine [01:10:55] of economic value created by machine learning is through supervised learning [01:10:56] learning is through supervised learning but there are important use cases for [01:10:59] but there are important use cases for unsupervised learning as well so I use [01:11:01] unsupervised learning as well so I use them in my work occasionally and there's [01:11:04] them in my work occasionally and there's also beating edge for a lot of exciting [01:11:06] also beating edge for a lot of 
[01:11:07] And then the final topic, the fifth of the five topics we cover (so we'll talk about supervised learning, machine learning strategy, deep learning, unsupervised learning), the fifth one is reinforcement learning. Which is, um, let's say I give you the keys to this Stanford autonomous helicopter. This helicopter is actually sitting in my office; I'm trying to figure out how to get rid of it. And I ask you to write a program to make it fly, right? So how do you do that? [01:11:32] So this is a video of a helicopter flying. The audio is just a lot of helicopter noise, so that's not important, but as we zoom out the video you can see it's upside down in the sky, right? Yeah, that's kind of cool; I was the cameraman that day. But so you can use learning algorithms to get, you know, robots to do pretty interesting things like this, and it turns out that a good way to do this is through reinforcement learning.
[01:12:05] So what's reinforcement learning? Um, it turns out that no one knows the optimal way to fly a helicopter, right? If you fly a helicopter, you have two control sticks that you're moving, but no one knows the optimal way to move the control sticks so that the helicopter flies itself. So what you do is let the helicopter do whatever it wants; it's a bit like training a dog. You can't teach a dog the optimal way to behave. Actually, how many of you have had a pet dog or a pet cat before? It's fascinating. [01:12:36] Okay, so I had a pet dog when I was a kid, and my family made it my job to train the dog. So how do you train a dog? You let the dog do whatever it wants, and then whenever it behaves well you go, "Oh, good dog," and when it misbehaves you go, "Bad dog." And then over time the dog learns to do more of the good-dog things and fewer of the bad-dog things.
[01:12:58] And so reinforcement learning is a bit like that, right? I don't know the optimal way to fly a helicopter, so you let the helicopter do whatever it wants, and then whenever it flies well, you know, whenever it hovers or flies accurately without getting blown around too much, you go, "Oh, good helicopter," and when it crashes you go, "Bad helicopter." And it's the job of the reinforcement learning algorithm to figure out how to control it over time so as to get more of the good-helicopter things and fewer of the bad-helicopter things. [01:13:29] Um, and I think, well, just one more video. All right. And so again, given a robot like this, I actually don't know how to program a robot like this; it has all of these joints, right? So how do you get a robot like this to climb over obstacles? Well, this is actually a robot dog, so you can actually say "good dog" and "bad dog."
by [01:13:56] you can actually say good dog dog but by giving those signals called a reward [01:13:58] giving those signals called a reward signal you can have a learning algorithm [01:14:01] signal you can have a learning algorithm figure out by itself how's the optimize [01:14:03] figure out by itself how's the optimize the reward therefore climb over these [01:14:07] the reward therefore climb over these types of obstacles and I think recently [01:14:10] types of obstacles and I think recently the most famous application is a very [01:14:12] the most famous application is a very for student learning happened for game [01:14:14] for student learning happened for game playing playing Atari games or playing a [01:14:16] playing playing Atari games or playing a game of gold can alphago I think that uh [01:14:20] game of gold can alphago I think that uh IIIi I think that uh game play has made [01:14:23] IIIi I think that uh game play has made for some remarkable stunts a remarkable [01:14:26] for some remarkable stunts a remarkable PR but I'm also equally excited or maybe [01:14:29] PR but I'm also equally excited or maybe even more excited about the integrals [01:14:31] even more excited about the integrals their reinforcement or anything is [01:14:33] their reinforcement or anything is making it's a robotics applications so I [01:14:35] making it's a robotics applications so I think I think yeah reinforcement has [01:14:38] think I think yeah reinforcement has been proven to be fantastic for playing [01:14:40] been proven to be fantastic for playing games is also getting making real [01:14:42] games is also getting making real traction in optimizing robots and [01:14:45] traction in optimizing robots and optimizing logistic system things like [01:14:48] optimizing logistic system things like that so you learn about all these things [01:14:53] that so you learn about all these things last thing for today I hope that you [01:14:56] last thing for today I hope that 
[01:14:56] Last thing for today: I hope that you will start to talk to people in the class, to make friends, find project partners, and form study groups. And if you have any questions, you know, log on to Piazza, ask your questions, and help others answer their questions. So let's break for today, and I look forward to seeing you on Wednesday.

================================================================================
LECTURE 002
================================================================================
Stanford CS229: Machine Learning - Linear Regression and Gradient Descent | Lecture 2 (Autumn 2018)
Source: https://www.youtube.com/watch?v=4b4MUYve_U8
---
Transcript

[00:00:03] Morning, and welcome back. So what we'll see today in class is the first in-depth discussion of a learning algorithm: linear regression. In particular, over the next hour and a bit you'll see linear regression, batch and stochastic gradient descent, which is an algorithm for fitting linear regression models, and then the normal equations, which is a very efficient way to fit linear models. And we're going to define notation and a few concepts
today that will lay the foundation for a lot of the work that we'll see the rest of this quarter. [00:00:48] So, to motivate linear regression, which is maybe one of the simplest learning algorithms: you remember the ALVINN video, the autonomous driving video that I showed in class on Monday? That self-driving car video was a supervised learning problem, and the term supervised learning meant that you were given inputs X, which was a picture of what's in front of the car, and the algorithm had to map that to an output Y, which was the steering direction. And that was a regression problem, because the output Y that you want is a continuous value, as opposed to a classification problem, where Y is discrete. We'll talk about classification next Monday; today it's supervised learning, regression. So I think the simplest, maybe the simplest
possible learning algorithm for a supervised learning regression problem is linear regression. And to motivate that, rather than using the self-driving car example, which is quite complicated, we'll build up a supervised learning algorithm using a simpler example. [00:01:53] So let's say you want to predict, or estimate, the prices of houses. The way you'd build a learning algorithm is to start by collecting a data set of houses and their prices. This is a data set that we collected off Craigslist a little while back; this is data from Portland, Oregon. So there's the size of a house in square feet, and there's the price of the house in thousands of dollars. So there's a house that is 2,104 square feet whose asking price was $400,000, a house with that size and that price, and so on. [00:02:44] Okay, and maybe more conventionally, if you plot this data, with the size there and the price there, you see some data set
like that. And what we'll end up doing today is fit a straight line to this data, and we'll go through how to do that. [00:03:02] So in supervised learning, the process of supervised learning is that you have a training set, such as the data set that I drew on the left, and you feed this to a learning algorithm, and the job of the learning algorithm is to output a function that makes predictions about housing prices. By convention, I'm going to call this function that it outputs a hypothesis. And the job of the hypothesis is, you know, it can take as input the size of a new house, the size of a different house that you haven't seen yet, and it will output the estimated price. [00:03:53] Okay, so the job of the learning algorithm is to take as input a training set and output a hypothesis, and the job of the hypothesis is to take as input any size of a house and try to tell you what it thinks
should be the price of that house. [00:04:08] Now, when designing a learning algorithm, and, you know, even though linear regression is something you may have seen in a linear algebra class or some other class before, the way you go about structuring a machine learning algorithm is important. The design choices of, you know, what is the workflow, what is the data set, what does the hypothesis represent: these are the key decisions you have to make in pretty much every supervised learning, every machine learning algorithm's design. So as we go through linear regression, I'll try to describe the concepts clearly, because they'll lay the foundation for the rest of the algorithms, sometimes much more complicated ones, that you'll see later this quarter. [00:04:45] So when designing a learning algorithm, the first thing we'll need to ask is: how do you represent the hypothesis? And in linear regression,
for the purpose of this lecture, we're going to say that the hypothesis is going to take as input the size x and output the estimated price as a linear function of x: h(x) = θ0 + θ1·x. [00:05:17] And then the mathematicians in the room will say, technically that isn't a linear function, it's an affine function, because of the θ0 term. You know, in machine learning we sometimes just call this a linear function, but technically it's an affine function; it doesn't really matter. [00:05:33] So, more generally: in this example we have just one input feature x. More generally, if you have multiple input features, if you have more data, more information about these houses, such as the number of bedrooms (excuse my handwriting; okay, that word is "bedrooms")... I guess my father-in-law lives a little bit outside Portland, and he's actually really into real estate, so this is actually a real data set from Portland. So, more generally, if you know
the size as well as the number of bedrooms of these houses, then you may have two input features, where x1 is the size and x2 is the number of bedrooms. [00:06:26] I'm using the pound sign, #bedrooms, to denote the number of bedrooms. And you might estimate the price of a house as h(x) = θ0 + θ1·x1 + θ2·x2, where x1 is the size of the house and x2 is the number of bedrooms, okay? [00:06:57] So, in order to simplify the notation, in order to make that notation a little bit more compact, I'm also going to introduce this other notation, where we write the hypothesis as a sum from j = 0 to 2 of θj·xj, where, for conciseness, we define x0 to be equal to 1, okay? See, if you define x0 to be a dummy feature that always takes on the value 1, then you can write the hypothesis h(x) this way: the sum from j = 0 to 2
of just θj·xj. It's the same as the equation that you saw at the upper right. [00:07:56] And so here θ becomes a three-dimensional parameter vector, θ0, θ1, θ2, with the index starting from 0, and the features become a three-dimensional feature vector, x0, x1, x2, where x0 is always 1, x1 is the size of the house, and x2 is the number of bedrooms of the house. [00:08:22] So, to introduce a bit more terminology: θ is called the parameters of the learning algorithm, and the job of the learning algorithm is to choose parameters θ that allow you to make good predictions about the prices of houses, right? [00:08:44] And just to lay out some more notation that we're going to use throughout this quarter, I'm going to use, as a standard, m to denote the number of training examples. So m is going to be the number of rows in the table above.
Each house in your training set, each row, is one training example. [00:09:18] You've already seen me use x to denote the inputs, and often the inputs I'll call features. You know, I think as an emerging discipline grows up, notation kind of emerges depending on what different scientists used the first time they wrote a paper. So, you know, the fact that we call these things hypotheses, frankly, I don't think that's a great name, but I think someone many decades ago wrote a few papers calling it a hypothesis, and then others followed, and we kind of got stuck with some of this terminology. But x is what's called the input features, sometimes the input attributes, and y is the output, right? And sometimes we call this the target variable. [00:10:07] And so (x, y) is one training example, [00:10:18] and I'm going to use this notation, x superscript i comma y superscript i in parentheses, written (x^(i), y^(i)), to denote the i-th training example, okay? So the superscript
in parentheses, i, that's not exponentiation. This notation (x^(i), y^(i)) is just a way of writing an index into the table of training examples above. [00:10:54] So, for example, if the first training example is a house of size 2104, then x^(1)_1 would be equal to 2104, right, because this is the size of the first house in the training set. And x^(2)_1, feature one of the second example, would be 1416 in our example. So the superscript in parentheses is just the index into the different training examples, where i runs from 1 through m, the number of training examples you have. [00:11:34] And then one last bit of notation: I'm going to use n to denote the number of features you have for the supervised learning problem. So in this example, n is equal to 2, because we have two features: the size of the house and the number of bedrooms.
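[Editor's note: in code, that table and the indexing notation map onto one array row per example. The sizes 2104 and 1416 come from the lecture's table; the bedroom counts below are placeholders I've invented, since only the sizes were read out.]

```python
import numpy as np

# One row per training example: [size in sq ft, #bedrooms].
# Sizes 2104 and 1416 are from the lecture; bedroom counts are made up.
X = np.array([[2104, 3],
              [1416, 2]])
m, n = X.shape          # m training examples, n features

# x^(i)_j in the lecture's notation is X[i-1, j-1] with 0-indexed arrays:
x_1_1 = X[0, 0]         # x^(1)_1, size of the first house  -> 2104
x_2_1 = X[1, 0]         # x^(2)_1, size of the second house -> 1416
print(m, n, x_1_1, x_2_1)
```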
[00:11:56] have two features which is the size the house and the number of bedrooms so two [00:11:58] house and the number of bedrooms so two features which is why you can take this [00:12:02] features which is why you can take this write and write this as a sum from J [00:12:08] write and write this as a sum from J equals 0 to n and so here X and theta [00:12:16] equals 0 to n and so here X and theta are n plus 1 dimensional because we [00:12:18] are n plus 1 dimensional because we added the extra X 0 and theta 0 ok so so [00:12:24] added the extra X 0 and theta 0 ok so so if you have two features then these are [00:12:26] if you have two features then these are three dimensional vectors and more [00:12:28] three dimensional vectors and more generally if you have n features you end [00:12:30] generally if you have n features you end up with X and theta being n plus 1 [00:12:33] up with X and theta being n plus 1 dimensional features all right and you [00:12:37] dimensional features all right and you know you see this notation multiple [00:12:40] know you see this notation multiple times in multiple algorithms throughout [00:12:41] times in multiple algorithms throughout this quarter so if you you know don't [00:12:44] this quarter so if you you know don't manage to memorize all these symbols [00:12:45] manage to memorize all these symbols right now don't worry about it you see [00:12:47] right now don't worry about it you see them over and over and over come [00:12:48] them over and over and over come familiar alright so um given the data [00:12:53] familiar alright so um given the data set and given that this is the way you [00:12:56] set and given that this is the way you define the hypothesis how do you choose [00:12:59] define the hypothesis how do you choose the parameters right so you're the [00:13:01] the parameters right so you're the learning algorithms job is to choose [00:13:02] learning algorithms job is to choose values for parameters theta so that it 
[00:12:53] All right, so, given the data set, and given that this is the way you define the hypothesis, how do you choose the parameters? The learning algorithm's job is to choose values for the parameters θ so that it can output a hypothesis. So how do you choose the parameters θ? Well, what we'll do is choose θ such that h(x) is close to y for the training examples. [00:13:38] And I think the final bit of notation: I've been writing h(x) as a function of the features of the house, as a function of the size and the number of bedrooms of the house. Sometimes, to emphasize that h depends both on the parameters θ and on the input features x, I'm going to write h subscript θ of x, h_θ(x), to emphasize that the hypothesis depends both on the parameters and on, you know, the input features x, right? But sometimes, for notational convenience, I'll just write this as h(x). Sometimes I include the θ there; they mean the same thing, it's just an abbreviation in notation. [00:14:19] But so, in order to learn a set of parameters, what we'll want to do is choose the parameters θ so that, at
least for the houses whose prices you know, the learning algorithm outputs prices that are close to what you know were the correct prices, the asking prices, for that set of houses. [00:14:44] And so, more formally, in the linear regression algorithm, also called ordinary least squares, we will want to minimize (I'm going to build out this equation one piece at a time, okay?) the squared difference between what the hypothesis outputs, h_θ(x), and y: (h_θ(x) - y)^2. So let's say we want to minimize the squared difference between the prediction, which is h(x), and y, which is the correct price. And so what we want to do is choose values of θ that minimize that. [00:15:32] To fill this out: you have m training examples, so I'm going to sum, from i = 1 through m, that squared difference. So this is the sum
over i = 1 through, let's say, the 50 examples you have, of the squared difference between what your algorithm predicts and what the true price of the house is. And then finally, by convention, we put a one-half constant in front, because when we take derivatives to minimize this later, the 1/2 will make some of the math a little bit simpler. Adding a 1/2 and minimizing that formula gives you the same answer as minimizing without it, but we often put the 1/2 there since it makes the math a little simpler later. Okay. [00:16:18] And so, in linear regression, I'm going to define the cost function

J(θ) = (1/2) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²

and we'll find parameters θ that minimize the cost function J(θ). Okay. And a question I've often gotten is: why squared error? Why not absolute error, or this error to the power of 4?
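As a quick illustrative sketch (my own code, not from the lecture; the function name `cost` and the toy data are assumptions), the cost function J(θ) just defined can be computed like this:

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = 1/2 * sum_i (h_theta(x_i) - y_i)^2.

    X is an (m, n+1) design matrix whose first column is all ones
    (the x_0 = 1 intercept feature); y holds the m correct prices.
    """
    residuals = X @ theta - y           # h_theta(x_i) - y_i for every example
    return 0.5 * np.sum(residuals ** 2)

# Two toy training examples lying exactly on the line y = 2x:
X = np.array([[1.0, 1.0],
              [1.0, 2.0]])
y = np.array([2.0, 4.0])
print(cost(np.array([0.0, 2.0]), X, y))   # 0.0  (perfect fit)
print(cost(np.array([0.0, 0.0]), X, y))   # 10.0 (= 1/2 * (2^2 + 4^2))
```

Minimizing this quantity over θ is exactly the problem gradient descent will solve in a moment.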
We'll talk more about that when we talk about a generalization of linear regression, generalized linear models, which we should do next week. You'll see that linear regression is a special case of a bigger family of algorithms called generalized linear models, and that using squared error corresponds to a Gaussian; that will justify a little bit more why squared error, rather than absolute error or error to the power of 4. More on that next week. So let me just check and see if there are any questions. [00:17:29] Okay, cool. All right, so next, let's see how you can implement an algorithm to find a value of θ that minimizes J(θ), that minimizes the cost function J(θ). We're going to use an algorithm called gradient descent. [00:18:12] And so with gradient descent, we are going to start with some value of θ, and it
could be, you know, θ equals the vector of all zeros; that would be a reasonable default. We could also initialize randomly; it doesn't really matter. But θ is this three-dimensional vector, and I'm writing 0 with an arrow on top to denote the vector of all zeros: there's a zero, zero, zero everywhere. Right, so: start with some initial value of θ, and we're going to keep changing θ to reduce J(θ). Okay. [00:19:28] But let me show you a visualization of gradient descent first, and then we'll write out all the math. So, all right, let's say you want to minimize some function J(θ), and it's important to get the axes right in this diagram: in this diagram, the horizontal axes are θ_0 and θ_1, and what you want to do is find values for θ_0 and θ_1. In our example there's really θ_0, θ_1 and θ
2, but since that's three-dimensional I can't plot it, so I'm just using θ_0 and θ_1. But what you want to do is find values of θ_0 and θ_1 that minimize the height of the surface J(θ), so maybe this looks like a pretty good point, or something. Okay. And so in gradient descent, you know, you start off at some point on this surface, and you do that by initializing θ_0 and θ_1 either randomly or to the value of all zeros or something; it doesn't matter too much. And what you do is imagine you are standing on this little hill, standing at that point, at that little cross. What you do in gradient descent is turn around, turn all 360 degrees, and look around you, and see: if you were to take a tiny little step, you know, a tiny
little baby step, in what direction should you take that little step to go downhill as fast as possible? Because you're trying to go downhill, which is to go to the lowest possible elevation, the lowest possible point of J(θ). So what gradient descent will do is stand at that point, look all around you, and say: well, what direction should I take a little step in, to go down as quickly as possible? Because you want to minimize J(θ); you want to reduce the value of J(θ); you want to go to the lowest possible elevation on this surface. And so gradient descent will take that little baby step, right, and then repeat. Now you're a little bit lower on the surface, so you can take a look all around you and say: oh, looks like that little direction is the direction of the steepest gradient downhill. So you take another little step, take
another step, and so on, until you get to, hopefully, a local optimum. Now, one property of gradient descent is that, depending on where you initialize the parameters, you can get to different local optima, different points. Right, so previously we had started at that little point ×, but imagine you had started just a few steps over to the right, at that new ×. If you run gradient descent from that new point, then that would be the first step, there the second step, and so on, and you would have gotten to a different local optimum, a different local minimum. Okay. It turns out that when you run gradient descent on linear regression, there will not be local optima; we'll talk about that in a little bit. Okay, so let's formalize the gradient descent algorithm. [00:23:03] Each step of gradient descent is
implemented as follows. So remember, in this example the training set is fixed, right? You've collected the dataset of housing prices from Portland, Oregon, so it just sits there in your computer's memory, and so the cost function J is a fixed function of the parameters θ. The only thing you're going to do is tweak or modify the parameters θ. One step of gradient descent can be implemented as follows: we'll say that θ_j gets updated as... let me just write this out, with a bit more notation. I'm going to use :=, and let me use this notation to denote assignment. What this means is we're going to take the value on the right and assign it to the variable on the left. Right, so in other words, in the notation we'll use, a := a + 1 means increment the value of a by
one, whereas, you know, a = b: if I write a = b, I'm asserting a statement of fact; I'm asserting that the value of a is equal to the value of b. And hopefully I won't ever write a = a + 1, because that's never true. All right. So in each step of gradient descent, for each value of j (you're going to do this for j = 0, 1, up to n, where n is the number of features), you take θ_j and update it according to

θ_j := θ_j − α · (∂/∂θ_j) J(θ)

where α is called the learning rate, and this formula is the partial derivative of the cost function J(θ) with respect to the parameter θ_j. Okay. And this is partial derivative notation, for those of you who, I know, haven't seen calculus for a while, or haven't seen some of the prerequisites for a while. We'll go
over some more of this in a little bit greater detail in discussion section, but I'll do this quickly now. [00:25:46] If you took a calculus class a while back, you may remember that the derivative of a function defines the direction of steepest descent; it defines the direction that allows you to go downhill as steeply as possible on the hill. And the question was: how do you determine the learning rate? Let me get back to that; it's a good question. For now, you know, there's the theory and there's the practice, and in practice you set it to 0.01. Let me say a bit more about that later, but if you actually scale all the features to between zero and one, or minus one and plus one, or something like that, then you could try a few values and see what lets you minimize the function best. If the features are scaled to plus or minus one, I usually start with 0.01
and try increasing and decreasing it; I'll say a little more about that later. All right, cool. [00:26:56] So let me just quickly show how the derivative calculation is done. You know, I'm going to do a few more equations in this lecture, but all of these definitions and derivations are written out in full detail in the lecture notes posted on the course website. So sometimes I'll do more math in class, when we want you to see the steps of the derivation, and sometimes, to save time in class, we'll gloss over the mathematical details and leave you to read over the full details in the lecture notes on the course website. So: the partial derivative with respect to θ_j of J(θ), that's the partial derivative with respect to θ_j of (1/2)(h_θ(x) − y)². And I'm going to do a slightly simpler version, assuming we have just one training
example. Right, the actual definition of J(θ) has a sum over i from 1 to m, over all the training examples; I'm just forgetting that sum for now. So if you have only one training example: from calculus, if you take the derivative of a square, you know, the 2 comes down, and that cancels out with the half. So: 2 times 1/2 times the thing inside, right, and then, by the chain rule of derivatives, that's times the partial derivative of (h_θ(x) − y). So if you take the derivative of a square, the 2 comes down, and then you take the derivative of what's inside and multiply by that, right. And so the 2 and the 1/2 cancel out, so this leaves you with (h_θ(x) − y) times the partial derivative with respect to θ_j of θ_0 x_0 + θ_1 x_1 + ⋯ + θ_n x_n − y, where I just took the definition of h_θ(x) and expanded it out to that sum, because h_θ(x)
is just equal to that sum. So if you look at the partial derivative of each of these terms with respect to θ_j, the partial derivative of every one of these terms with respect to θ_j is going to be 0, except for the term corresponding to j. Because, you know, if j were equal to 1, say, then this term θ_0 x_0 doesn't depend on θ_1; this term, this term, all of them do not depend on θ_1. The only term that depends on θ_1 is the term θ_1 x_1 over there, and the partial derivative of that term with respect to θ_1 would be just x_1. And so when you take the partial derivative of this big sum with respect to θ_j in general, not just j = 1, the only term that even depends on θ_j is the term θ_j x_j, and so the partial derivatives of all the other terms end up being zero, and the partial derivative of this
term with respect to θ_j is equal to x_j. Okay. And so this ends up being (h_θ(x) − y) · x_j. Okay, and again, if you haven't played with calculus for a while, if you don't quite remember what a partial derivative is or don't quite get what I just said, don't worry too much about it; we'll go over it a bit more in section, and then also read through the lecture notes, which go over this in more detail, and more slowly, than we might do in class. [00:31:01] So, plugging this in: let's see, we've just calculated that this partial derivative is equal to that, and so plugging it back into that formula, one step of gradient descent is the following: we will let θ_j be updated according to θ_j := θ_j − α · (h_θ(x) − y) · x_j. Okay. Now I'm just going to add a few more things to this equation.
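To sanity-check the chain-rule result above, here is a small sketch (my own code; the names and numbers are made up) that compares the closed-form single-example gradient (h_θ(x) − y) · x_j against a numerical finite-difference estimate:

```python
import numpy as np

def analytic_grad(theta, x, y):
    """For one example, dJ/dtheta_j = (h_theta(x) - y) * x_j."""
    return (x @ theta - y) * x

def numerical_grad(theta, x, y, eps=1e-6):
    """Centered finite differences on J(theta) = 1/2 (h_theta(x) - y)^2."""
    J = lambda t: 0.5 * (x @ t - y) ** 2
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (J(theta + step) - J(theta - step)) / (2 * eps)
    return grad

x = np.array([1.0, 2.0, 3.0])        # x_0 = 1 is the intercept feature
y = 5.0
theta = np.array([0.5, -1.0, 2.0])   # h_theta(x) = 4.5, so h - y = -0.5
print(analytic_grad(theta, x, y))    # [-0.5 -1.  -1.5]
print(numerical_grad(theta, x, y))   # agrees to about 1e-6
```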
I did this for one training example; I kind of used a definition of the cost function J(θ) defined using just one single training example. But you actually have m training examples, and so the correct formula for the derivative is actually this thing summed over all m training examples, because the derivative of a sum is the sum of the derivatives. So if you redo this derivation, summing with the correct definition of J(θ), which sums over all m training examples, you end up with a sum over i = 1 through m of that, where, remember, x^(i) is the i-th training example's input features, and y^(i) is the target label, the price, in the i-th training example. And so this is the actual correct formula for the partial derivative of the cost function J(θ) with respect to θ_j when it's defined using all of the training examples.
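Putting the full m-example update together, a minimal batch gradient descent sketch might look like this (my own illustration; the function name, data, and hyperparameters are assumptions, not from the lecture):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.05, iters=2000):
    """Repeat (here for a fixed number of iterations): update every
    theta_j simultaneously by theta_j := theta_j - alpha * dJ/dtheta_j,
    where the gradient sums over all m training examples."""
    theta = np.zeros(X.shape[1])        # initialize to the zero vector
    for _ in range(iters):
        grad = X.T @ (X @ theta - y)    # vectorized sum_i (h(x_i) - y_i) x_i
        theta = theta - alpha * grad
    return theta

# Noiseless data generated from y = 1 + 2x, so gradient descent
# should recover theta close to (1, 2).
x1 = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x1), x1])   # prepend x_0 = 1
y = 1.0 + 2.0 * x1
print(gradient_descent(X, y))   # approximately [1. 2.]
```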
And so the gradient descent algorithm is to repeat until convergence, carrying out this update, and in each iteration of gradient descent you do this update for j = 0, 1, up to n, where n is the number of features; n was 2 in our example. Okay. And if you do this, then, as I'll show you in the animation, hopefully you find a pretty good value of the parameters θ. [00:33:56] So it turns out that when you plot the cost function J(θ) for a linear regression model, unlike the earlier diagram I'd shown, which had local optima, if J(θ) is defined the way we just defined it for linear regression, as this sum of squared terms, then J(θ) turns out to be a quadratic function, the sum of these squared terms.
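That sum-of-squares structure can be checked numerically: writing J(θ) = (1/2)‖Xθ − y‖², the matrix of second derivatives is XᵀX, which is positive semidefinite for any design matrix X, so the surface has no spurious local optima. A quick check (my own sketch, not from the lecture):

```python
import numpy as np

# For J(theta) = 1/2 ||X theta - y||^2 the Hessian is X^T X.
# Its eigenvalues are always >= 0, which is why the cost surface
# is bowl-shaped, with no local optima besides the global one.
rng = np.random.default_rng(0)
X = rng.normal(size=(49, 3))            # 49 random examples, 3 features
eigenvalues = np.linalg.eigvalsh(X.T @ X)
print(eigenvalues)
print(bool(np.all(eigenvalues >= -1e-9)))   # True, for any X you try
```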
so J(θ) will always look like a big bowl, like this, and so J(θ) does not have local optima; or rather, the only local optimum is also the global optimum. The other way to look at a function like this is to look at the contours of the plot: you get the contours by taking the big bowl and cutting horizontal slices, and plotting where the edges of those horizontal slices fall. So the contours of a big bowl, or more formally of this quadratic function, will be ellipses like these, these ovals. And so if you run gradient descent on this algorithm, let's say I initialize my parameters at that little × shown over here. Usually you'd initialize to, say, the vector of all zeros, but it doesn't matter too much, so let's
initialize over there. Then with one step of gradient descent the algorithm will take that step downhill, and then with the second step it will take that step downhill. Oh, and by the way, fun fact: if you think about the contours of a function, it turns out that the direction of steepest descent is always at ninety degrees, always orthogonal, to the contour direction; I seem to remember that from high school or something, I think. All right. And so as you take steps downhill, because there's only one global minimum, this algorithm will eventually converge to it. And so, the question just now about the choice of the learning rate α: if you set α to be very, very large, to be too large, then it can overshoot, right; the steps you take can be too large and you can run past the minimum. If you set it to be too small, then you need a lot of iterations and it will be slow. And
[00:36:24] So what happens in practice is usually you try a few values and see what value of the learning rate allows you to most efficiently, you know, drive down the value of J of theta. And if you see J of theta increasing rather than decreasing — if you see the cost function increasing rather than decreasing — then that's a very strong sign that the learning rate is too large. [00:36:50] And so actually, what I often do is try out multiple values of the learning rate alpha, and usually try them on an exponential scale: so try 0.01, 0.02, 0.04, 0.08 — kind of a doubling scale, or a tripling scale — and try a few values and see what value allows you to drive down the cost function fastest. [00:37:14] So I just want to visualize this in one other way, which is with the data. So this is the actual dataset; there are actually 49 points in the dataset.
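The learning-rate sweep described here can be sketched in a few lines. This is a minimal illustration, not code from the lecture: the five-point dataset is made up, and `batch_gd` is a hypothetical helper implementing the batch gradient-descent update on J(theta) = 1/2 * sum of squared errors.

```python
import numpy as np

# Made-up stand-in for the housing data (NOT the actual 49-point dataset
# from the slides): column of ones for the intercept, plus the house size.
X = np.c_[np.ones(5), np.array([1.0, 2.0, 3.0, 4.0, 5.0])]
y = np.array([1.5, 2.1, 2.9, 4.2, 5.1])   # prices

def J(theta):
    """Cost: one half the sum of squared errors."""
    r = X @ theta - y
    return 0.5 * r @ r

def batch_gd(alpha, steps=50):
    """Run batch gradient descent, recording J(theta) after every step."""
    theta = np.zeros(2)                      # initialize theta_0 = theta_1 = 0
    costs = [J(theta)]
    for _ in range(steps):
        grad = X.T @ (X @ theta - y)         # derivative sums over all m examples
        theta -= alpha * grad                # subtract: step downhill
        costs.append(J(theta))
    return costs

# Try alphas on a doubling scale and watch whether J(theta) goes down.
for alpha in [0.01, 0.02, 0.04, 0.08]:
    costs = batch_gd(alpha)
    trend = "decreasing" if costs[-1] < costs[0] else "INCREASING -- alpha too large"
    print(f"alpha={alpha}: J went {costs[0]:.2f} -> {costs[-1]:.2f} ({trend})")
```

On this toy problem the smaller alphas drive J down while the larger ones overshoot and make J blow up — exactly the diagnostic described in the lecture.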
[00:37:26] So m, the number of training examples, is 49. And so if you initialize the parameters to 0, that means initializing your hypothesis — initializing your straight-line fit to the data — to be that horizontal line, right? So if you initialize theta 0 equals 0, theta 1 equals 0, then your hypothesis is, you know, for any input size of house, the estimated price is 0. And so your hypothesis starts off as the horizontal line there: whatever the input x, the output y is 0. [00:38:02] And what you're doing as you run gradient descent is you're changing the parameters theta, right? So the parameters went from this value, to this value, to this value, and so on. And so the other way of visualizing gradient descent is: if gradient descent starts off with this hypothesis, then with each iteration of gradient descent you are trying to find different
[00:38:26] values of the parameters theta that allow the straight line to fit the data better. So after one iteration of gradient descent, this is the new hypothesis: you now have different values of theta 0 and theta 1 that fit the data a little bit better. After two iterations you end up with that hypothesis. And with each iteration, gradient descent is trying to minimize J of theta — trying to minimize one half of the sum of squared errors of the hypothesis's predictions on the different examples, right? Well, three iterations of gradient descent, four iterations, and so on, and then a bunch more iterations, and eventually it converges to that hypothesis, which is a pretty decent straight-line fit to the data. Okay — so, question? [00:39:30] Oh, sure, let me just repeat the question: why are you subtracting alpha times the gradient rather than adding alpha times the gradient? Um, let me
[00:39:42] raise the screen. So let me suggest you work through one example. It turns out that if you add alpha times the gradient, you'll be going uphill rather than going downhill. And maybe one way to see that would be, um, you know, take a quadratic function, right? [00:40:01] If you're here, the gradient is in the positive direction, and you want to reduce — so this would be theta, and this would be J of theta, yes — so you want theta to decrease. The gradient is positive and you want theta to decrease, so you want to subtract a multiple of the gradient. Um, I think maybe the best way to see that would be to work through an example yourself: set J of theta equals theta squared, and say theta equals one. So here, at this quadratic function, the derivative is positive, so you want to subtract that value from theta, and that takes you downhill. All right. [00:40:34] Great, so you've now seen your first learning algorithm.
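To make the suggested exercise concrete — this is just the arithmetic for the example from the board, J(theta) = theta squared with theta starting at 1, using a made-up learning rate of 0.1:

```python
def J(theta):
    """The quadratic from the example: J(theta) = theta^2, minimized at 0."""
    return theta ** 2

def dJ(theta):
    """Its derivative: dJ/dtheta = 2*theta (positive when theta > 0)."""
    return 2 * theta

theta, alpha = 1.0, 0.1

# Subtracting alpha times the gradient moves theta toward the minimum...
down = theta - alpha * dJ(theta)   # 1.0 - 0.1*2 = 0.8
# ...while adding it moves theta away from the minimum, uphill.
up = theta + alpha * dJ(theta)     # 1.0 + 0.1*2 = 1.2

print(J(down), J(theta), J(up))    # cost goes down on the left, up on the right
```

The subtracted step lands at theta = 0.8 where the cost is lower; the added step lands at theta = 1.2 where the cost is higher — which is why the update rule subtracts.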
[00:40:45] And, you know, gradient descent and linear regression is definitely still one of the most widely used learning algorithms in the world today, and if you implement this — if you implement this today, right — you could use it for some actually pretty decent purposes. Right. [00:41:02] Now, I want to give this algorithm one other name. So our gradient descent algorithm here calculates this derivative by summing over your entire training set m, and so sometimes this version of gradient descent has another name, which is batch gradient descent. And the term batch — you know, and again, I think in machine learning, as a whole community, we just make up names for stuff, and sometimes the names aren't great — but the term batch gradient descent refers to the fact that you look at the entire training set, all 49 examples in the example I just had on PowerPoint. You know, you think of all 49 examples as one batch of data.
data as [00:41:56] and we're gonna process all the data as a batch so hence the name batch gradient [00:41:59] a batch so hence the name batch gradient descent [00:41:59] descent do you disadvantage a bachelor in [00:42:02] do you disadvantage a bachelor in descent is that if you have a giant data [00:42:04] descent is that if you have a giant data set if you have and and you're in era of [00:42:07] set if you have and and you're in era of big data we're really moving to large [00:42:09] big data we're really moving to large and larger data set there and serve use [00:42:11] and larger data set there and serve use you know train machine learning models [00:42:13] you know train machine learning models of like hundreds of millions of examples [00:42:15] of like hundreds of millions of examples and and if you are trying to if you have [00:42:18] and and if you are trying to if you have if you download the US Census database [00:42:21] if you download the US Census database if your data on the United States Census [00:42:23] if your data on the United States Census that's a very large data set and you [00:42:25] that's a very large data set and you want to predict housing prices from [00:42:27] want to predict housing prices from across the United States that that that [00:42:29] across the United States that that that may have a data set with many many [00:42:31] may have a data set with many many millions of examples and the [00:42:33] millions of examples and the disadvantage a batch gradient descent is [00:42:36] disadvantage a batch gradient descent is that in order to make one update to your [00:42:40] that in order to make one update to your parameters in order to take even a [00:42:41] parameters in order to take even a single step of gradient descent you need [00:42:44] single step of gradient descent you need to calculate this sum and if M is say a [00:42:48] to calculate this sum and if M is say a million or ten million or 100 million [00:42:50] million or 
[00:42:53] you need to scan through your entire database — scan your entire dataset — and calculate this for, you know, 100 million examples and sum it up. And so every single step of gradient descent becomes very slow, because you're scanning over — you're reading over, right — like 100 million training examples before you can even, you know, make one tiny little step of gradient descent. Okay. [00:43:17] By the way, I think — I don't know — I feel like in today's era of big data, people start to lose intuitions about what's big and what's not, and I think even by today's standards, like, a hundred million examples is still very big. I only rarely use a hundred million examples — although maybe in a few years we'll look back on a hundred million examples and say that was really small, but at least today. [00:43:36] So the main disadvantage of batch gradient descent is that every single step of gradient descent requires that you read through, you know, your entire dataset.
[00:43:47] Maybe terabytes of data — maybe tens or hundreds of terabytes of data — before you can even update the parameters just once. And if gradient descent needs, you know, hundreds of iterations to converge, then you'd be scanning through your entire dataset hundreds of times. Oh, and sometimes we train our algorithms for thousands or tens of thousands of iterations, and so this gets expensive. [00:44:15] So there's an alternative to batch gradient descent, and let me just write out the algorithm here, then we can talk about it — which is going to repeatedly do this. [00:44:52] So this algorithm, which is called stochastic gradient descent: instead of scanning through all million examples before you update the parameters theta even a little bit, in stochastic gradient descent, in the inner loop of the algorithm, you loop through the examples one at a time, taking a gradient descent step using the derivative of just one single example.
[00:45:22] Of just that one example. Oh, excuse me — right: so let i go from 1 to m, and update theta j for every j. So you update this for j equals 1 through n — update theta j using this derivative — but now the derivative is taken just with respect to the one training example i, and you update this for every j. [00:46:06] And so let me just draw a picture of what this algorithm is doing. If this is the contour, like the one you saw just now — so the axes are theta 0 and theta 1, and the height of the surface, or, you know, the contours, denote J of theta — with stochastic gradient descent, what you do is you initialize the parameters somewhere, and then you look at your first training example: hey, let's just look at one house and see if we can predict that house's price better, and you modify the parameters to increase the accuracy with which you predict the price of that one house.
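A minimal sketch of that inner loop — updating every theta_j using the derivative on a single example i at a time. The data here is synthetic (a made-up "price = 0.5 + 1.0 * size plus noise" model, not the 49-house dataset from the slides), and the loop structure is the stochastic gradient descent update just described:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical housing-style data: m examples, feature = size, target = price.
m = 200
sizes = rng.uniform(1.0, 5.0, m)
prices = 0.5 + 1.0 * sizes + rng.normal(0.0, 0.1, m)
X = np.c_[np.ones(m), sizes]        # x_0 = 1 (intercept term), x_1 = size

theta = np.zeros(2)
alpha = 0.01

# Stochastic gradient descent: in the inner loop, take a step using the
# derivative of the squared error on a single example i — not a sum over m.
for epoch in range(10):
    for i in rng.permutation(m):            # visit examples in random order
        h = X[i] @ theta                    # hypothesis on example i only
        theta -= alpha * (h - prices[i]) * X[i]  # updates every theta_j at once

print(theta)   # noisy path, but it heads toward roughly [0.5, 1.0]
```

Each update touches one example, so the parameters move after every house rather than after a full pass over the data.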
[00:46:40] And because you've fitted just the one house, you know, maybe you end up improving the parameters a little bit, but not quite going in the most direct direction downhill. And then you go look at the second house and say, hey, let's try to fit that house better, and then update the parameters, and look at a third house, a fourth house. [00:47:00] And so as you run stochastic gradient descent, it takes a slightly noisy, slightly random path, but on average it's headed toward the global minimum, okay? So as you run stochastic gradient descent — stochastic gradient descent will actually never quite converge. Batch gradient descent, you saw, kind of went to the global minimum and stopped, right?
[00:47:35] But with stochastic gradient descent, even as you run it, the parameters will oscillate and won't ever quite converge, because you're always running around looking at different houses, trying to do better on just that one house, on that one house, on that one house. But when you have a very large dataset, stochastic gradient descent allows your implementation — allows your algorithm — to make much faster progress. And so when you have very large datasets, stochastic gradient descent is used much more in practice than batch gradient descent. [00:48:11] [Student: is it possible to start with stochastic gradient descent and then switch over to batch gradient descent?] Yes, it is. Oh, and — something we won't talk much about in this class, although you'll see it later — there's also mini-batch gradient descent, where you use, say, 100 examples at a time rather than one example at a time, and that's another algorithm that's actually used more often in practice.
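A sketch of that mini-batch variant, on the same kind of made-up synthetic data as before (illustrative code, not from the lecture): instead of one example per update, average the gradient over roughly 100 examples at a time.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data in the spirit of the lecture's housing example.
m = 1000
X = np.c_[np.ones(m), rng.uniform(1.0, 5.0, m)]
y = 0.5 + 1.0 * X[:, 1] + rng.normal(0.0, 0.1, m)

theta = np.zeros(2)
alpha, batch_size = 0.05, 100     # ~100 examples at a time, as in the lecture

for epoch in range(50):
    order = rng.permutation(m)
    for start in range(0, m, batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        # Average the gradient over the mini-batch — a middle ground between
        # one example (stochastic GD) and all m examples (batch GD).
        grad = Xb.T @ (Xb @ theta - yb) / batch_size
        theta -= alpha * grad

print(theta)   # heads toward roughly [0.5, 1.0]
```

Each update is cheaper than a full batch pass but far less noisy than a single-example update, which is why this middle ground is so common in practice.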
[00:48:47] In practice, you know, when your dataset is large, we rarely ever switch over to batch gradient descent, because batch gradient descent is just so slow, right? So — I don't know, I'm thinking through concrete examples of problems I've worked on, and I think that's right for a lot of modern machine learning, where you have very, very large datasets. Right, so, you know — if you're building a speech recognition system, you might have like a terabyte of data, right? And it's so expensive to scan through a terabyte of data — just reading it from disk, right, is so expensive — that you would probably never even run one iteration of batch gradient descent. [00:49:23] And it turns out the one huge saving grace of stochastic gradient descent is: let's say you run stochastic gradient descent, right, and, you know, you end up with this parameter, and that's the parameter you use for your machine learning system, rather than the global optimum.
[00:49:44] It turns out that parameter is actually not that bad, right? You'll probably make perfectly fine predictions even if you don't get to, like, the global minimum. So what you said, I think, is a fine thing to do — no harm trying it — although in practice, in practice, we don't bother. I think in practice we use stochastic gradient descent, and the thing that actually is more common is to slowly decrease the learning rate: so just keep using stochastic gradient descent, but reduce the learning rate over time so it takes smaller and smaller steps. [00:50:14] If you do that, then what happens is the size of the oscillations will decrease, and so you end up oscillating or bouncing around in a smaller region. So wherever you end up may not be the global minimum, but at least it'll be closer to it. Yeah, so the decreasing learning rate is used much more often. Cool.
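One way to sketch the decreasing-learning-rate idea. The 1/(1 + t/1000) style schedule here is a common choice but an assumption on my part — the lecture doesn't specify a schedule — and the data is again made up:

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up noisy data so the oscillations are visible.
m = 500
X = np.c_[np.ones(m), rng.uniform(1.0, 5.0, m)]
y = 0.5 + 1.0 * X[:, 1] + rng.normal(0.0, 0.5, m)

theta = np.zeros(2)
alpha0 = 0.02
step = 0

for epoch in range(30):
    for i in rng.permutation(m):
        step += 1
        # Shrink the learning rate over time (one of many possible schedules),
        # so the steps — and hence the oscillations — get smaller and smaller.
        alpha = alpha0 / (1.0 + step / 1000.0)
        h = X[i] @ theta
        theta -= alpha * (h - y[i]) * X[i]

print(theta)   # settles near [0.5, 1.0] instead of bouncing around it
```

With a constant learning rate the parameters would keep bouncing in a region whose size scales with alpha; shrinking alpha shrinks that region, which is the behavior described above.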
[00:50:40] Oh, sure — [student question about how to tell when to stop] — what I do is plot J of theta over time. So J of theta is the cost function you're trying to drive down, so monitor J of theta as, you know, it's going down over time, and if it looks like it's stopped going down, then you can say, oh, it looks like it's about stopped going down — it's not training anymore. [00:51:02] Oh, and you know, one nice thing about linear regression is that it has no local optima, and so you run into these convergence-debugging kinds of issues less often. When you're training highly nonlinear things like neural networks — which we'll talk about later in CS229 as well — oh, these issues become more acute. Okay, great. So, um — [00:51:33] [student question about the size of the learning rate] — oh, the learning rate here? It's usually much bigger than that. Yeah, yeah — because if your learning rate was 1 over m times what you'd use for batch gradient descent, then stochastic gradient descent would end up being as slow as batch gradient descent.
[00:51:49] So it's usually much bigger. Okay, so, um — so that's stochastic gradient descent. Oh, and I'll tell you what I do: if you have a relatively small dataset — you know, if you have, I don't know, like hundreds of examples, maybe thousands of examples — where it's computationally efficient to do batch gradient descent, if batch gradient descent doesn't cost too much, I would almost always just use batch gradient descent, because it's one less thing to fiddle with, right? It's just one less thing to have to worry about — the parameters oscillating. [00:52:20] But if your dataset is too large, so that batch gradient descent becomes prohibitively slow, then almost everyone would use, you know, stochastic gradient descent there, right — or some form of stochastic gradient descent. [00:52:47] All right. So gradient descent — both batch gradient descent and stochastic gradient descent — is an iterative algorithm.
[00:52:58] That means you have to take multiple steps to get, you know, near — hopefully — the global optimum. It turns out there's another algorithm. Oh, and for many of the other algorithms we'll talk about in this class, including generalized linear models and neural networks and a few other algorithms, you will have to use gradient descent, and so we'll see gradient descent, you know, as we develop multiple different algorithms later this quarter. [00:53:27] It turns out that for the special case of linear regression — and I mean linear regression, not the other algorithms we'll talk about next Monday, not the other ones we'll develop in this course — if the algorithm you're using is linear regression, exactly linear regression, it turns out there's a way to solve for the optimal value of the parameters theta, to just jump in one step to the global optimum, without needing to use an iterative algorithm.
[00:53:50] Right. And this algorithm I'm gonna present next is called the normal equation. It works only for linear regression — it doesn't work for any of the other algorithms we'll talk about later in this course. [00:54:10] But let me quickly show you the derivation of that. And what I want to do is give you a flavor of how to derive the normal equation, and where you end up — you know, what I hope to do — is end up with a formula that lets you say theta equals some stuff, where you just set theta equal to that, and in one step, with a few matrix multiplications, you end up with the optimal value of theta — the value that lands you right at the global optimum, right? Just like that, just in one step, okay? [00:54:45] Um, and if you've taken, you know, advanced linear algebra classes before, you may have seen this formula for linear regression.
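The one-step formula being referred to is the normal equation, theta = (X^T X)^{-1} X^T y. A quick numerical sketch — the three-column design matrix (intercept plus two made-up house features, so n = 2 and theta is in R^{n+1} = R^3) and the true coefficients are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical design matrix: intercept column plus two house features.
m = 49
X = np.c_[np.ones(m), rng.uniform(1.0, 5.0, m), rng.integers(1, 6, m)]
y = X @ np.array([0.5, 1.0, 0.25]) + rng.normal(0.0, 0.1, m)

# Normal equation: solve (X^T X) theta = X^T y. This jumps straight to the
# global optimum of J(theta) in one step — no iterations at all.
# (numpy's lstsq solves the same system in a numerically preferred way.)
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)   # close to the made-up true coefficients [0.5, 1.0, 0.25]
```

At the solution, the gradient X^T(X theta - y) is (numerically) zero — the same condition gradient descent only approaches iteratively.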
Yashiro clauses do is what some of the natural [00:54:58] clauses do is what some of the natural classes do is cover the board with you [00:55:00] classes do is cover the board with you know pages and pages and matrix [00:55:02] know pages and pages and matrix derivatives what I want to do is [00:55:04] derivatives what I want to do is describe to you a matrix derivative [00:55:08] describe to you a matrix derivative notation that allows you to derive the [00:55:10] notation that allows you to derive the normal equation in roughly four lines of [00:55:13] normal equation in roughly four lines of linear algebra rather than so pages and [00:55:15] linear algebra rather than so pages and pages in linear algebra and in the work [00:55:18] pages in linear algebra and in the work I've done in machine learning [00:55:19] I've done in machine learning you know sometimes notation really [00:55:21] you know sometimes notation really matters right if you're the right [00:55:22] matters right if you're the right notation you can solve some problems [00:55:24] notation you can solve some problems much more easily and what I want to do [00:55:26] much more easily and what I want to do is define this matrix linear algebra [00:55:31] is define this matrix linear algebra notation and then I don't want to do all [00:55:34] notation and then I don't want to do all the steps of the derivation I'm gonna [00:55:35] the steps of the derivation I'm gonna give you a give you a sense of the [00:55:37] give you a give you a sense of the flavor of what it looks like and then [00:55:39] flavor of what it looks like and then I'll ask you to get a lot of details [00:55:42] I'll ask you to get a lot of details yourself in the in the lecture notes [00:55:46] yourself in the in the lecture notes will work out everything in more detail [00:55:48] will work out everything in more detail than I want to do algebra in class oh [00:55:49] than I want to do algebra in class oh and um in problem set one 
[00:55:52] And in problem set one, you get to practice using this yourself, to derive some additional things — I've found this notation really convenient for deriving learning algorithms. [00:56:02] Okay, so I'm going to use the following notation. J is a function mapping from the parameters to the real numbers, and I'm going to define the derivative of J(θ) with respect to θ, where, remember, θ is a three-dimensional vector — it's in R^3, or rather R^(n+1): if you have two features of the house, if n = 2, then θ is three-dimensional, (n+1)-dimensional. So θ is a vector, and I'm going to define the derivative with respect to θ of J(θ) as follows. This is going to be itself a 3-by-1 vector:

∇_θ J(θ) = [ ∂J/∂θ_0 ; ∂J/∂θ_1 ; ∂J/∂θ_2 ]

[00:57:01] So I hope this notation is clear: this is a three-dimensional vector with three components — that's the first component of the vector, that's the second, and the third. It's the partial derivative of J with respect to each of the three elements of θ. [00:57:33] More generally, for the notation we'll use, maybe an example. Let's say that A is a matrix — say A is the 2-by-2 matrix

A = [ a11  a12 ; a21  a22 ]

Then you might have some function of the matrix A that returns a real number, so f maps from R^(2×2) to the real numbers. And so, for example, if f(A) = a11 + a12^2, then f([5 6; 7 8]) would be equal to, I guess, 5 + 6^2 = 41. [00:58:51] So as we go through this derivation, we'll be working a little bit with functions that map from matrices to real numbers, and this is just one made-up example of a function that takes a matrix and maps the values of that matrix to a real number.
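As a quick sanity check of that made-up example, here it is in numpy (note the 0-based indexing, so a11 is `A[0, 0]`):

```python
import numpy as np

# The lecture's made-up function: f maps a 2x2 matrix to a real number,
# f(A) = a11 + a12^2 (1-indexed), i.e. A[0, 0] + A[0, 1]**2 in numpy.
def f(A):
    return A[0, 0] + A[0, 1] ** 2

A = np.array([[5.0, 6.0],
              [7.0, 8.0]])
print(f(A))  # 5 + 6^2 = 41.0
```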
[00:59:05] And when you have a matrix function like this, I'm going to define the derivative with respect to A of f(A) to be itself a matrix — the derivative of f(A) with respect to the matrix A will itself be a matrix with the same dimensions as A, and the elements of it are the derivatives with respect to the individual elements. I'll just write it like this:

( ∇_A f(A) )_ij = ∂f/∂A_ij

[01:00:01] Okay, so if A is a 2-by-2 matrix, then the derivative of f(A) with respect to A is itself a 2-by-2 matrix, and you compute this 2-by-2 matrix just by looking at f and taking derivatives with respect to the different elements, and plugging them into the different elements of this matrix. And so in this example, I guess the derivative of f(A) with respect to A would be

∇_A f(A) = [ 1  2·a12 ; 0  0 ]

[01:00:39] I got these four numbers by taking the definition of f and taking the derivative with respect to a11 and plugging that in here, taking the derivative with respect to a12 and plugging that in here, and taking the derivatives with respect to the remaining elements and plugging them in here. So that's the definition of a matrix derivative. [01:01:11] [Student question] Oh yes, we use the same definition for a vector — an n-by-1 matrix. And in fact that definition and the definition for the derivative of J with respect to θ are consistent: if you apply this definition to a column vector, treating the column vector as an n-by-1 — or rather (n+1)-by-1 — matrix, then it specializes to what we described here. [01:01:48] All right, so let's see. I want to leave the details to the lecture notes, because there are more lines of algebra, but I want to give you an overview of what the derivation of the normal equation looks like.
[01:02:13] So, armed with this definition of the derivative with respect to a matrix, the broad outline of what we're going to do is: we're going to take J(θ) — that's the cost function — and take the derivative with respect to θ, since θ is a vector. And, well, how do you minimize a function? You take the derivative with respect to θ, set it equal to zero, and then solve for the value of θ that makes the derivative zero — at a maximum or minimum of a function, the derivative is equal to zero. [01:02:52] So how you derive the normal equation is: J(θ) maps from a vector to a real number, so we'll take the derivative with respect to θ, set that derivative equal to zero, and solve for θ, and then we end up with a formula for θ that lets you immediately go to the global minimum of the cost function J(θ). And all of the build-up, all of this notation, is there to answer: what does this mean, and is there an easy way to compute the derivative of J(θ)? [01:03:29] Okay, so to help you understand the lecture notes — when, hopefully, you take a look at them — just a couple of other definitions. If A is a square matrix — say A is an n-by-n matrix, so the number of rows equals the number of columns — I'm going to denote the trace of A to be the sum of the diagonal entries:

tr A = Σ_i A_ii

[01:04:05] This is pronounced "the trace of A", and you can also write it with the trace operator, like the trace function applied to A, but by convention we often write trace of A without the parentheses. So trace just means sum of diagonal entries. [01:04:29] And some facts about the trace of a matrix: the trace of A is equal to the trace of A
transpose, [01:04:36] because when you transpose a matrix you're just flipping it along the 45-degree axis, so the diagonal entries actually stay the same when you transpose the matrix; hence tr A = tr A^T. [01:04:49] Then there are some other useful properties of the trace operator. Here's one that I don't want to prove, but that you could go home and prove yourself with a little bit of work — maybe not too much. If you define f(A) = tr AB, where B is some fixed matrix — so what f(A) does is multiply A and B and then take the sum of the diagonal entries — then it turns out that the derivative with respect to A of f(A) is equal to B transpose:

∇_A tr AB = B^T

[01:05:38] And you could prove this yourself: for any matrix B, if f(A) is defined this way, the derivative is equal to B^T. [01:05:45] The trace function, or the trace operator, has other interesting properties. The trace of AB is equal to the trace of BA — you could prove this from first principles; it's a little bit of work if you expand out the definitions of A and B. And the trace of A times B times C is equal to the trace of C times A times B; this is a cyclic permutation property — if you multiply several matrices together, you can always take one from the end and move it to the front, and the trace will remain the same:

tr AB = tr BA,   tr ABC = tr CAB

[01:06:31] And another one, which is a little bit harder to prove, concerns the derivative of tr A A^T C. Okay, so I think, just as in your ordinary calculus, we know the derivative of x^2 is 2x, right — we all figured that out once, and we just use it, without having to re-derive it every time —
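All of these trace facts are easy to confirm numerically on random matrices; a small sketch (the random 3-by-3 matrices are my choice, purely for checking):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3))
B = rng.normal(size=(3, 3))
C = rng.normal(size=(3, 3))

# tr A = tr A^T
assert np.isclose(np.trace(A), np.trace(A.T))
# tr AB = tr BA
assert np.isclose(np.trace(A @ B), np.trace(B @ A))
# cyclic permutation: tr ABC = tr CAB
assert np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B))

# grad_A tr(AB) = B^T, checked by finite differences on f(A) = tr(AB)
eps = 1e-6
G = np.zeros_like(A)
for i in range(3):
    for j in range(3):
        E = np.zeros_like(A)
        E[i, j] = eps
        G[i, j] = (np.trace((A + E) @ B) - np.trace((A - E) @ B)) / (2 * eps)
assert np.allclose(G, B.T, atol=1e-4)

print("all trace identities check out")
```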
[01:07:01] this is a little bit like that. The derivative of the trace of A A^T C is

∇_A tr A A^T C = CA + C^T A

Think of this as analogous to d/da (a^2 c) = 2ac; this is like the matrix version of that. [01:07:53] All right. So finally, what I'd like to do is take J(θ) and express it in this matrix-vector notation, so we can take the derivatives with respect to θ, set those equal to zero, and just solve for the value of θ. So let me just write out the definition of J(θ):

J(θ) = (1/2) Σ_{i=1}^{m} ( h_θ(x^(i)) − y^(i) )^2

[01:08:42] And it turns out that if you define the matrix capital X as follows — I'm going to take the matrix capital X and take the training examples we have and stack them up in rows, so we have m training examples —
[01:09:05] so the x's, which we call vectors — I'm taking transposes, so you just stack up the m examples' transposes as m rows here:

X = [ x^(1)^T ; x^(2)^T ; … ; x^(m)^T ]

Let me call this capital X the design matrix. [01:09:18] And it turns out that if you define X this way, then Xθ is this matrix times θ, and the way matrix-vector multiplication works — θ is now a column vector, [θ_0; θ_1; θ_2] — is that you multiply this column vector with each of these rows in turn, and so this ends up being

Xθ = [ x^(1)^T θ ; x^(2)^T θ ; … ; x^(m)^T θ ]

which is of course just the vector of all of the predictions of the algorithm. [01:10:28] And now let me also define a vector y, taking all of the labels from your training examples and stacking them up into a big column vector — let me define y that way.
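A tiny numpy illustration of the design matrix idea, with made-up housing-style numbers (the feature values and θ are invented for illustration): stacking the examples as rows means one matrix-vector product computes every prediction at once.

```python
import numpy as np

# Three training examples with n = 2 features; x0 = 1 is the intercept term,
# so each x^(i) lives in R^(n+1) = R^3. Numbers are made up.
x1 = np.array([1.0, 2104.0, 3.0])   # [1, size, #bedrooms]
x2 = np.array([1.0, 1416.0, 2.0])
x3 = np.array([1.0, 1534.0, 3.0])

# Design matrix: stack the examples' transposes as rows, giving an m x (n+1) matrix.
X = np.stack([x1, x2, x3])

theta = np.array([10.0, 0.1, 5.0])

# X @ theta computes [x1^T theta, x2^T theta, x3^T theta]:
# the vector of all m predictions in one matrix-vector product.
preds = X @ theta
loop_preds = np.array([x @ theta for x in (x1, x2, x3)])
print(np.allclose(preds, loop_preds))  # True
```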
[01:10:53] It turns out that J(θ) can then be written as

J(θ) = (1/2) (Xθ − y)^T (Xθ − y)

[01:11:20] Okay, and let me just outline the proof, but I won't do this in great detail. Xθ − y — this is Xθ, this is y — is going to be this vector:

Xθ − y = [ h(x^(1)) − y^(1) ; … ; h(x^(m)) − y^(m) ]

This is just all the errors your learning algorithm is making on the examples — the differences between the predictions and the actual labels. [01:11:55] And if you remember, z^T z = Σ_i z_i^2 — a vector transposed times itself is the sum of squares of its elements. And so this vector transposed times itself is the sum of squares of its elements, which is why the cost function J(θ), which is computed by taking the sum of squares of all of these errors, is what you get by taking this vector Xθ − y and multiplying it, transposed, by itself: you end up with the sum of squares of those error terms. Okay. [01:12:39] And if some of the steps don't quite make sense, really don't worry about it; all of this is written out more slowly and carefully in the lecture notes. But I wanted you to have a sense of the big picture of the derivation before you go through the details in the lecture notes yourself.
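The claim that the matrix-vector form equals the element-wise sum of squared errors is straightforward to verify; a sketch on random data (the sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 20, 3
X = rng.normal(size=(m, n))     # design matrix
y = rng.normal(size=m)
theta = rng.normal(size=n)

# J(theta) written element-wise: 1/2 * sum_i (h(x^(i)) - y^(i))^2
J_loop = 0.5 * sum((X[i] @ theta - y[i]) ** 2 for i in range(m))

# J(theta) in matrix-vector form: 1/2 * (X theta - y)^T (X theta - y),
# using z^T z = sum_i z_i^2
e = X @ theta - y
J_vec = 0.5 * e @ e

print(np.isclose(J_loop, J_vec))  # True
```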
[01:13:08] So finally, what we want to do is take the derivative with respect to θ of J(θ) and set that to zero. And so this is going to be equal to

∇_θ (1/2) (Xθ − y)^T (Xθ − y)

[01:13:33] I'm going to do the steps really quickly — the steps require some of the little properties of traces and matrix derivatives that I wrote down briefly just now — so I'm going to do these very quickly without going into the details. This is equal to

(1/2) ∇_θ (θ^T X^T − y^T)(Xθ − y)

— take transposes of these things, so this becomes θ^T X^T minus y^T — and then, kind of like expanding out a quadratic function — (a − b)(c − d) is ac − ad and so on — I just write this out:

(1/2) ∇_θ ( θ^T X^T X θ − θ^T X^T y − y^T X θ + y^T y )

[01:14:29] And what I just did here is similar to how (ax − b)(ax − b) = a^2 x^2 − axb − bax + b^2; it's just expanding out the quadratic. [01:15:14] And then the final step — is that right? oh yes, thank you — and then the final step is, for each of these four terms, first, second, third, and fourth, to take the derivative with respect to θ, and if you use some of the formulas I was alluding to over there, you find — I don't want to show the derivation — that the derivative turns out to be

∇_θ J(θ) = (1/2) ( X^T X θ + X^T X θ − X^T y − X^T y )

and so this simplifies to

∇_θ J(θ) = X^T X θ − X^T y

[01:16:09] And so, as described, we're going to set this derivative to zero. How you go from this step to that step uses the matrix derivatives explained in more detail in the lecture notes.
[01:16:23] And so the final step is: having set this to zero, this implies that

X^T X θ = X^T y

So these are called the normal equations, and the optimal value for θ is

θ = (X^T X)^(-1) X^T y

Okay. [01:16:49] And if you implement this, then you can, in basically one step, get the value of θ that corresponds to the global minimum. [01:17:08] And again, a common question I get is: well, what if X^T X is non-invertible? What that usually means is that you have redundant features — your features are linearly dependent. If you use something called the pseudo-inverse, you kind of get the right answer in that case, although I think the even more right answer is that having linearly dependent features probably means you have the same feature repeated twice, and I would usually go and figure out which features are actually repeated and leading to this problem. Okay.
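Putting it together, a minimal numpy sketch of the normal equations, including the singular case the question raises (a column duplicated to make the features linearly dependent). `np.linalg.solve` and `np.linalg.pinv` are standard numpy calls; the data is made up:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 30, 3
X = rng.normal(size=(m, n))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.01 * rng.normal(size=m)

# Normal equation: theta = (X^T X)^(-1) X^T y.
# Solving the linear system is preferred over forming the inverse explicitly.
theta = np.linalg.solve(X.T @ X, X.T @ y)

# With redundant (linearly dependent) features, X^T X is singular;
# the pseudo-inverse still returns a sensible least-squares answer.
X_dup = np.hstack([X, X[:, :1]])          # same feature repeated twice
theta_dup = np.linalg.pinv(X_dup) @ y     # pinv handles the singular case

# Both solutions project y onto the same column space, so predictions agree.
print(np.allclose(X @ theta, X_dup @ theta_dup, atol=1e-6))  # True
```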
[01:17:34] All right, any last questions? So that's the normal equations — I hope you read through the detailed derivations in the lecture notes. Any last questions before we wrap up? Great. [01:17:53] [Student question] Oh yeah — how do you choose the learning rate? This is quite empirical, I think; most people would try different values and just pick one. All right, I think let's break; if people have more questions, as the TAs come up we can take a few questions, but let's call it a day. Thanks, everyone.

================================================================================
LECTURE 003
================================================================================
Locally Weighted & Logistic Regression | Stanford CS229: Machine Learning - Lecture 3 (Autumn 2018)
Source: https://www.youtube.com/watch?v=het9HFqo1TQ
---
Transcript

[00:00:03] What I'd like to do today is continue our discussion of supervised learning. So last Wednesday you saw the linear regression algorithm, including both how to pose the problem and gradient descent, and then the normal equations. What I'd like to do today is talk about locally weighted regression, which is
a way to modify linear regression to make it fit very nonlinear functions — so you're not fitting just straight lines. And then we'll talk about a probabilistic interpretation of linear regression, and that will lead us into the first classification algorithm you'll see in this class, which is called logistic regression; and we'll talk about an algorithm called Newton's method for logistic regression. [00:00:49] And so the dependency of ideas in this class is that locally weighted regression will depend on what you learned in linear regression. And then what I'm going to do is just cover the key ideas of locally weighted regression and let you play with some of the ideas yourself in problem set one, which we'll release later this week. And then I'll give a probabilistic interpretation of linear regression; logistic regression will depend on that, and Newton's method is for logistic regression. [00:01:19] To recap the notation you saw
on Wednesday: we use the notation (x^(i), y^(i)) to denote a single training example, where x^(i) was n+1 dimensional. So if you had two features — the size of a house and the number of bedrooms — then x^(i) would be two plus one, i.e. three-dimensional, because we had introduced a new, sort of fake feature x0 which was always set to the value of 1. And then y^(i), in the case of regression, is always a real number; m was the number of training examples, and n was the number of features. And this was the hypothesis — h(x) = theta^T x, a linear function of the features x, including this feature x0 which is always set to 1 — and J was the cost function: you minimize J as a function of theta to find the parameters theta for your straight-line fit to the data. Okay, so that's what you saw last Wednesday.
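The recap above translates directly into a short sketch — illustrative NumPy code, not from the lecture; the function names are my own. It shows the fake feature x0 = 1, the hypothesis h(x) = theta^T x, and the least-squares cost J(theta):

```python
import numpy as np

def add_intercept(X):
    """Prepend the 'fake' feature x0 = 1 to every example."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def hypothesis(theta, X):
    """h_theta(x) = theta^T x, applied to every row of X."""
    return X @ theta

def cost(theta, X, y):
    """J(theta) = 1/2 * sum_i (h_theta(x^(i)) - y^(i))^2."""
    residual = hypothesis(theta, X) - y
    return 0.5 * residual @ residual

# Toy data: two features (size of house, number of bedrooms) -> price,
# so x^(i) is 2 + 1 = 3-dimensional after adding x0 = 1.
X = add_intercept(np.array([[2104.0, 3.0], [1600.0, 3.0]]))
y = np.array([400.0, 330.0])
theta = np.zeros(3)       # n + 1 = 3 parameters
print(cost(theta, X, y))  # with theta = 0, J = 1/2 * (400^2 + 330^2) = 134450.0
```

Minimizing this J over theta (by gradient descent or the normal equations, as in the previous lecture) gives the straight-line fit.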
[00:02:28] Um, now, if you have a data set that looks like that — where this is the size of a house and this is the price of a house — what you saw last Wednesday was an algorithm to fit a straight line to this data, so the hypothesis was of the form theta0 * x0 + theta1 * x1, right? But with this data set, maybe the data actually looks a little bit like that, and so one question you have to address when fitting models to data is: what are the features you want? Do you want to fit a straight line to this problem, or do you want to fit a hypothesis of the form theta1 * x + theta2 * x^2, since maybe this is a quadratic function? But the problem with a quadratic function is that it eventually starts, you know, curving back down — this starts curving back down. So maybe you don't want to fit a quadratic function; instead maybe you want, um, to
[00:03:35] fit something like that, if housing prices sort of curve down a little bit but you don't want them to eventually curve back down the way a quadratic function would. [00:03:46] And if you want to do this, the way you would implement it is: you define the first feature x1 = x and the second feature x2 = x^2, or you define x1 = x and x2 = sqrt(x). And by defining a new feature x2 — which can be the square of x or the square root of x — the machinery of linear regression that you saw on Wednesday applies to fit these types of functions to the data. [00:04:16] Later this quarter you'll hear about feature selection algorithms, which are a type of algorithm for automatically deciding: do you want x^2 as a feature, or sqrt(x) as a feature, or maybe you want log(x) as a feature?
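As a concrete sketch of this feature trick (my own illustration, with made-up toy data — not the lecture's code): build the design matrix with x2 = x^2 or x2 = sqrt(x), and the ordinary least-squares machinery is reused unchanged:

```python
import numpy as np

def design_matrix(x, second_feature):
    """Columns [x0 = 1, x1 = x, x2 = f(x)]: only the features change;
    the linear-regression machinery stays the same."""
    return np.column_stack([np.ones_like(x), x, second_feature(x)])

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.sqrt(x)  # pretend prices happen to follow sqrt(x) exactly

# Fit theta by least squares for two candidate feature sets.
X_quad = design_matrix(x, np.square)   # [1, x, x^2]
X_sqrt = design_matrix(x, np.sqrt)     # [1, x, sqrt(x)]
theta_quad, *_ = np.linalg.lstsq(X_quad, y, rcond=None)
theta_sqrt, *_ = np.linalg.lstsq(X_sqrt, y, rcond=None)

# The sqrt features fit this toy data exactly: theta = [0, 0, 1].
print(theta_sqrt)
```

Which feature set "does the best job" is exactly the question feature selection tries to answer automatically.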
Right — but which of these features does the best job of fitting the data that you have, if it's not fit well by a perfectly straight line? [00:04:41] What I'd like to do today — well, you'll hear about feature selection later this quarter; what I want to show you today is a different way of addressing this problem of when the data isn't well fit by a straight line. And in particular, let me share with you an idea called locally weighted regression, or locally weighted linear regression. [00:05:00] So let me use a slightly different example to illustrate this, which is: you know, if you have a data set that looks like that, it's pretty clear what the shape of this data is, but how do you fit a curve that kind of looks like that, right? And it's actually quite difficult to find features — is it sqrt(x), log(x), x^3, the third root of x, x to the 2/3? What is the set of features that lets you do this?
[00:05:35] Well, we'll sidestep all those problems with an algorithm called locally weighted regression. [00:05:53] And so let me introduce a bit of machine learning terminology: in machine learning we sometimes distinguish between parametric learning algorithms and non-parametric learning algorithms. In a parametric learning algorithm, you fit some fixed set of parameters, such as the theta_i, to data. And so linear regression, as you saw last Wednesday, is a parametric learning algorithm, because there's a fixed set of parameters — the theta_i — that you fit to the data, and then you're done, right? Locally weighted regression will be our first exposure to a non-parametric learning algorithm, and what that means is that the amount of data, or parameters, you need to keep grows — and in this case it grows linearly — with the size of the training set. [00:07:19] Okay, so with a parametric learning algorithm, no matter how big your training
set is, you fit the parameters theta_i; then you could erase the training set from your computer's memory and make predictions just using the parameters theta. Whereas in a non-parametric learning algorithm, which we'll see in a second, the amount of stuff you need to keep around in computer memory — the amount of stuff you need to store — grows linearly as a function of the training set size. And so this type of algorithm may not be great if you have a really, really massive dataset, because you'd need to keep all of the data around in computer memory, or on disk, just to make predictions, okay? [00:07:56] But we'll see an example of this, and one of the effects is that it will be able to fit that data that I drew up there quite well, without you needing to fiddle manually with features. [00:08:08] Again, you'll get to practice implementing locally weighted regression
and have that all work, so I'm going to go over the ideas relatively quickly and then let you gain practice in the problem set. [00:08:22] All right, so let me redraw that data set, something like this. So say you have a data set like this. Now, for linear regression, if you want to evaluate the hypothesis at a certain value of the input — to make a prediction at a certain value of x — what you do for linear regression is you fit theta to minimize this cost function, and then you return theta^T x, right? So you fit the straight line, and then if you want to make a prediction at this value x, you return theta^T x. [00:09:27] For locally weighted regression, [00:09:41] you do something slightly different, which is: if this is the value of x at which you want to make a prediction, what you do is look in a little
local neighborhood at the training examples close to that point x where you want to make a prediction. And I'll describe this informally for now, but we'll formalize it in math in a second. Focusing mainly on these examples — you know, looking a little bit at the further examples, but really focusing mainly on these — you try to fit a straight line like that, concentrating on the training examples close to where you want to make a prediction. And by close I mean the values are similar on the x-axis — the x values are similar. And then to actually make a prediction, you use this green line you just fit to make a prediction at that value of x. [00:10:38] Now, if you want to make a prediction at a different point — let's say, you know, the user now says, hey, make a prediction for this point — then what you would do is, again, focus on this local area, kind of look at those points. And when I say focus, I'm saying,
you know, put most of the weight on these points — you kind of take a glance at the points further away, but most of the attention is on these — fit the straight line to that, and then use that straight line to make a prediction, okay? [00:11:06] And so, to formalize this: in locally weighted regression, you will fit theta to minimize a modified cost function, sum over i of w^(i) * (y^(i) - theta^T x^(i))^2, where w^(i) is a weighting function. And a good — well, the default choice, a common choice — of w^(i) will be this: w^(i) = exp(-(x^(i) - x)^2 / 2). I'm going to add something to this equation a little bit later, but w^(i) is a weighting function, and notice that this formula has a defining property: if x^(i) - x is small, then the weight will be close to one. Because x is the location where you want to make a prediction, and x^(i) is the input x of your i-th training example, w^(i) is a weighting function whose value is between 0 and 1 and that tells you how
much you should pay attention to the values of (x^(i), y^(i)) when fitting, say, this green line or that red line. And so if x^(i) - x is small — that's a training example close to where you want to make the prediction x — then this is about e to the 0, right? e to the negative of something small, and e^0 is close to 1. And conversely, if x^(i) - x is large, then w^(i) is close to 0. So if x^(i) is very far away — say you're fitting this green line, and this is your example (x^(i), y^(i)) all the way out there — then, relative to this x, that example's weight is very close to 0, okay? [00:13:39] And so if you look at the cost function, the main modification we've made is that we've added this weighting term.
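The defining property just described is easy to see numerically. This is an illustrative sketch (my own, not the lecture's code) of the weight function, written with the bandwidth tau that gets named a bit later in the lecture; with tau = 1 it reduces to the form on the board:

```python
import numpy as np

def weight(x_i, x, tau=1.0):
    """w^(i) = exp(-(x_i - x)^2 / (2 * tau^2)): close to 1 when the
    training input x_i is near the query point x, close to 0 when far."""
    return np.exp(-(x_i - x) ** 2 / (2 * tau ** 2))

x_query = 5.0
print(weight(5.0, x_query))   # x_i - x = 0  -> e^0 = 1
print(weight(5.1, x_query))   # nearby       -> close to 1
print(weight(50.0, x_query))  # far away     -> essentially 0
```

So nearby examples contribute almost their full squared error to the cost, and faraway ones contribute almost nothing.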
And so what locally weighted regression does is the following: if an example x^(i) is far from where you want to make a prediction, you multiply that error term by 0 — or by a constant very close to zero — whereas if it's close to where you want to make the prediction, you multiply the error term by 1. And so the net effect — since, you know, terms multiplied by zero disappear, right — is that this sums essentially only over the squared-error terms for the examples that are close to the value of x where you want to make a prediction. [00:14:37] And that's why, when you fit theta to minimize this, you end up paying attention only to the examples close to where you want to make the prediction, and fitting a line like the green line over there, okay? [00:14:58] So let me draw a couple more pictures to illustrate this.
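Putting the pieces together, here is a minimal sketch of one locally weighted prediction (my own illustration, assuming the closed-form weighted least-squares solve; the lecture leaves the implementation to the problem set). At each query point, the modified cost is minimized via the weighted normal equations, and the prediction is theta^T x_query:

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=1.0):
    """Locally weighted linear regression at a single query point.

    Minimizes sum_i w^(i) * (y^(i) - theta^T x^(i))^2 in closed form
    via the weighted normal equations, then returns theta^T x_query.
    X must carry a leading column of ones (the x0 = 1 feature)."""
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return theta @ x_query

# Toy 1-D data lying exactly on y = 2x; every local fit recovers the line.
x = np.linspace(0.0, 10.0, 21)
X = np.column_stack([np.ones_like(x), x])
y = 2.0 * x
print(lwr_predict(X, y, np.array([1.0, 4.0]), tau=0.5))  # ~8.0
```

Note that, unlike parametric linear regression, theta is refit from the training data for every new query point — which is exactly the non-parametric memory cost discussed above.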
[00:15:05] Suppose you have a slightly smaller data set, just to make this easier to illustrate. So that's your training set — those are the examples x^(1), x^(2), x^(3), x^(4) — and if you want to make a prediction here, right at that point x, then this curve here — the shape of this curve is actually like this: it's the shape of a Gaussian bell curve. But this has nothing to do with a Gaussian density, right — this thing does not integrate to 1. Sometimes people ask me, is this using a Gaussian density? The answer is no; this is just a function that is shaped a lot like a Gaussian, but, you know, Gaussian densities — probability density functions — have to integrate to one, and this doesn't. So this has nothing to do with a Gaussian probability density. [00:15:50] Question — oh, so how do you choose the width? Well, let me get back to that. And so for this example, this height here says to give this example a weight equal to the height of that thing; give this
example a weight equal to this height, that one the height there, and so on, right? Which is why, if you actually have an example way out there, you know, it's given a weight that's essentially zero — which is why the algorithm is weighting only the nearby examples when trying to fit a straight line, for making predictions close to this x, okay? [00:16:31] Um, now, one last thing I want to mention, which is the question just now: how do you choose the width of this Gaussian-shaped function — how fat or how thin should it be? And this decides how big a neighborhood you should look in, in order to decide what's the neighborhood of points that you use to fit your local straight line. [00:17:00] And so, for a Gaussian-shaped function like this, I'm going to call this the bandwidth parameter tau. And this is a parameter — or hyperparameter — of the algorithm, and depending on the choice of tau,
you can choose a fatter or thinner bell-shaped curve, which causes you to look in a bigger or a narrower window in order to decide, you know, how many nearby examples to use in order to fit the straight line, okay? [00:17:36] And it turns out — and I want to leave you to discover this yourself in the problem set — if you've taken a little bit of machine learning elsewhere, you may have heard the terms overfitting and underfitting. It turns out that the choice of the bandwidth tau has an effect on overfitting and underfitting; if you don't know what those terms mean, don't worry about it — we'll define them later this quarter. But what you get to do in the problem set is play with tau yourself and see why, if tau is too broad, you end up over-smoothing the data, and if tau is too thin, you end up fitting a very jagged fit to the data. And if any of these things don't make sense yet, don't worry about it — they'll make sense after you play with it in the problem set.
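One way to see the bandwidth's effect numerically — an illustrative sketch of my own, not the problem-set exercise — is to count how many training points receive non-negligible weight under a fat versus a thin bell curve:

```python
import numpy as np

def neighborhood_size(x_train, x_query, tau, cutoff=0.1):
    """Number of training points whose weight exceeds `cutoff`: a rough
    measure of how wide a window the bell-shaped curve looks at."""
    w = np.exp(-(x_train - x_query) ** 2 / (2 * tau ** 2))
    return int(np.sum(w > cutoff))

x_train = np.linspace(0.0, 10.0, 101)  # grid with spacing 0.1
print(neighborhood_size(x_train, 5.0, tau=2.0))  # broad tau: wide window
print(neighborhood_size(x_train, 5.0, tau=0.2))  # thin tau: narrow window
```

A broad tau averages over many points (risking an over-smoothed fit); a thin tau fits each prediction to only a handful of points (risking a jagged fit).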
[00:18:30] Okay — yeah, since you'll play with varying tau in the problem set, you'll see the net impact for yourself. Okay, thank you. [00:18:44] [Student asks what happens if you need to make a prediction at a value of x outside the range of the training data.] It turns out that you can still use this algorithm; it's just that its results may not be very good. Yeah — [00:19:09] locally weighted linear regression is usually not great at extrapolation, but then most learning algorithms are not great at extrapolation. So all the formulas still work — it's still implementable — but, um, yeah, you know, you can also try it in your problem set and see what happens. [00:19:36] [Student asks whether tau could be variable.] Yes, it is — and there are quite complicated ways to choose tau based on how many points there are in the local region and so on. Yes, there's a huge literature on different
weighting formulas, actually — for example, instead of the Gaussian bump, sometimes people use a triangle-shaped function, so that the weight actually goes to zero on both sides outside a small window. So there are many versions of this algorithm. [00:19:56] So, I tend to use locally weighted linear regression when you have a relatively low-dimensional dataset — when the number of features is not too big, right, so when n is quite small, like two or three or something — and you have a lot of data, and you don't want to think about what features to use. So that's the scenario: if you actually have a data set that looks like the ones I've been drawing, you know, locally weighted regression is a pretty good algorithm. [00:20:25] [Student asks whether keeping all the training data around makes this expensive.] Oh sure — yes, it would be, I guess, but it's relative. If you have, you know, two-, three-, four-dimensional data and hundreds of
examples, or many thousands of examples, it turns out the computation needed to fit the minimization is similar to the normal equations, and so it involves solving a linear system of equations of dimension equal to the number of training examples you have. So if that's, you know, like a thousand or a few thousand, that's not too bad. If you have millions of examples, then there are also more scalable algorithms, like k-d trees and much more complicated algorithms, to do this when you have millions or tens of millions of examples. Yeah.
[00:21:13] Okay, so again, you'll get a better sense of this algorithm when you play with it in the problem set.
[00:21:24] Now, the second topic. So I'm going to put aside locally weighted regression — we won't talk about those ideas anymore today — but what I want to do today is — last Wednesday I had said — I had promised last Wednesday that
today I'll give a justification for why we use the squared error. Right — why the squared error, why not, you know, the fourth power, or the absolute value? And so what I want to show you today is the probabilistic interpretation of linear regression, and this probabilistic interpretation will put us in good standing as we go on to logistic regression today, and then generalized linear models later this week. I'll keep the notation up there, so we can continue to refer to it.
[00:22:13] So, right — so why least squares, why squared error? I'm going to present a set of assumptions under which least squares — using the squared error — falls out very naturally. Which is: let's say, for housing price prediction, let's assume that there's a true price of every house, y(i), which is x(i) transpose theta plus epsilon(i), where epsilon(i) is an error term that includes unmodeled effects, you know, and just random noise. So let's
assume that the way, you know, housing prices truly work is that every house's price is a linear function of the size of the house and the number of bedrooms, plus an error term that captures unmodeled effects — such as, maybe one day that seller is in an unusually good mood, or an unusually bad mood, and so that makes the price go higher or lower; we just don't model that — as well as random noise. Right — or maybe, you know, [something] that isn't one of the features, but other things have an impact on housing prices.
[00:23:41] And we're going to assume that epsilon(i) is distributed Gaussian with mean zero and variance sigma squared. So I'm going to use this notation — the way you read this notation is: epsilon(i), then this tilde, you pronounce it "is distributed as," and then script N for N(0, sigma squared). This is a normal distribution, also called the Gaussian distribution — same
thing —
[00:24:14] the normal distribution with mean zero and variance sigma squared, okay. And what this means is that the probability density of epsilon(i) is the Gaussian density: one over root two pi sigma, e to the negative epsilon(i) squared over two sigma squared. Okay. Oh, and unlike the bell-shaped curve I used earlier for locally weighted linear regression, this thing does integrate to one, right — this function integrates to 1 — and so this is a Gaussian density, this is a probability density function. And this is the familiar, you know, Gaussian bell-shaped curve with mean 0 and variance sigma squared, where sigma kind of controls the width of this Gaussian. Okay. And if you haven't seen Gaussians for a while, we'll go over some of the probability prereqs as well in the class's
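(The point just made — that, unlike the unnormalized bell-shaped weight used for locally weighted regression, this density integrates to one — can be checked numerically; a sketch, with names of my own choosing.)

```python
import numpy as np

def gaussian_density(eps, sigma=1.0):
    # p(eps) = 1 / (sqrt(2*pi) * sigma) * exp(-eps^2 / (2 * sigma^2))
    return np.exp(-eps ** 2 / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)

# Riemann-sum approximation of the integral over a wide interval;
# the area is ~1 for any sigma (sigma only changes the curve's width).
eps = np.linspace(-10.0, 10.0, 200001)
area = np.sum(gaussian_density(eps, sigma=1.5)) * (eps[1] - eps[0])
```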
Friday discussion sections.
[00:25:23] So in other words, we assume that the way housing prices are determined is that first there's a true price, theta transpose x, and then, you know, some random force of nature — right, the mood of the seller, or, I don't know, other factors, right — perturbs it from this true value, theta transpose x(i). And the huge assumption we're going to make is that the epsilon(i)'s, these error terms, are IID — and IID, in the statistics sense, stands for
independently and identically distributed. And what that means is that the error term for one house is independent of the error term for a different house — which is actually not a true assumption, right, because, you know, if one house's price on one street is unusually high, probably the price of a different house on the same street will also be unusually high. But this assumption that these epsilon(i) are IID — independently and identically distributed — is one of those assumptions that, you know, is probably not absolutely true, but may be good enough that if you make this assumption, you get a pretty good model.
[00:26:33] And so let's see — under this set of assumptions, this implies that the density, or the probability, of y(i) given x(i) and theta is going to be this — and I'll take this and write it another way. In other words, given x and theta, what's the density — what's the probability — of a particular house's price? Well, it's going to be Gaussian, with mean given by theta transpose x(i) — or theta transpose x — and the variance given by sigma squared. Okay. And so, because the way that the price of a house is determined is by taking theta transpose x as the, you know, quote, true price of the house, and then adding noise — adding error of variance sigma
squared — to it, the assumptions on the left imply that, given x and theta, the density of y, you know, has this distribution — which is really: this is the random variable y, and that's the mean, and that's the variance, of the Gaussian density. Okay.
[00:28:15] Now, um, two pieces of notation — I have one more that you should get familiar with. The reason I wrote the semicolon here is that the way you read this equation is: the semicolon should be read as "parameterized by." And so, because, you know, the alternative way to write this would be to say p of y(i) given x(i) comma theta — but if you were to write this notation that way, this would be conditioning on theta. But theta is not a random variable, so you shouldn't condition on theta, which is why I'm going to write a semicolon. And so the way you read this is: the probability of y(i) given x(i) and — excuse me — parameterized by theta, is equal to
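(The board equations described above, reconstructed in LaTeX — with the semicolon read as "parameterized by.")

```latex
% Model: y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}, \quad \epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)
p\!\left(y^{(i)} \mid x^{(i)}; \theta\right)
  = \frac{1}{\sqrt{2\pi}\,\sigma}
    \exp\!\left(-\frac{\left(y^{(i)} - \theta^T x^{(i)}\right)^2}{2\sigma^2}\right)
% Equivalently:
\left(y^{(i)} \mid x^{(i)}; \theta\right) \sim \mathcal{N}\!\left(\theta^T x^{(i)},\, \sigma^2\right)
```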
that formula. Okay. If you don't understand this distinction — again, don't worry too much about it. In statistics there are multiple schools of statistics, called Bayesian statistics and frequentist statistics; this is a frequentist interpretation. For the purposes of machine learning, don't worry about it — but I find that being consistent [with the] terminology keeps some of our statistician friends from getting really upset, so, you know, I try to follow the statistics convention. It's probably an unnecessary flag, I guess, but for practical purposes it's not that important — if you get this notation wrong on your homework, don't worry about it, we won't penalize you, but I'll try to be consistent. But this just means that theta, in this view, is not a random variable — it's just that theta is a set of parameters that parameterizes this probability distribution. Okay. And the
way to read the second equation is — when you write these equations, you usually don't write them down with the parentheses, but the way to parse this equation is to say that this thing, as a random variable — the random variable y given x and parameterized by theta, this thing that I just drew in green parentheses — is a Gaussian with that distribution. Okay. All right — any questions about this?
[00:30:35] So it turns out that if you are willing to make those assumptions, then linear regression falls out almost naturally from the assumptions we just made. And in particular, under the assumptions we just made, the likelihood of the parameters theta — so this is pronounced "the likelihood of the parameters theta," L of theta — which is defined as the probability of the data, right — so this is the probability of all the values of y, of y(1) up to y(m), given all the x's and parameterized by theta — this is equal to
the product from i equals 1 through m of p of y(i) given x(i), parameterized by theta. Because we assume the errors are IID, right — the error terms are independently and identically distributed from each other — the probability of all of the observations, of all the values of y in our training set, is equal to the product of the probabilities, because of the independence assumption we made. And so, plugging in the definition of p of y given x parameterized by theta that we had up there, this is equal to the product [of the Gaussian densities].
[00:32:36] Okay, now, again, one more piece of terminology. You know, another common question is, if you say, "Hey, Andrew, what's the difference between likelihood and probability?" Right — and so the likelihood of the parameters is exactly the same thing as the probability of the data; but the reason we sometimes talk about likelihood and sometimes talk about probability is, we think of likelihood — so
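(The likelihood just defined — a product of per-example Gaussian densities under the IID assumption — sketched on a toy training set; all names here are my own, not from the lecture.)

```python
import numpy as np

def likelihood(theta, X, y, sigma=1.0):
    """L(theta) = prod_i p(y_i | x_i; theta), with Gaussian, IID error terms."""
    resid = y - X @ theta
    densities = np.exp(-resid ** 2 / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)
    return np.prod(densities)

# Toy training set: 3 examples, intercept column plus one feature
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
theta_true = np.array([0.5, 2.0])
y = X @ theta_true                          # noiseless, so theta_true fits exactly
L_at_truth = likelihood(theta_true, X, y)   # each residual is 0
L_off = likelihood(theta_true + 0.3, X, y)  # perturbed parameters score lower
```

In practice a product of m small densities underflows quickly, which is one more reason to work with the log likelihood instead.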
[00:33:03] this is some function, right — this thing is a function of the data as well as a function of the parameters theta. And if we view this number, whatever this number is — if you view this thing as a function of the parameters, holding the data fixed, then we call that the likelihood. So if you think of the training set — the data — as a fixed thing, and then vary the parameters theta, then I'm going to use the term likelihood; whereas if you view the parameters theta as fixed, and maybe vary the data, I'm going to say probability, right. So you'll hear me — well, I'll try to be consistent; I find I'm pretty good at being consistent, but not perfect — but I'm going to try to say "likelihood of the parameters" and "probability of the data," even though those evaluate to the same thing. It's just, you know, for this function — this function is a function of the data and the parameters — which one
are you viewing as fixed, and which one are you viewing as the variable? So when you view this as a function of theta, then I use the term likelihood. But so — so hopefully you'll hear me say "likelihood of the parameters" — hopefully you won't hear me say "likelihood of the data," right — and then, similarly, hopefully you'll hear me say "probability of the data," and not "probability of the parameters."
[00:34:18] Likelihood of the parameters, okay — so, probability of the data — got it. Sorry — yes — likelihood of the parameters, got it. Yes — sorry — yes, like that, right. Oh, no — so, no, theta is a set of parameters; it's not a random variable. So "likelihood of theta" doesn't mean theta is a random variable, right. By the way, the stuff about what's a random variable and what's not — the semicolon-versus-comma thing — we explain this in more detail in the lecture notes. To me this is, you know, partly a little bit of paying homage to the
religion of frequentist versus Bayesian statistics. From an applied machine learning, operational, what-do-you-write-in-code point of view, it doesn't matter that much. Yeah — but theta is not a random variable; we have the likelihood of the parameters, which is not a random variable.
[00:35:35] What's the rationale for choosing — oh, sure — why is epsilon(i) Gaussian? So it turns out, because of the central limit theorem from statistics, most error distributions are Gaussian, right. If something is an error that's made up of lots of little noise sources which are not too correlated, then by the central limit theorem it will be Gaussian. So if you think that the perturbations are the mood of the seller, what's the school district, you know, what's the weather like, access to transportation — and all of these sources are not too correlated, and you add them up — then the distribution will be
Gaussian. And — I think, yeah — so really, because of the central limit theorem, I think Gaussian has become the default noise distribution. But for things where the true noise distribution is very far from Gaussian, this model doesn't do as well; and in fact, when you see generalized linear models on Wednesday, you'll see how to generalize all of these algorithms to very different distributions, like Poisson and so on.
[00:36:41] All right. So — so we've seen the likelihood of the parameters theta. So I'm going to use lowercase l to denote the log likelihood, and the log likelihood is just the log of the likelihood. And so — and so, log of a product is equal to the sum of the logs, right, and so this is equal to [the expanded sum].
[00:37:49] Okay. And so one of the, you know, well-tested methods in statistics for estimating parameters is to use maximum likelihood estimation — which means: choose theta to maximize the likelihood, right. So you're given a dataset — how would you
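(The log-likelihood expansion being written on the board here, reconstructed in LaTeX.)

```latex
\ell(\theta) = \log L(\theta)
  = \log \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma}
      \exp\!\left(-\frac{\left(y^{(i)} - \theta^T x^{(i)}\right)^2}{2\sigma^2}\right)
  = m \log \frac{1}{\sqrt{2\pi}\,\sigma}
    \;-\; \frac{1}{\sigma^2} \cdot \frac{1}{2}
      \sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2
```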
[00:38:37] like to estimate theta? Well, one natural way to choose theta is to choose whatever value of theta has the highest likelihood — or, in other words, choose the value of theta so that that value of theta maximizes the probability of the data. And so, to simplify the algebra, rather than maximizing the likelihood, capital L, it's actually easier to maximize the log likelihood; but the log is a strictly monotonically increasing function, so the value of theta that maximizes the log likelihood should be the same as the value of theta that maximizes the likelihood. And if you look at the log likelihood we derived, we conclude that if you're using maximum likelihood estimation, what you'd like to do is choose the value of theta that maximizes this thing, right. But this first term is just a constant — theta doesn't even appear in this first term — and so
what you'd like to do is choose the value of theta that maximizes the second term. Notice there's a minus sign there, and so what you'd like to do is — i.e., you know — choose theta to minimize this term.
[00:40:01] Right — oh, and sigma squared is just a constant, right — no matter what sigma squared is, you know — so, so, if you want to minimize this term — excuse me, if you want to maximize this term, the negative of this thing, that's the same as minimizing this term. But this is just J of theta, the cost function you saw earlier for linear regression. Okay — so this little proof shows that choosing the value of theta to minimize the least-squares errors, like you saw last Wednesday, that's just finding the maximum likelihood estimate for the parameters theta, under the set of assumptions we made — that the error terms are Gaussian and IID. Okay. Oh, thank you.
[00:41:03] Oh — is there a situation where using something other than the least-squares cost function would be a
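(A small numerical check of the equivalence just proved — under the lecture's Gaussian, IID assumptions, the theta that maximizes the log likelihood coincides with the least-squares solution. The grid search is purely illustrative; the names are mine.)

```python
import numpy as np

def log_likelihood(theta, X, y, sigma=1.0):
    # l(theta) = m * log(1 / (sqrt(2*pi)*sigma)) - sum(resid^2) / (2*sigma^2)
    resid = y - X @ theta
    return (len(y) * np.log(1.0 / (np.sqrt(2.0 * np.pi) * sigma))
            - np.sum(resid ** 2) / (2.0 * sigma ** 2))

# Synthetic data matching the model: y = theta^T x + Gaussian IID noise
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.uniform(0.0, 5.0, 100)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0.0, 0.5, 100)

# Least-squares estimate via the normal equations: (X^T X)^{-1} X^T y
theta_ls = np.linalg.solve(X.T @ X, X.T @ y)

# Crude grid search for the maximum-likelihood estimate
grid = np.linspace(0.0, 3.0, 301)
theta_mle = max(
    (np.array([a, b]) for a in grid for b in grid),
    key=lambda th: log_likelihood(th, X, y),
)
```

Up to the grid resolution, the two estimates agree, which is the content of the derivation: maximizing l(theta) is minimizing J(theta).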
[00:41:12] better idea? No — so, I think this derivation shows that this is completely equivalent to least squares, right — that if you're willing to assume that the error terms are Gaussian and IID, and if you want to use maximum likelihood estimation, which is a very natural procedure in statistics, then, you know, then you should use least squares.
[00:41:42] If you knew for some reason [the errors] weren't IID, would you change the cost function? Yes — I know — I think that, you know, when building learning algorithms, often we make assumptions about the world that we just know are not a hundred percent true, because it leads to algorithms that are computationally efficient. And so if you knew that your training set was very, very non-IID, there are more sophisticated models you could build — but, yeah, but very often we wouldn't bother. I think, you know, more often than not we might not bother.
I can think of a few special cases where you would bother, but only if you think the assumptions are really, really bad, or if you don't have enough data or something. All right, I want to move on to make sure we get through the rest of things. Any questions? All right. [00:42:39] So, armed with this machinery: what we did here was, we set up a set of probabilistic assumptions. We made certain assumptions about P(y | x), where the key assumption was Gaussian errors that are IID, and then through maximum likelihood estimation we derived an algorithm which turns out to be exactly the least squares algorithm. What I'd like to do is take this framework and apply it to our first classification problem. And so the key steps are, you know: one, make an assumption about P(y | x), P(y | x) parametrized by theta; and second, figure out maximum likelihood estimation.
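For reference, the derivation being summarized here (from earlier in the lecture; the notation below, with m examples, hypothesis theta-transpose-x, and noise variance sigma squared, is a reconstruction rather than a quote) compresses to one line:

```latex
% Assume y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}, with \epsilon^{(i)} \sim \mathcal{N}(0,\sigma^2) iid.
\ell(\theta) = \sum_{i=1}^{m} \log\left[ \frac{1}{\sqrt{2\pi}\,\sigma}
    \exp\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right) \right]
  = m \log\frac{1}{\sqrt{2\pi}\,\sigma}
    \;-\; \frac{1}{\sigma^2} \underbrace{\frac{1}{2}\sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2}_{J(\theta)}
% So maximizing \ell(\theta) over \theta is exactly minimizing the least-squares cost J(\theta).
```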
It's nice to take this framework and apply it to a different type of problem, where the value of y is now either zero or one, since it's a classification problem, OK? [00:43:28] So, let's see. In our first classification problem, we're going to start with binary classification, so the value of y is either 0 or 1, and sometimes we call this binary classification because there are two classes. [00:44:09] So here's a data set, where, yes, this is x and this is y. Um, something that's not a good idea is to apply linear regression to this data set. Sometimes people will do it, and maybe they get away with it, but I wouldn't do it, and here's why. Which is, um, it's tempting to just fit a straight line to this data, and then take the straight line and threshold it at 0.5, and then say, oh, if this is above 0.5, round it off to 1; if it's below 0.5, round it off to 0. But it turns out that this is not a good idea for classification problems.
And here's why. For this data set, it's really obvious what the pattern is, right? Everything to the left of this point should be 0, and everything to the right of that point should be 1. But let's say we now change the data set to just add one more example there, and the pattern is still really obvious: everything to the left of this point should be 0, and everything to the right of that should be 1. But now fit a straight line to this data set with this extra point there. It's not even an outlier; it's really obvious that this point way out there should be labeled 1. But with this extra example, if you fit a straight line to the data, you end up with maybe something like that. And somehow, having this one extra example really didn't change anything, right? But somehow the straight line I fit moved from the blue line to the green line.
And if you now threshold it at 0.5, you end up with a very different decision boundary. And so linear regression is just not a good algorithm for classification. Some people use it, and sometimes, again, if they're lucky it's not too bad, but I personally never use linear regression for classification problems, right, because you just don't know if you'll end up with a really bad fit to the data like this. Um, [00:45:55] oh, and the other unnatural thing about using linear regression for a classification problem is that, you know, for a classification problem the values are 0 or 1, right? And so for it to output negative values, or values even greater than 1, seems strange. [00:46:18] So what I'd like to share with you now is really probably by far the most commonly used classification algorithm, called logistic regression.
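The failure mode just described can be reproduced numerically. This is a minimal sketch, with made-up one-dimensional data standing in for the board drawing: a least-squares line thresholded at 0.5 gives a sensible boundary, until one far-away but obviously-positive example drags it.

```python
import numpy as np

# Hypothetical stand-in for the board data: negatives on the left, positives on the right.
x = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

def threshold_boundary(x, y):
    """Fit a least-squares line to (x, y) and return the x where it crosses 0.5."""
    slope, intercept = np.polyfit(x, y, 1)
    return (0.5 - intercept) / slope

b_before = threshold_boundary(x, y)  # 4.5 by symmetry: a sensible boundary

# Add one extra, obviously-positive example far to the right.
x2, y2 = np.append(x, 30.0), np.append(y, 1.0)
b_after = threshold_boundary(x2, y2)  # the fitted line tilts and the boundary shifts right

print(b_before, b_after)
```

Even though the new point agrees with the existing pattern, the 0.5 crossing moves noticeably, which is exactly the blue-line-to-green-line shift described above.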
I always say the two learning algorithms I probably use the most often are linear regression and logistic regression. Yeah. And this is the algorithm. So as we design the logistic regression algorithm, one of the things we might naturally want is for the hypothesis to output values between 0 and 1, right? And this is mathematical notation for that: the value of h(x), or h subscript theta of x, lies in the interval [0, 1]. Right, the 0-to-1 square bracket is the set of all real numbers from 0 to 1. So this says we want the hypothesis to output values in between 0 and 1, in the set of all numbers from 0 to 1. And so we're going to choose the following form of the hypothesis. [00:47:40] So we will define the function g(z), which looks like this, and this is called the sigmoid, or the logistic, function. These are synonyms; they mean exactly the same thing. So we can call it the sigmoid function or the logistic function; it means exactly the same thing.
But I'm going to choose a function g(z), and this function is shaped as follows. If you plot this function, you find that it looks like this, where, if the horizontal axis is z, then this is g(z). And so it crosses the vertical axis at 0.5, and it, you know, starts off really close to 0, rises, and then asymptotes towards 1, OK? And so g(z) outputs values between 0 and 1. And what logistic regression does is, let's see: so previously, for linear regression, we had chosen this form for the hypothesis, right? We just made a choice that said that housing prices are a linear function of the features x. And what logistic regression says is: theta-transpose-x could be bigger than 1, it could be less than 0, which is not very natural; but it's going to take theta-transpose-x and pass it through this sigmoid function g, so this forces the output values to lie only between 0 and 1.
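The function being described is the standard logistic function, g(z) = 1 / (1 + e^(-z)); the formula itself is on the board rather than in the transcript, so it is restated here. A minimal sketch:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# g(0) = 0.5, g(z) approaches 0 as z -> -infinity and 1 as z -> +infinity,
# so g always outputs values strictly between 0 and 1.
print(sigmoid(np.array([-10.0, 0.0, 10.0])))
```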
OK. So, you know, when designing a learning algorithm, sometimes you just have to choose the form of the hypothesis: how are you going to represent the function h, or h subscript theta? And so we're making that choice here today. And if you're wondering, you know, there are lots of functions that we could have chosen, right? Why not this function, or why not that one? There are lots of functions with vaguely this shape that go between 0 and 1. So why are we choosing this one specifically? It turns out that there's a broader class of algorithms called generalized linear models, which you'll hear about on Wednesday, of which this is a special case. So we've seen linear regression; you'll see logistic regression in a second; and on Wednesday you'll see that both of these are examples of a much bigger set of algorithms derived using a broader set of principles. So, for now, just, you know, take my word for it that we want to use the logistic function.
It'll turn out, as you'll see on Wednesday, that there's a way to derive even this function from more basic principles, rather than just pulling it out of a hat. But that doesn't happen until then, so for now, let me just pull this out of a hat and say that's the one we want to use. [00:50:39] So let's make some assumptions about the distribution of y given x, parametrized by theta. So I'm going to assume that the data has the following distribution: the probability of y being 1 (again, from the breast cancer prediction example that we had in the first lecture, right, this would be the chance of a tumor being cancerous, of being malignant), the chance of y being 1 given the size of the tumor, that's the feature x, parametrized by theta, is equal to the output of your hypothesis. So, in other words, we're going to assume that what you want your learning algorithm to do is input the features and tell me: what's the chance that this tumor is malignant?
Right, what's the chance that y is equal to one? And by logic, I guess, because y can only be one or zero, the chance of y being equal to zero has got to be one minus that. Right? Because if a tumor has a 10% chance of being malignant, that means it must have a 90% chance of being benign, right, since these two probabilities must add up to one. [00:52:13] I'll say it again. [A student asks whether the two probabilities could be swapped.] Oh, yes, you can, but I think it's just sort of a convention. Yeah, sure, you could assume that P(y = 1) was this and P(y = 0) was that, but I think either way it's just which one you call the positive example and which you call the negative example. Um. And now, bearing in mind that y, by definition, because it's a binary classification problem, can only take on the two values 0 and 1, there's a nifty little algebraic way to take these two equations and write them as one equation.
And this will make some of the math a little bit easier. So I'm going to take these two equations, take these two assumptions, take these two facts, and compress them into one equation, which is this. Oh, and I dropped the theta subscript, just to simplify the notation; I'm going to be a little bit sloppy sometimes about whether or not I write the theta there. OK. But these two definitions of P(y | x; theta), bearing in mind that y is either 0 or 1, can be compressed into one equation like this. And then let's just check. If y is equal to one, then this becomes h(x) to the power of one, times this other thing to the power of zero, right? If y is equal to 1, then 1 minus y is 0, and, you know, anything to the power of 0 is just equal to 1. And so if y is equal to 1, you end up with P(y | x; theta) equal to h(x), which is just what we had there.
And conversely, if y is equal to 0, then this exponent will be 0 and that one will be 1, and so you end up with P(y | x; theta) equal to 1 minus h(x), which is just equal to that second equation, OK? Right. And so this is a nifty way to take these two equations and compress them into one line, because depending on whether y is zero or one, one of these two terms switches off, because it's exponentiated to the power of zero, and anything to the power of zero is just equal to one. So one of these terms is just, you know, equal to one and drops out, leaving the other term, selecting the appropriate equation depending on whether y is zero or one, OK? So with this little notational trick, we'll make the later derivations simpler. [00:55:31] So, all right.
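The compressed form on the board is p(y | x; theta) = h(x)^y (1 - h(x))^(1 - y), for y in {0, 1}. A tiny sketch checking that each value of y switches off the right factor, using the 10%-malignant example from a moment ago:

```python
def p_y_given_x(y, h):
    """Compressed Bernoulli form: h**y * (1 - h)**(1 - y), valid for y in {0, 1}."""
    return h ** y * (1.0 - h) ** (1 - y)

h = 0.1  # say the hypothesis outputs a 10% chance the tumor is malignant
print(p_y_given_x(1, h))  # the y = 1 branch picks out h(x)
print(p_y_given_x(0, h))  # the y = 0 branch picks out 1 - h(x)
```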
All right, so we're going to use maximum likelihood estimation again. So let's write down the likelihood of the parameters. So it's actually the probability of all the y's given all the x's, parametrized by theta, which is equal to this, which is now equal to the product from i = 1 through m of h(x^(i)) to the power of y^(i), times (1 - h(x^(i))) to the power of (1 - y^(i)), OK? Where all I did was take this definition of P(y | x; theta), you know, from after we did that little exponentiation trick, and write it in here. [00:56:50] And then, for maximum likelihood estimation, we'll want to find the value of theta that maximizes the likelihood, maximizes the likelihood of the parameters. And so, same as what we did for linear regression, to make the algebra a bit more simple, we're going to take the log of the likelihood, and so compute the log likelihood. And so, let's see, take the log of that.
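Written out (m training examples, with h_theta the sigmoid hypothesis), the likelihood just stated and the log likelihood that appears on the board are:

```latex
L(\theta) = p(\vec{y} \mid X; \theta)
          = \prod_{i=1}^{m} h_\theta(x^{(i)})^{\,y^{(i)}}
            \left(1 - h_\theta(x^{(i)})\right)^{1 - y^{(i)}}

\ell(\theta) = \log L(\theta)
             = \sum_{i=1}^{m} y^{(i)} \log h_\theta(x^{(i)})
             + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right)
```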
You end up with the log likelihood ell(theta) [written out on the board]. And so, in other words, what you then want to do is try to choose the value of theta to try to maximize ell(theta). [00:58:10] So, just to summarize where we are, right: if you're trying to predict malignancy versus benignness of tumors, you have a training set with (x^(i), y^(i)); you define the likelihood, and then the log likelihood; and then what you need to do is have an algorithm, such as gradient descent (gradient ascent, we'll talk about that in a sec) to try to find the value of theta that maximizes the log likelihood. And then, having chosen the value of theta, when a new patient walks into the doctor's office, you would, you know, take the features of the new tumor, and then use h subscript theta to estimate the chance, for this new tumor, for the new patient that walks in tomorrow, estimate the chance that this new thing is malignant. [00:58:54] OK.
So the algorithm we're going to use to choose theta, to try to maximize the log likelihood, is gradient ascent, or batch gradient ascent. And what that means is, we will update the parameters theta_j according to: theta_j plus the partial derivative, with respect to theta_j, of the log likelihood, OK? And the differences from what you saw for linear regression from last time are the following, just two differences, I guess. For linear regression, last week, I had written this down: theta_j gets updated as theta_j minus the partial derivative with respect to theta_j of J(theta), right? So you saw this on Wednesday. So the two differences between these are: well, first, instead of J(theta), you're now trying to optimize the log likelihood instead of the squared cost function; and the second change is, previously you were trying to minimize the squared error, that's why we had the minus, and today you're trying to maximize the log likelihood, which is why there's a plus sign.
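Side by side, the two update rules being contrasted (alpha is the learning rate, which the lecture adds back in a moment):

```latex
% Linear regression: gradient descent, minimizing the squared-error cost J
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)
% Logistic regression: gradient ascent, maximizing the log likelihood \ell
\theta_j := \theta_j + \alpha \frac{\partial}{\partial \theta_j} \ell(\theta)
```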
OK. And so gradient descent, you know, is trying to climb down a hill, whereas gradient ascent, where you have a concave function like this, is trying to, right, climb up the hill, rather than climb down into the valley. So that's why there's a plus symbol here instead of a minus: because we maximize the function rather than minimize the function. [01:00:44] So the last thing to really flesh out this algorithm, which is done in the lecture notes, but which I don't want to do to you today, is to plug the definition of h subscript theta into this equation, and then take this thing, so that's the log likelihood of theta, and then through, you know, calculus and algebra, you can take derivatives of this whole thing with respect to theta. This is done in detail in the lecture notes; I don't want to use class time for it, but go ahead and take the derivatives of this big formula with respect to the parameters theta, in order to figure out what that thing is.
Right, what is this thing that I just circled? And it turns out that if you do so, you will find that batch gradient ascent is the following: you update theta_j according to... actually, I'm sorry, I forgot the learning rate. Yeah, there's the learning rate, the alpha, the learning rate alpha, times this. Because this term here is the partial derivative with respect to theta_j, and the full calculus-and-so-on derivation is given in the lecture notes. [01:02:12] [Student: is there a chance of local maxima in this case?] No, there isn't. It turns out that this function, the log likelihood function ell(theta) for logistic regression, always looks like that. So this is a concave function, so there are no local optima: the only maximum is the global maximum.
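The circled derivative, worked out in the lecture notes, gives the update theta_j := theta_j + alpha * sum_i (y^(i) - h_theta(x^(i))) * x_j^(i). A minimal batch gradient ascent sketch on hypothetical toy data (the learning rate and iteration count are arbitrary choices, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_ascent(X, y, alpha=0.01, iters=10000):
    """Batch gradient ascent on the logistic log likelihood.

    X is the m-by-n design matrix with a leading column of ones (intercept).
    Each step applies theta_j += alpha * sum_i (y_i - h(x_i)) * x_ij,
    the gradient of the log likelihood as derived in the lecture notes.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = sigmoid(X @ theta)          # current predicted probabilities
        theta += alpha * X.T @ (y - h)  # climb the concave log likelihood
    return theta

# Toy 1-D data: intercept column plus one feature.
X = np.array([[1.0, v] for v in [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
theta = logistic_regression_ascent(X, y)
print(sigmoid(X @ theta))  # below 0.5 for the three negatives, above for the positives
```

Because ell(theta) is concave, this climbs to the single global maximum from any starting point, which is the answer given to the local-maxima question above.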
There's actually another reason why we chose the logistic function: if you choose the logistic function, rather than some other function with this shape, you're guaranteed that the likelihood function has only one global maximum, and that's actually a big positive. Actually, what you'll see on Wednesday is that there's a big class of algorithms, of which linear regression is one example and logistic regression is another example, and for all of the algorithms in this class, there are no local optima problems when you derive them this way. So you'll see that on Wednesday when we talk about this. [01:03:05] OK. So, actually, I think here's just one question for you to think about. This looks exactly the same as what we derived for linear regression, right? Actually, the difference is: for linear regression, I had a minus sign here, and I reversed these two terms. I think there's a sign flip, so if you put the minus sign there and reverse these two terms, taking the minus inside, this is actually exactly the same as what we had come up with for linear regression.
[01:03:31] actually exactly the same as what we had come up with for linear regression. So why is this different? Right, I started off saying don't use linear regression for classification problems, because of that problem — that a single example could really, you know... I'm sorry, I started off with an example showing that linear regression is really bad for classification, and then we did all this work and came back to the same algorithm. So what happened? [01:04:00] All right — cool, awesome. Right, so what happened is that the definition of h of theta is now different than before, but at the surface level the equation turns out to be the same, okay? And again, it turns out that for every algorithm in this course, as you'll see on Wednesday, you end up with the same thing. It's actually a general property of a much bigger class of algorithms called generalized linear models. [01:04:22] Although — yeah, there's an interesting
historical divergence here: because of the confusion between these two algorithms in the early history of machine learning, there was some debate, with academics saying "no, I invented that" — "no, I invented that" — when they're actually different algorithms. All right, any questions? [01:04:48] Oh, great question: is there an equivalent of the normal equations for logistic regression? The short answer is no. For linear regression, the normal equations give you a one-shot way to just find the best value of theta. For logistic regression there is no known closed-form equation that finds the best value of theta, which is why you always have to use an iterative optimization algorithm such as gradient ascent or — as we'll see in a second — Newton's method. [01:05:21] Cool. So that's a great lead-in to the last topic for today, which is Newton's method. [01:05:56] Um, you know, gradient ascent, right — it's a good algorithm; I use gradient
ascent all the time — but it takes a baby step, and another baby step, and another baby step; it takes a lot of iterations for gradient ascent to converge. [01:06:09] There's another algorithm called Newton's method which allows you to take much bigger jumps toward the best theta. So there are problems where you might need, let's say, a hundred iterations or a thousand iterations of gradient ascent, where if you run this algorithm called Newton's method you might need only ten iterations to get a very good value of theta. But each iteration will be more expensive — we'll talk about the pros and cons in a second. [01:06:33] But, um, let's describe this algorithm, which is sometimes much faster than gradient ascent for optimizing the value of theta, okay? So, um, what we'd like to do is — let me use a simplified one-dimensional problem to describe Newton's method. So I'm going to solve a slightly
[01:07:05] different problem with Newton's method, which is: say you have some function f, and you want to find theta such that f of theta is equal to zero, okay? So this is the problem that Newton's method solves. And the way we're going to use this later is: what you really want is to maximize ℓ of theta, right? Well, at the maximum the first derivative must be zero — i.e., you want a value where the derivative, ℓ prime of theta, is equal to zero, right? And ℓ prime is the derivative with respect to theta — ℓ prime is just another notation for the first derivative. [01:07:59] So whether you want to maximize the function or minimize the function, what that means is you want to find a point where the derivative is equal to zero. So the way we're going to use Newton's method is: we're going to set f of theta equal to the derivative, and then try to find the point where the derivative
is equal to zero, okay? But to explain Newton's method, I'm going to, you know, work on this other problem, where you have a function f and you just want to find the value of theta where f of theta is equal to zero; then we'll set f equal to ℓ prime of theta, and that's how we'll apply this to logistic regression. [01:08:33] So let me draw in pictures how this algorithm works. All right, so let's say that's the function f, and, you know, to make this drawable on a whiteboard I'm going to assume theta is just a real number for now — theta is just a single, you know, scalar, a real number. [01:09:07] So this is how Newton's method works. Oh, and the goal is to find this point, right — the goal is to find the value of theta where f of theta is equal to zero, okay? So let's say you start off at this point, right? At the first iteration — you know, normally you'd initialize theta to zero
or something — but let's say you start off at that point. [01:09:32] This is how one iteration of Newton's method will work. You start off with theta_0 — that's just the first value, the first iteration. What we're going to do is look at the function f and find the line that's tangent to f — so take the derivative of f, and find the line that's just tangent to f at that point. Take that red line there, which just touches the function f; we're going to use this straight-line approximation to f, and solve for where the straight line crosses the horizontal axis. [01:10:15] Okay, and then we're going to set this — and that's one iteration of Newton's method. So we're going to move from this value to this value. Then, in the second iteration of Newton's method, we're going to look at this point and again, you know, take a line that's
[01:10:33] just tangent to it, then solve for where this touches the horizontal axis, and then that's after two iterations of Newton's method, right? And then you repeat. Sometimes you can overshoot a little bit, but that's okay, right? And then it cycles back around — that's theta_3 — and then you take this, let's say, theta_4. [01:11:12] So you can tell that, um, Newton's method is actually a pretty fast algorithm: within just, what, one, two, three, four iterations, we've gotten really, really close to the point where f of theta is equal to zero. [01:11:29] So let's write out the math for how you do this. Let's see — let me just write out and derive, you know, how you go from theta_0 to theta_1. I'm going to use this horizontal distance — I'm going to denote it Delta; this triangle is the upper-case Greek letter Delta, right? This is lower-
case delta; that's upper-case Delta, right? And then the height here — well, that's just f of theta_0. This is the height; it's just f of theta_0. [01:12:15] And so, let's see — what we'd like to do is solve for the value of Delta, because one iteration of Newton's method is to set theta_1 to theta_0 minus Delta, right? So how do you solve for Delta? Well, from calculus we know that the slope of the function f is the rise over the run — the height over the width — and so we know that the derivative, f prime at the point theta_0, is equal to the height, f of theta_0, divided by the horizontal distance, right? So the derivative — meaning the slope of the red line — is, by the definition of the derivative, this ratio of the height to the width. And so Delta is equal to f of theta_0 over f prime of theta_0, and if you plug that in, then you find
that a single iteration of Newton's method is the following rule: theta_{t+1} gets updated as theta_t minus f of theta_t over f prime of theta_t, okay — where instead of 0 and 1 I've replaced the subscripts with t and t+1. [01:13:46] And finally, you know, the very first thing we did was to let f of theta be equal to ℓ prime of theta, right — because we want to find the place where the first derivative of ℓ is zero. Then this becomes: theta_{t+1} gets updated as theta_t minus ℓ prime of theta_t over ℓ double-prime of theta_t. So it's really the first derivative divided by the second derivative. [01:14:38] Newton's method is a very fast algorithm, and Newton's method enjoys a property called quadratic convergence — not a great name; don't worry too much about what it means. But the informal meaning is this: suppose on one iteration Newton's method has 0.01 error — on the x-axis, you're 0.01 away from the true minimum.
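The scalar update theta := theta − f(theta)/f′(theta) is easy to try out. Here is a minimal sketch; the function names and the example f(theta) = theta² − 2, whose positive root is √2, are my own choices for illustration, not from the lecture:

```python
def newton_1d(f, f_prime, theta, iters=6):
    """Solve f(theta) = 0 by iterating theta := theta - f(theta) / f'(theta)."""
    for _ in range(iters):
        theta = theta - f(theta) / f_prime(theta)
    return theta

# Example: find the root of f(theta) = theta^2 - 2, i.e. sqrt(2).
f = lambda t: t * t - 2.0
f_prime = lambda t: 2.0 * t
root = newton_1d(f, f_prime, theta=1.0)
```

Printing the error |theta − √2| after each iteration shows it falling from roughly 1e-1 to 1e-3 to 1e-6 to 1e-12 — the digit-doubling behavior the lecture describes.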
That is, 0.01 away from the true value where f is zero. After one iteration, the error could go to 0.0001, and after two iterations to roughly 0.00000001. [01:15:26] Roughly — under certain assumptions, that the function is smooth and not too far from quadratic — the number of significant digits to which you have converged to the minimum doubles on a single iteration. This is called quadratic convergence, and so when you get near the minimum, Newton's method converges extremely rapidly, right? After a single iteration it becomes much more accurate, and after another iteration it becomes way, way, way more accurate, which is why Newton's method requires relatively few iterations. [01:15:58] And, let's see — I have written out Newton's method for when theta is a real number. When theta is a vector, [01:16:13] the generalization of the rule I wrote above is the following: theta_{t+1} gets updated as theta_t plus H inverse times the gradient of ℓ, where H is the Hessian
[01:16:28] plus h that where X is the Hessian matrix so these details are written in [01:16:37] matrix so these details are written in lecture notes but to give you a sense it [01:16:40] lecture notes but to give you a sense it when theta is a vector this is a vector [01:16:43] when theta is a vector this is a vector of derivatives it says I guess this part [01:16:48] of derivatives it says I guess this part n plus 1 dimensional if nature is an RN [01:16:53] n plus 1 dimensional if nature is an RN plus 1 then this derivative respect to [01:16:57] plus 1 then this derivative respect to theta of the log-likelihood becomes a [01:16:59] theta of the log-likelihood becomes a vector of derivatives and the Hessian [01:17:01] vector of derivatives and the Hessian matrix this becomes in matrixes are n [01:17:04] matrix this becomes in matrixes are n plus 1 by n plus 1 so becomes a square [01:17:09] plus 1 by n plus 1 so becomes a square matrix with the dimension equal to the [01:17:11] matrix with the dimension equal to the parameter vector theta and the Hessian [01:17:14] parameter vector theta and the Hessian matrix is defined as the matrix of [01:17:16] matrix is defined as the matrix of partial derivatives right so and so the [01:17:26] partial derivatives right so and so the disadvantage of Newton's method is that [01:17:29] disadvantage of Newton's method is that in high dimensional problems if theta is [01:17:32] in high dimensional problems if theta is a vector that each step of Newton's [01:17:34] a vector that each step of Newton's method is much more expensive because [01:17:37] method is much more expensive because you're either solving a linear system [01:17:39] you're either solving a linear system craisins or having to convert to pretty [01:17:41] craisins or having to convert to pretty big matrix so if theta is 10 dimensional [01:17:44] big matrix so if theta is 10 dimensional you know this involves inverting a 10 by [01:17:47] you know this involves 
inverting a 10-by-10 matrix, which is fine; but if theta were 10,000- or 100,000-dimensional, then each iteration requires computing something like a 100,000-by-100,000 matrix and inverting it, which is very hard — it's very, very difficult to do that in very high-dimensional problems. [01:18:04] So, you know, some rules of thumb: if the number of parameters in your logistic regression is not too big — if you have 10 parameters or 50 parameters — I would almost certainly... I would very likely use Newton's method, and you'd probably get convergence in maybe ten iterations, or, you know, 15 iterations, or even fewer than ten. But with a very large number of parameters — if you have, you know, ten thousand parameters — then rather than dealing with a 10,000-by-10,000 matrix, or even bigger, a 50,000-by-50,000 matrix if you have 50,000 parameters, I would use gradient descent instead.
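To make the vector case concrete, here is a minimal NumPy sketch of Newton's method for the logistic-regression log-likelihood — my own illustration, not code from the course. It assumes `X` carries an intercept column of ones and `y` is in {0, 1}, and it solves the linear system H·d = gradient rather than explicitly inverting the Hessian, which is exactly the per-step cost being discussed:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(X, y, iters=8):
    """Maximize the logistic-regression log-likelihood with Newton's method.

    Gradient: X^T (y - h);  Hessian: -X^T diag(h(1-h)) X  (ell is concave).
    Update:   theta := theta - H^{-1} grad, done via a linear solve.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = sigmoid(X @ theta)
        grad = X.T @ (y - h)
        H = -(X.T * (h * (1.0 - h))) @ X   # X^T diag(w) X via broadcasting
        theta -= np.linalg.solve(H, grad)  # cheaper than forming H^{-1}
    return theta

# Tiny non-separable 1-D example (intercept column plus one feature).
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, -0.5],
              [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
theta = newton_logistic(X, y)
```

With n + 1 = 2 parameters the solve is trivial; the point of the rules of thumb above is that this solve is the step whose cost blows up as the parameter count grows.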
[01:18:42] Okay — but if the number of parameters is not too big, so that the computational cost per iteration is manageable, then Newton's method converges in a very small number of iterations and can be a much faster algorithm than gradient descent. [01:19:01] All right, so that's it for Newton's method. On Wednesday — in the remaining time on Wednesday — you'll hear about generalized linear models. I think, unfortunately, I promised to be in Washington, D.C. tonight, I guess through Wednesday, so you'll hear from — I think Anand will give the lecture on Wednesday, but I will be back next week. Unfortunate timing, but because of that I'll step out after this lecture. So thanks, everyone.

================================================================================
LECTURE 004
================================================================================
Lecture 4 - Perceptron & Generalized Linear Model | Stanford CS229: Machine Learning (Autumn 2018)
Source: https://www.youtube.com/watch?v=iZTeva0WSTQ
---
Transcript

[00:00:04] A couple of announcements before we get started. So, first of
all, ps1 is out — problem set 1. It is due on the 17th; that's two weeks from today, so you have exactly two weeks to work on it. You can take up to two or three late days — I think you can take up to three late days. There's a good amount of programming and a good amount of math you need to do. [00:00:35] The ps1 solutions need to be uploaded to Gradescope. You'll have to make two submissions: one submission will be a PDF file, for which you can either use a LaTeX template that we provide or handwrite it, but you're strongly encouraged to use the LaTeX template; and there is a separate coding assignment, for which you'll have to submit code as a separate Gradescope assignment. So you're going to see two assignments in Gradescope: one is for the written part, the other is for the programming part. [00:01:11] With that,
let's jump right into today's topics. [00:01:15] So today we're going to cover, briefly, the perceptron algorithm; then, you know, a good chunk of today is going to be the exponential family and generalized linear models; and we'll end with softmax regression for multi-class classification. [00:01:34] So: the perceptron. First of all, the perceptron algorithm, I should mention, is not something that is widely used in practice — we study it mostly for historical reasons, and also because it's nice and simple, it's easy to analyze, and we also have homework questions on it. [00:02:03] So, logistic regression: we saw that logistic regression uses the sigmoid function, [00:02:33] which essentially squeezes the entire real line — from minus infinity to infinity — to between zero and
one, and the zero and one kind of represent a probability, right? [00:02:53] You could also think of a variant of that, which would be the perceptron. So in the sigmoid function, at z equals 0, g of z is one half; as z tends to minus infinity, g tends to 0; and as z tends to plus infinity, g tends to 1. The perceptron algorithm uses a somewhat similar but different function: [00:03:49] g of z in this case is 1 if z is greater than or equal to 0, and 0 if z is less than 0. So you can think of this as the hard version of the sigmoid function, right? [00:04:16] And this leads to the hypothesis function here being h_theta of x equals g of theta transpose x — theta is the parameter, x is the input — and h_theta of x will be 0 or 1, depending on whether theta transpose x was less than 0 or greater than or equal to 0.
Similarly, in logistic regression we had h_theta of x equal to g of theta transpose x, where g is the sigmoid function. [00:05:10] Both of them have a common update rule, which on the surface looks the same: theta_j := theta_j + alpha (y^(i) − h_theta(x^(i))) x_j^(i). [00:05:37] So the update rules for the perceptron and logistic regression look the same, except that h_theta of x means different things in the two different scenarios. We also saw that it was similar for linear regression as well, and we're going to see why this is actually a more common theme. [00:06:01] So what's happening here? If you inspect this equation to get a better sense of what's happening in the perceptron algorithm: this quantity over here is a scalar, right? It's the difference between y^(i), which can be either 0 or 1, and h_theta of x^(i), which can also be either 0 or 1.
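Putting the hard-threshold hypothesis and this one-example-at-a-time update rule together, a perceptron sketch in plain Python might look like the following — a toy illustration with made-up data and function names of my own, using labels in {0, 1} as in the lecture:

```python
def predict(theta, x):
    """Hard-threshold hypothesis: g(theta^T x) = 1 if theta^T x >= 0, else 0."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1 if z >= 0 else 0

def perceptron_train(examples, alpha=1.0, epochs=10):
    """One example at a time: theta := theta + alpha * (y - h_theta(x)) * x."""
    n = len(examples[0][0])
    theta = [0.0] * n
    for _ in range(epochs):
        for x, y in examples:
            err = y - predict(theta, x)  # 0 if correct, +1 or -1 if wrong
            if err != 0:
                theta = [t + alpha * err * xi for t, xi in zip(theta, x)]
    return theta

# Toy separable data; each x has a leading 1 for the intercept term.
data = [([1.0, 2.0, 1.0], 1), ([1.0, 1.5, 2.0], 1),
        ([1.0, -1.0, -0.5], 0), ([1.0, -2.0, -1.0], 0)]
theta = perceptron_train(data)
```

Note that theta only moves on mistakes — exactly the (y − h) factor being discussed: zero when the prediction is right, ±1 when it is wrong.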
makes a prediction h_theta(x^(i)) for a given x^(i), [00:06:34] this quantity will be 0 if the algorithm got it right already, and it will be either +1 or -1 if it was wrong: [00:06:59] if the ground truth was 1 and the algorithm predicted 0, this evaluates to +1 (wrong, and y^(i) equals 1), [00:07:13] and similarly it is -1 if it was wrong and y^(i) equals 0. [00:07:27] So to see what's happening, it's useful to look at this picture. This is the input space, [00:07:46] and let's imagine there are two classes, boxes and, let's say, circles, and you want to learn an algorithm that can separate these two classes. [00:08:04] Imagine that what the algorithm has learned so far is a theta that represents this decision boundary, so this line represents theta^T x = 0, and
anything above it has theta^T x greater than zero, and anything below has theta^T x less than zero. [00:08:32] And let's say the algorithm is learning one example at a time, and a new example comes in; this time it happens to be a square, a box, but the algorithm has misclassified it. [00:08:55] Now, the vector equivalent of this line, the separating boundary, is a vector that's normal to the line, so this would be theta, and this is our new x. [00:09:11] So this x got misclassified; it's lying on the wrong side of the decision boundary. So what's going to happen here? [00:09:23] Let's call the boxes the 1 class and the circles the 0 class. So y^(i) - h_theta(x^(i)) will be +1, and what the algorithm does is set theta to theta + alpha * x. [00:09:41] So this is the old theta, this is x, and alpha is some small
learning rate. [00:09:48] So it adds (let me use a different color here) alpha times x to theta, and now let's call theta prime the new vector, the updated value, [00:10:07] and the separating hyperplane corresponding to it is whatever is normal to it. So it updated the decision boundary such that x is now included in the positive class. [00:10:25] The idea here is that we want theta to be similar to x, in general, where y is 1, and we want theta to be dissimilar to x when y equals 0. The reason is that when two vectors are similar, their dot product is positive, and when they are dissimilar, their dot product is negative. [00:10:52] What does that mean? If, let's say, this is x, and you have a theta that's kind of pointed away from it, their dot product would be negative, and if you have a
theta that looks like theta prime, then the dot product will be positive, because their angle is less than 90 degrees. [00:11:12] So this essentially means that as theta rotates, the decision boundary, which is perpendicular to theta, rotates with it, and you want to get all the positive x's on one side of the decision boundary. [00:11:24] And what's the most naive way, given x, of trying to make theta closer to x? The simple thing is to just add a component of x in that direction. [00:11:41] This is a very common technique used in lots of algorithms: if you add a vector to another vector, you make the second one closer to the first one, essentially. [00:11:50] So this is the perceptron algorithm: you go example by example in an online manner, and if the example is already correctly classified you do nothing (you get a 0 over here), and if it is misclassified you nudge theta by the example itself.
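The online loop just described can be sketched in a few lines of Python. This is only a minimal illustration of the update theta := theta + alpha * (y - h_theta(x)) * x; the toy data and the names (`perceptron_step`, `alpha`) are my own, not from the lecture:

```python
import numpy as np

def perceptron_step(theta, x, y, alpha=0.1):
    """One online perceptron update: theta := theta + alpha * (y - h_theta(x)) * x."""
    h = 1.0 if theta @ x > 0 else 0.0    # hard-threshold prediction, h in {0, 1}
    return theta + alpha * (y - h) * x   # no change when the prediction was correct

# Toy linearly separable data (assumed for illustration): class 1 above the line x1 + x2 = 0
X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0], [-0.5, -2.0]])
Y = np.array([1.0, 1.0, 0.0, 0.0])

theta = np.zeros(2)
for _ in range(20):                      # a few online passes over the data
    for x, y in zip(X, Y):
        theta = perceptron_step(theta, x, y)

print((X @ theta > 0).astype(float))     # matches Y on this separable toy set
```

On non-separable data (like the example below) this loop would cycle forever, which is exactly the stopping-criterion issue discussed in the Q&A.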
That is, you either add a small component of the example (the vector itself) to theta, or you subtract it, depending on the class of the example. That's about it. Any questions about the perceptron? [00:12:22] Cool, so let's move on to the next topic: exponential families. So an exponential family is essentially a class of... [a student asks about the perceptron] [00:12:42] Yeah, it's not used in practice. For one, it does not have a probabilistic interpretation of what's happening; you kind of have a geometric feel of what's happening with the hyperplane, but it doesn't have a probabilistic interpretation. [00:13:00] Also, the perceptron was pretty famous in, I think, the nineteen fifties or sixties, when people thought this was a good model of how the brain works, and I think it was Marvin Minsky who wrote a paper saying, you know, the perceptron is kind of limited, because it could never classify points arranged like this.
There's no possible separating boundary that can do something as simple as that, and people kind of lost interest in it. [00:13:31] But in fact, what we'll see is that logistic regression is like a softer version of the perceptron itself, in a way. [a student asks about convergence and when to stop] [00:13:47] Yeah, it's up to you; it's a design choice that you make. What you could do is anneal your learning rate with every step: every time you see a new example, decrease your learning rate, until you stop changing theta by a lot. [00:14:05] You're not guaranteed that you'll be able to get every example right; for example here, no matter how long you train, you're never going to find a separating boundary. [00:14:16] So it's up to you when you want to stop training; a common thing is to just decrease the learning rate with every time step until you stop making changes. [00:14:27] All right, let's move on to exponential
families is is a [00:14:33] families so exponential families is is a class of probability distributions which [00:14:37] class of probability distributions which are somewhat nice mathematically right [00:14:39] are somewhat nice mathematically right they're also very closely related to [00:14:42] they're also very closely related to GLM's which we will be going over next [00:14:46] GLM's which we will be going over next right but first we kind of take a deeper [00:14:48] right but first we kind of take a deeper look at exponential families and and and [00:14:52] look at exponential families and and and what they're about so an exponential [00:14:54] what they're about so an exponential family is one whose PDF so whose PDF can [00:15:13] family is one whose PDF so whose PDF can be written in the form my PDF I mean [00:15:16] be written in the form my PDF I mean probability density function with a [00:15:18] probability density function with a discrete distribution then it would be [00:15:20] discrete distribution then it would be the probability mass function and this [00:15:23] the probability mass function and this PDF can be written in the form [00:15:47] right this looks pretty scary let's [00:15:50] right this looks pretty scary let's let's kind of break it down into you [00:15:52] let's kind of break it down into you know what what they actually mean [00:15:54] know what what they actually mean so why over here is the data right and [00:16:00] so why over here is the data right and there is a reason why we call it why [00:16:02] there is a reason why we call it why because yeah a bit larger sure [00:16:28] this is better so why is the data and [00:16:32] this is better so why is the data and the reason there's a reason what we call [00:16:33] the reason there's a reason what we call it Y and not X and and that's because [00:16:36] it Y and not X and and that's because we're going to use exponential families [00:16:38] we're going to use exponential families 
to model the output of your data, in a supervised learning setting. [00:16:45] We're going to see x when we move on to GLMs; until then, we're just going to deal with y's. [00:16:49] So y is the data. Eta is called the natural parameter. [00:17:06] T(y) is called the sufficient statistic; if you have a statistics background and you've come across the term sufficient statistic before, it's the exact same thing, but you don't need to know much about it, because for all the distributions we're going to see today, and in this class, T(y) will be equal to just y. So you can just replace T(y) with y for all the examples today and in the rest of the class. [00:17:42] b(y) is called the base measure, and finally, a(eta) is called the log partition function, and you're going to be seeing a lot of
[00:18:01] this function, the log partition function. Right, so again: y is the data that this probability distribution is trying to model; eta is the parameter of the distribution; [00:18:14] T(y), which will mostly be just y for us (technically, writing T(y) is more correct), like b(y), has to be a function of only y; these functions cannot involve eta. [00:18:35] b(y) is called the base measure, and a(eta), which has to be a function of only eta and constants (no y can be part of a(eta)), is called the log partition function. [00:18:50] The reason why this is called the log partition function is pretty easy to see, because the PDF can be written as

    p(y; eta) = b(y) * exp( eta^T * T(y) ) / exp( a(eta) )

These two are exactly the same; you just take the a(eta) term out. [00:19:38] Oh yeah, you're right, this should be positive, thank you. [00:19:53] So you can think of exp(a(eta)) as a normalizing
constant of the distribution, such that the whole thing integrates to 1, [00:20:01] and therefore the log of that normalizer will be a(eta). So a(eta) is called the log of the partition function; the partition function is a technical term for the normalizing constant of a probability distribution. [00:20:11] Now, you can plug in any definition of b, a and T. [a student asks about dimensions] Sure. So y is your y, and for most of our examples it's going to be a scalar. [00:20:34] Eta can be a vector, but we will be focusing on the case where it's a scalar, except maybe in softmax. T(y) has to match: the dimensions of eta and T(y) have to match, and b(y) and a(eta) are scalars. [00:21:02] So for any choice of a, b and T that you put in (that's completely your choice), as long as the expression integrates to 1, you have a family in the exponential family. [00:21:15] What does that mean? For a specific choice of, say,
some choice of a, b and T, this expression will actually be equal to, say, the PDF of the Gaussian, [00:21:30] in which case, for that choice of T, a and b, you got the Gaussian distribution: a family of Gaussian distributions, such that for any value of the parameter you get a member of the Gaussian family. [00:21:49] And to show that a distribution is in the exponential family, the most straightforward way to do it is to write out the PDF of the distribution in the form that you know, [00:21:58] do some algebraic massaging to bring it into this form, and then do a pattern match and conclude that it's a member of the exponential family. So let's do it for a couple of examples. [00:22:33] So a Bernoulli distribution is one you use to model binary data, and it has a parameter, let's call it phi, which is, you know, the probability of the event happening or
not. [00:23:05] Now, what is the PDF of a Bernoulli distribution? One way to write it is

    p(y; phi) = phi^y * (1 - phi)^(1 - y)

Makes sense? This pattern is like a way of writing a programmatic if/else in math: [00:23:29] whenever y is 1, the second term cancels out, so the answer would be phi, and whenever y is 0, the first term cancels out and the answer is 1 - phi. So this is just a mathematical way to represent an if/else that you would do in programming. [00:23:45] So this is the PDF of the Bernoulli, and our goal is to take this form and massage it into that form, and see what the individual T, b and a turn out to be. [00:24:00] So whenever you see a distribution in this form, a common technique is to wrap it with a log and an exp, because these two cancel out, so this is actually exactly equal to

    exp( log( phi^y * (1 - phi)^(1 - y) ) )

[00:24:36] and if you do some more algebra on this, we
will see that this turns out to be

    exp( log(phi / (1 - phi)) * y + log(1 - phi) )

[00:24:59] It's pretty straightforward to go from here to here; I'll let you verify it yourselves. [00:25:05] But once we have it in this form, it's easy to start doing some pattern matching from this expression to that expression. [00:25:16] So what we see here is: the base measure b(y), if you match this with that, will be just 1, because there's no b(y) term here. [00:25:37] log(phi / (1 - phi)) would be eta, y would be T(y), and -log(1 - phi) would be a(eta); you can see that they kind of match in pattern. [00:25:55] So b(y) would be 1, T(y) is just y, as expected, and eta is equal to log(phi / (1 - phi)). [00:26:20] An equivalent statement is to invert this operation and say phi is equal to 1 / (1 + e^(-eta)). [00:26:33] I'm just flipping the operation: there it went from phi to eta, and it's the
equivalent in the other direction; now it goes from eta to phi. [00:26:44] And a(eta) is going to be... so here we have it as a function of phi, but we got an expression for phi in terms of eta, so you can plug that expression in, with the change of a minus sign. Let me work out the substitution: a is -log(1 - phi) (that's just pattern matching from above), and the reason is that we want an expression in terms of eta; here we got it in terms of phi, but we need to plug in eta. [00:27:35] And this will just be

    a(eta) = log(1 + e^eta)

[00:27:48] So there we go. This kind of verifies that the Bernoulli distribution is a member of the exponential family. Any questions here? [00:27:58] So note that this may look familiar: the expression for phi looks somewhat like the sigmoid function, and this is actually no accident.
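As a sanity check on the pattern match above, you can verify numerically that eta is the logit of phi, that inverting it gives the sigmoid, and that b(y) * exp(eta * y - a(eta)) reproduces phi^y * (1 - phi)^(1 - y). A small sketch (the variable names are my own):

```python
import math

phi = 0.3                                  # canonical Bernoulli parameter
eta = math.log(phi / (1 - phi))            # natural parameter: the logit of phi
a = math.log(1 + math.exp(eta))            # log partition function a(eta)

# Inverting the link recovers phi; this is exactly the sigmoid of eta
phi_back = 1 / (1 + math.exp(-eta))
print(abs(phi_back - phi) < 1e-12)         # True

# With b(y) = 1 and T(y) = y, the exponential-family form matches the Bernoulli PMF
for y in (0, 1):
    p_expfam = math.exp(eta * y - a)
    p_bern = phi**y * (1 - phi)**(1 - y)
    print(abs(p_expfam - p_bern) < 1e-12)  # True for both y = 0 and y = 1
```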
We will see how it actually relates to the sigmoid and to logistic regression in a minute. [00:28:27] So, another example: a Gaussian with fixed variance. [00:28:42] A Gaussian distribution has two parameters, the mean and the variance; for our purposes, we're going to assume a constant variance. [00:28:57] You can also consider the version where the variance is also a variable, but for our course we're only interested in Gaussians with fixed variance, and we are going to assume the variance is equal to one. [00:29:19] So this makes the PDF of the Gaussian look like this: p(y) parametrized by mu. Note that when we start writing it out, we start with the parameters that we are commonly used to (they're also called the canonical parameters), and then we set up a link between the canonical parameters and the natural parameters; that's part of the massaging exercise that we do. [00:29:44] So we're going
to start with the canonical parameters:

    p(y; mu) = (1 / sqrt(2*pi)) * exp( -(y - mu)^2 / 2 )

[00:30:06] So this is the Gaussian PDF with variance equal to 1, [00:30:10] and this can be rewritten as (again, I'm skipping a few algebra steps; straightforward, no tricks there)

    p(y; mu) = (1 / sqrt(2*pi)) * e^(-y^2 / 2) * exp( mu * y - mu^2 / 2 )

[00:30:46] Again, we go through the same exercise and pattern match: (1 / sqrt(2*pi)) * e^(-y^2 / 2) is b(y), mu is eta, y is T(y), and mu^2 / 2 would be a(eta). [00:31:07] So we have a b(y); note that this is a function of only y, there's no eta here. T(y) is just y, and in this case the natural parameter eta is mu, and the log partition function is equal to mu^2 / 2. [00:31:41] And we repeat the same exercise we did here: we start with a log partition function that is parametrized by the canonical parameter, and we use the link between the canonical and the natural parameters, inverted; in this case the inverse link is just the identity, eta = mu.
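The same kind of numeric sanity check works for the Gaussian rewrite above: with b(y) = (1 / sqrt(2*pi)) * e^(-y^2 / 2), T(y) = y and a(eta) = eta^2 / 2, the exponential-family form should reproduce the N(mu, 1) density. A small sketch (the function names are my own):

```python
import math

def gaussian_pdf(y, mu):
    """N(mu, 1) density in its usual form."""
    return (1 / math.sqrt(2 * math.pi)) * math.exp(-(y - mu) ** 2 / 2)

def expfam_pdf(y, eta):
    """Same density as b(y) * exp(eta * y - a(eta)) with a(eta) = eta^2 / 2."""
    b = (1 / math.sqrt(2 * math.pi)) * math.exp(-y ** 2 / 2)
    return b * math.exp(eta * y - eta ** 2 / 2)

mu = 1.7  # for fixed variance 1, the natural parameter equals the mean: eta = mu
for y in (-1.0, 0.0, 2.5):
    print(abs(gaussian_pdf(y, mu) - expfam_pdf(y, mu)) < 1e-12)  # True for each y
```

The two forms agree term by term once you expand -(y - mu)^2 / 2 = -y^2/2 + mu*y - mu^2/2.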
[00:32:05] So a(eta) is a function of only eta; again, here a(eta) was a function of only eta, and T(y) is a function of only y, and b(y) is a function of only y as well. Any questions on this? [a student asks about unknown variance] Yeah, if the variance is unknown, you can still write it as an exponential family, in which case eta will now be a vector, not a scalar; it will have an eta_1 and an eta_2, and you will also have a mapping between each of the canonical parameters and each of the natural parameters. You can do it; it's pretty straightforward. [00:32:51] Right, so these are exponential families. [00:32:59] The reason why we use the exponential family is because it has some nice mathematical properties. [00:33:15] So one property is: if we perform maximum likelihood on the exponential family, when the exponential
the exponential family is parameterized in the natural [00:33:29] family is parameterized in the natural parameters then the optimization problem [00:33:33] parameters then the optimization problem is concave so MLE with respect to ETA is [00:33:42] is concave so MLE with respect to ETA is concave similarly if you flip the sign [00:33:47] concave similarly if you flip the sign and use the the what's called the [00:33:49] and use the the what's called the negative log likelihood so take the log [00:33:51] negative log likelihood so take the log of the expression negated and in in this [00:33:53] of the expression negated and in in this case the negative log likelihood is like [00:33:55] case the negative log likelihood is like the cost function equivalent of doing [00:33:57] the cost function equivalent of doing maximum likelihood you're just flipping [00:33:59] maximum likelihood you're just flipping a sign instead of maximizing you [00:34:00] a sign instead of maximizing you minimize the negative log likelihood so [00:34:02] minimize the negative log likelihood so the and and you know the NLL is there [00:34:05] the and and you know the NLL is there for convex the expectation of why [00:34:25] what does this mean each of the [00:34:31] what does this mean each of the distribution we start with a of a to [00:34:34] distribution we start with a of a to differentiate this with respect to Etta [00:34:37] differentiate this with respect to Etta the lock partition function with respect [00:34:38] the lock partition function with respect to a toss and you get another function [00:34:42] to a toss and you get another function with respect to beta and that function [00:34:44] with respect to beta and that function will is the mean of the distribution as [00:34:47] will is the mean of the distribution as parameterize by a turn and similarly the [00:34:52] parameterize by a turn and similarly the variance of y it's just the second [00:34:59] variance of y it's just the 
second derivative this was the first derivative [00:35:00] derivative this was the first derivative this is the second derivative so the [00:35:12] this is the second derivative so the reason why this is nice is because in [00:35:14] reason why this is nice is because in general for probability distributions to [00:35:16] general for probability distributions to calculate the mean and the variance you [00:35:18] calculate the mean and the variance you generally need to integrate something [00:35:19] generally need to integrate something but over here you just need to [00:35:21] but over here you just need to differentiate which is a lot easier [00:35:22] differentiate which is a lot easier operation and and you will be proving [00:35:32] operation and and you will be proving these properties in your first homework [00:35:39] you provided hint search should be [00:35:42] you provided hint search should be right so now we're going to move on to [00:35:47] right so now we're going to move on to generalized linear models this this is [00:35:51] generalized linear models this this is all we want to talk about exponential [00:35:52] all we want to talk about exponential families any questions yeah exactly so [00:36:06] families any questions yeah exactly so if you're if you're if you're if it's a [00:36:10] if you're if you're if you're if it's a multivariate Gaussian then this data [00:36:12] multivariate Gaussian then this data would be a vector and this would be the [00:36:15] would be a vector and this would be the Hessian [00:36:22] all right let's move on to GLM's [00:36:35] so the GLM is is somewhat like a natural [00:36:40] so the GLM is is somewhat like a natural extension of the exponential families to [00:36:42] extension of the exponential families to include include covariates or include [00:36:47] include include covariates or include your input features in some way right [00:36:49] your input features in some way right so over here we are only dealing with 
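This differentiate-instead-of-integrate property is easy to check numerically. A minimal sketch (my own illustration, not course code) for the Bernoulli family, whose log-partition function is A(eta) = log(1 + e^eta):

```python
import math

# My own numeric check of the property that for an exponential family,
# E[y] = A'(eta) and Var(y) = A''(eta).
# Bernoulli log-partition: A(eta) = log(1 + e^eta).

def log_partition(eta):
    return math.log(1.0 + math.exp(eta))

def d1(f, x, h=1e-5):
    """Central first difference."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

def d2(f, x, h=1e-4):
    """Central second difference."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

eta = 0.7
phi = 1.0 / (1.0 + math.exp(-eta))   # canonical parameter = the mean

mean_via_A = d1(log_partition, eta)
var_via_A = d2(log_partition, eta)

assert abs(mean_via_A - phi) < 1e-6               # A'(eta) = E[y] = phi
assert abs(var_via_A - phi * (1.0 - phi)) < 1e-4  # A''(eta) = Var(y) = phi(1-phi)
```

The same check works for any member of the family: the first derivative of the log-partition function recovers the mean, the second recovers the variance.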
[00:36:52] In the exponential families we're only dealing with the y, which in our case will map to the outputs, but we can actually build a lot of powerful models by choosing an appropriate member of the exponential family and plugging it into a linear model. So, the assumptions we're going to make for GLMs — these are the assumptions, or design choices, that take us from exponential families to generalized linear models.

[00:37:43] The most important assumption is that y given x, parameterized by theta, is a member of an exponential family. By "exponential family" I mean that form — in a particular scenario it could take on any one of these distributions. We only talked about the Bernoulli and the Gaussian, but there are other distributions that are part of the exponential family too. For example — I forgot to mention this — if you have real-valued data, you use a Gaussian; if you have binary data, a Bernoulli; if you have counts — by "real-valued" I mean it can take any value between minus infinity and infinity, and by "count" I mean just the non-negative integers — you can use a Poisson. If you have positive real values, like the volume of some object, or the time to an event where you're only predicting into the future, you can use a gamma or an exponential.

[00:39:31] By the way, there is the exponential family, and there is also a distribution called the exponential distribution — two distinct things. The exponential distribution happens to be a member of the exponential family, but they're not the same thing. You can also have probability distributions over probability distributions, like the beta and the Dirichlet; those mostly show up in Bayesian machine learning or Bayesian statistics.

[00:40:10] So, depending on the kind of data you have: if you're trying to do regression, your y is going to be, say, a Gaussian; if you're doing binary classification, the exponential family member would be a Bernoulli. Depending on the problem, you can choose any member of the exponential family, as parameterized by theta. That's the first assumption: y conditioned on x is a member of the exponential family.

[00:40:48] The second design choice is that eta equals theta transpose x. This is where your x comes into the picture: theta is in R^n and x is also in R^n.
[00:41:12] Now, this n has nothing to do with anything in the exponential family; it's purely the dimension of your data — the x's, your inputs — and it does not show up anywhere else. We make a design choice that eta will be theta transpose x.

[00:41:40] A third assumption concerns test time: given a new x, we want to make an output. Given an x, we get an exponential family distribution, and the mean of that distribution is the prediction we make for that x. This may sound a little abstract, but we're going to make it more clear. What it essentially means is that the hypothesis function is just h of x equals the expected value of y given x. This is our hypothesis function, and we'll see that if you plug in a Gaussian as the exponential family, the hypothesis will be the same Gaussian hypothesis we saw in linear regression; if you plug in a Bernoulli, it turns out to be the same hypothesis we saw in logistic regression; and so on.

[00:42:56] One way to visualize this: there is a model and there is a distribution. The model we assume to be linear — given x, there is a learnable parameter theta, and theta transpose x gives you a parameter. That's the model. And here is the distribution: it is a member of the exponential family, and the parameter for this distribution is the output of the linear model. This is the picture you want to have in your mind. Depending on the data — whether it's a classification problem, a regression problem, or a time-to-event problem — you choose an appropriate b, a, and T based on the distribution of your choice. From this whole thing you can get the expectation of y given theta, which is the same as the expectation of y given theta transpose x, and that is essentially our hypothesis function.

[00:45:12] [Student question] That's exactly right — the question is: are we training theta to predict the parameter of the exponential family distribution whose mean is the prediction we're going to make for y? That's correct. So this is what we do at test time. During train time, how do we train this model? In this model, the parameters we learn by gradient descent are the thetas; we are not learning any of the parameters in the exponential family.
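The model-then-distribution picture can be sketched in a few lines of Python. This is a hedged illustration (the `predict` helper and the family names are my own, not course code): the linear model produces the natural parameter eta = theta^T x, and the hypothesis is the mean of the chosen family evaluated at that eta.

```python
import math

def predict(theta, x, family):
    """GLM test-time prediction: h_theta(x) = E[y | x] = mean of the family at eta."""
    eta = sum(t * xi for t, xi in zip(theta, x))   # linear model -> natural parameter
    if family == "gaussian":        # linear regression: mean is eta itself
        return eta
    if family == "bernoulli":       # logistic regression: mean is sigmoid(eta)
        return 1.0 / (1.0 + math.exp(-eta))
    if family == "poisson":         # Poisson regression: mean is exp(eta)
        return math.exp(eta)
    raise ValueError(f"unsupported family: {family}")

# The same theta and x yield different hypotheses per family choice.
theta, x = [0.5, -0.25], [1.0, 2.0]
print(predict(theta, x, "gaussian"))    # 0.0
print(predict(theta, x, "bernoulli"))   # 0.5
print(predict(theta, x, "poisson"))     # 1.0
```

Only the last step — the mean function — depends on the distribution; the linear model in front is identical for every GLM.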
[00:46:03] We are not learning mu or sigma squared or eta — we are learning theta. Theta is part of the model, not part of the distribution, and the output of the model becomes the distribution's parameter. It's unfortunate that we use the word "parameter" for both, but it's important to understand what is being learned during the training phase and what is not. The natural parameter is the output of a function; it's not a variable we do gradient descent on.

[00:46:33] So during learning, what we do is maximum likelihood, maximized with respect to theta: you do gradient ascent on the log probability of y, where the natural parameter has been reparameterized with a linear model, and you take gradients with respect to theta. That's the big picture of what's happening with GLMs and how they are an extension of exponential families: you reparameterize the natural parameter with a linear model and you get a GLM.

[00:47:39] Let's look in some more detail at what happens at train time. Another incidental benefit of using GLMs: at train time we want to do maximum likelihood on the log probability with respect to theta. At first it may appear that we need to do some more algebra — figure out the expression for p as a function of theta transpose x, take derivatives, and come up with a gradient update rule — but it turns out that no matter what kind of GLM you are doing, no matter which distribution you choose, the learning update rule is the same: theta_j := theta_j + alpha (y^(i) - h_theta(x^(i))) x_j^(i). You've seen this so many times by now, so you can straightaway apply this learning rule without ever having to do any more algebra to figure out what the gradients or the loss are. You go straight to the update rule: plug in the appropriate h_theta(x) depending on your choice of distribution, initialize theta to some random values, and start learning. Any questions on this?

[00:50:34] [Student question] If you want to do batch gradient descent, you just sum over all your examples. [Student question] Yeah, Newton's method is probably the most common one you would use with GLMs, and that comes with the assumption that the dimensionality of your data is not extremely high — as long as the number of features is less than a few thousand, you can do Newton's method. Any other questions? Cool.
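A toy sketch (my own, not course code) of this point: the per-example rule theta_j := theta_j + alpha (y - h_theta(x)) x_j is written once, and only the mean function changes with the family — here it trains a tiny logistic model, and swapping the sigmoid for `math.exp` would give Poisson regression with no other change.

```python
import math

# The shared GLM update rule, implemented once; only mean_fn varies by family.

def sgd_step(theta, x, y, alpha, mean_fn):
    """One stochastic gradient-ascent step on the GLM log likelihood."""
    eta = sum(t * xi for t, xi in zip(theta, x))   # natural parameter
    error = y - mean_fn(eta)                       # y - h_theta(x)
    return [t + alpha * error * xi for t, xi in zip(theta, x)]

sigmoid = lambda eta: 1.0 / (1.0 + math.exp(-eta))  # Bernoulli mean function

# Two linearly separable toy points (first feature is an intercept term).
data = [([1.0, 2.0], 1), ([1.0, -2.0], 0)]
theta = [0.0, 0.0]
for _ in range(500):                # logistic regression via the shared rule
    for x, y in data:
        theta = sgd_step(theta, x, y, alpha=0.1, mean_fn=sigmoid)

# Swapping mean_fn=math.exp here would train a Poisson regression instead.
```

After training, the fitted model assigns high probability to the positive point and low probability to the negative one, without any family-specific gradient derivation.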
[00:51:25] So this is the same update rule for any specific type of GLM, based on your choice of distribution — whether you're doing classification, regression, or, say, Poisson regression, the update rule is the same; you just plug in a different h_theta(x) and you get your learning rule.

[00:51:59] Now some more terminology. Eta is what we call the natural parameter. The function that links the natural parameter to the mean of the distribution — let's call the mean mu — has a name: it's called the canonical response function. Similarly, you can go from mu back to eta with its inverse, which is called the canonical link function. We also already saw that g of eta is equal to the gradient of the log-partition function with respect to eta — that's a side note about g.

[00:53:39] It's also helpful to make explicit the distinction between the three different kinds of parameterizations we have. We have three parameterizations: the model parameters, that's theta; the natural parameter, that's eta; and the canonical parameters — phi for the Bernoulli, mu and sigma squared for the Gaussian, lambda for the Poisson. These are three different ways we can parameterize either the exponential family or the GLM, and whenever we learn a GLM it is only the model parameters that we learn — the theta in the linear model. The connection between theta and eta is linear: theta transpose x gives you the natural parameter; that's the design choice we are making — we choose to reparameterize eta by a linear model, linear in your data. Between the natural and canonical parameters you have g to go one way and g inverse to come back, where g is also the derivative of the log-partition function.

[00:55:35] It's important to realize this, because it can get pretty confusing the first time you see it — there are so many parameters being swapped around and reparameterized. There are three spaces in which we parameterize our generalized linear models: the model parameters, the ones we learn; the output of the model, which is the natural parameter for the exponential family; and, after some algebraic manipulation, the canonical parameters for the distribution we choose depending on the task — whether it's classification or regression.
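For the Bernoulli family these maps are concrete enough to check; a sketch (helper names my own): g, the canonical response function, is the sigmoid taking the natural parameter to the mean, and its inverse, the logit, is the canonical link.

```python
import math

# Canonical response g and canonical link g_inv for the Bernoulli family.

def g(eta):
    """Canonical response: natural parameter -> mean of the Bernoulli."""
    return 1.0 / (1.0 + math.exp(-eta))

def g_inv(mu):
    """Canonical link (logit): mean -> natural parameter."""
    return math.log(mu / (1.0 - mu))

theta, x = [0.5, -1.0], [1.0, 2.0]
eta = sum(t * xi for t, xi in zip(theta, x))  # model params -> natural: theta^T x
mu = g(eta)                                   # natural -> canonical (the mean)

assert abs(g_inv(mu) - eta) < 1e-9            # the two maps are inverses
```

This traces exactly the three parameterizations from the lecture: theta (learned), eta = theta^T x (natural), and mu = g(eta) (canonical).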
[00:56:22] Any questions on this?

[00:56:32] So now you can actually see what happens when you're doing logistic regression. h_theta(x) is the expected value of y conditioned on x, and this is equal to phi, because here the choice of distribution is a Bernoulli and the mean of a Bernoulli distribution is just phi, in the canonical parameter space. And if we write that in terms of theta, it is 1 / (1 + e^(-theta transpose x)) — the logistic function. When we introduced logistic regression, we just pulled the logistic function out of thin air and said, hey, this is something that can squash minus infinity to infinity into the interval between 0 and 1, seems like a good choice. But now we see it is a natural outcome: it just pops out of this more elegant generalized linear model. If you choose the Bernoulli as the distribution of your output, logistic regression pops out naturally.

[00:58:29] Any questions? [Student question] Yeah, the choice of distribution really depends on the task you have. If your task is regression, where you want to output real-valued numbers like the price of a house, you choose a distribution over the real numbers, like a Gaussian. If your task is classification, where your output is binary 0 or 1, you choose a distribution that models binary data. So the task in a way influences which distribution you pick, and most of the time that choice is pretty obvious: if you want to model the number of visitors to a website, which is a count, you want a Poisson distribution, because the Poisson is a distribution over the non-negative integers.
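This "pops out naturally" claim can be verified numerically; a sketch (my own) checking that the Bernoulli pmf matches the exponential-family form p(y; eta) = b(y) exp(eta y - A(eta)) with eta = log(phi / (1 - phi)), so that inverting eta recovers exactly the logistic function.

```python
import math

# Bernoulli in exponential-family form: b(y) = 1,
# eta = log(phi / (1 - phi)), A(eta) = log(1 + e^eta).

phi = 0.3
eta = math.log(phi / (1.0 - phi))            # natural parameter
A = math.log(1.0 + math.exp(eta))            # log-partition function

for y in (0, 1):
    pmf = phi**y * (1.0 - phi)**(1 - y)      # standard Bernoulli pmf
    exp_family = math.exp(eta * y - A)       # exponential-family form
    assert abs(pmf - exp_family) < 1e-12     # the two forms agree

# Inverting eta = log(phi/(1-phi)) gives phi = 1/(1 + e^{-eta}): the sigmoid.
assert abs(1.0 / (1.0 + math.exp(-eta)) - phi) < 1e-12
```

Nothing about the sigmoid is assumed here — it falls out of solving the natural-parameter relation for phi, which is the lecture's point.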
decide you know [00:59:32] integers so the task decide you know pretty much tells you what distribution [00:59:34] pretty much tells you what distribution you want to choose and then you you do [00:59:37] you want to choose and then you you do the you know you do this you know all [00:59:41] the you know you do this you know all you you go through this machinery of [00:59:43] you you go through this machinery of figuring out what are the what what a [00:59:46] figuring out what are the what what a trait of X is and you plug in each state [00:59:48] trait of X is and you plug in each state of X over there and you have your [00:59:51] of X over there and you have your learning role any more questions so it [00:59:58] learning role any more questions so it so we made some assumptions these [01:00:02] so we made some assumptions these assumptions now it's it's also helpful [01:00:08] assumptions now it's it's also helpful to kind of get a visualization of what [01:00:10] to kind of get a visualization of what these assumptions actually mean [01:00:35] so to expand upon your point you know if [01:00:41] so to expand upon your point you know if you think of the question are GLM's used [01:00:43] you think of the question are GLM's used for classification or are they used for [01:00:44] for classification or are they used for regression or are they used for you know [01:00:46] regression or are they used for you know something else [01:00:48] something else the answer really depends on what is the [01:00:50] the answer really depends on what is the choice of distribution that you're going [01:00:52] choice of distribution that you're going to choose you know GLM's are just a [01:00:54] to choose you know GLM's are just a general way to model data and that data [01:00:56] general way to model data and that data could be you know binary it could be [01:00:58] could be you know binary it could be real valued and as long as you have a [01:01:01] real valued and as long as you 
have a distribution that can model that kind of data and that falls in the exponential family, it can just be plugged into a GLM and everything works out nicely. [01:01:19] So, the assumptions that we made. Well, let's start with regression. For regression, we assume there is some x. To simplify, I'm drawing x as one-dimensional, but x could be multi-dimensional. And there exists a theta, and theta transpose x would be some linear hyperplane, and this we assume is eta. [01:02:01] In the case of regression, eta is also mu, so eta equals mu, and then we are assuming that the y for any given x is distributed as a Gaussian with mu as the mean. Which means that for every possible x you have the appropriate eta, and with this as the mean (let's think of this axis as y) there is a Gaussian distribution at every possible x. [01:02:43] We assume a variance of one, so this is like a Gaussian with standard deviation, or variance, equal to one. So for every possible x there is a y given x, which is parameterized with theta transpose x as the mean, and you assume that your data is generated from this process. [01:03:04] So what does that mean? It means: given x, and let's say this axis is y, you would have examples in your training set that may look like this. The assumption here is that for this particular value of x there was a Gaussian distribution with its mean over here, and from this Gaussian distribution this value was sampled. You're just sampling it from the distribution. That is how your data is generated; again, this is our assumption. [01:03:55] Now, based on these assumptions, what we're doing with the GLM is: we start with the data, we don't know anything else, and we make an assumption.
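This data-generating story can be simulated directly. Below is a minimal sketch; the "true" theta, the dimensionality, and the sample size are all invented for illustration. Maximum likelihood under the unit-variance Gaussian assumption reduces to ordinary least squares, which is how theta is recovered at the end:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" parameters: the theta we pretend generated the data.
theta_true = np.array([2.0, -1.0])          # x is 2-dimensional in this sketch
n = 200

X = rng.normal(size=(n, 2))                 # design matrix, one row per example
eta = X @ theta_true                        # natural parameter eta = theta^T x
mu = eta                                    # for the Gaussian case, mu = eta

# The lecture's assumption: y | x ~ N(theta^T x, 1), variance fixed at one.
y = rng.normal(loc=mu, scale=1.0)

# "Working backwards": maximum likelihood under this Gaussian assumption
# is exactly least squares, so lstsq recovers an estimate of theta.
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_hat)                            # close to theta_true
```

With a couple hundred samples the recovered theta lands close to the invented one, which is the "find the line from which these y's were most likely sampled" step described next.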
The assumption is that there is some linear model from which the data was generated in this form, and we want to work backwards [01:04:13] to find the theta that will give us this line. So for a different choice of theta we get a different line. We assume that the line represents the mus, the means of the y's for each particular x from which it was sampled, and we are trying to find the line, which will be your theta transpose x, from which these y's are most likely to have been sampled. That's essentially what's happening when you do maximum likelihood with the GLM. [01:05:06] Similarly, for classification, again let's assume there is an x, and there is some theta transpose x, and this theta transpose x is equal to eta. We run this eta through the sigmoid function, 1 over 1 plus e to the minus eta, to get phi. [01:05:43] So if these are the etas, for each eta we run it through the sigmoid and we get something like this: this end tends to 1, this end tends to 0, and at the point where eta is 0 the sigmoid is 0.5. [01:06:12] And now at each point, at any given choice of x, we have a probability distribution. In this case it's binary, so let's say the probability of y equals 1 is the height to the sigmoid line. So at every x we essentially have a different Bernoulli distribution, where the probability of y equals 1 is the height to the sigmoid, obtained through the natural parameter. And from this you have a data-generating distribution: you have a few x's in your training set, and for those x's you calculate what the y distribution is and sample from it.
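The classification version of the story can be simulated the same way; a minimal sketch, where the theta and the sample size are again invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(eta):
    # 1 / (1 + e^{-eta}): maps the natural parameter to phi in (0, 1)
    return 1.0 / (1.0 + np.exp(-eta))

# Hypothetical theta assumed to have generated the labels.
theta_true = np.array([3.0, -2.0])
n = 1000

X = rng.normal(size=(n, 2))
eta = X @ theta_true              # eta = theta^T x, one real number per example
phi = sigmoid(eta)                # phi = P(y = 1 | x), the height of the sigmoid

# The generative assumption: y | x ~ Bernoulli(phi).
y = rng.binomial(1, phi)

# Sanity check on the shape of the sigmoid described in the lecture:
assert sigmoid(0.0) == 0.5        # at eta = 0 the sigmoid is exactly 0.5
print(y.mean())                   # fraction of positive labels in the sample
```

Logistic regression is then the "work backwards" step: given only `X` and `y`, find the theta under which these labels were most likely.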
[01:07:15] Right, and now again our goal is to start from this data. So over here this axis is x and this is y; these are points for which y is 0, and these are points for which y is 1. Given this data, we want to work backwards to find out what theta was: what is the theta that would have resulted in the sigmoid-like curve from which these y's were most likely to have been sampled? Figuring that out is essentially doing logistic regression. Any questions? [01:07:57] All right, so in the last ten minutes or so we will go over softmax regression. [01:08:30] So, softmax regression. In the lecture notes, softmax regression is explained as yet another member of the GLM family. However, in today's lecture we'll be taking a non-GLM approach and seeing how softmax is essentially doing what's also called cross-entropy minimization. We'll end up with the same formulas and equations; you can go through the GLM interpretation in the notes. It's a little messy to do on the whiteboard, whereas this has a nicer interpretation, and it's good to have this cross-entropy interpretation as well. [01:09:17] So here we are talking about multi-class classification. Let's assume we have three classes of data; let's call them circles, squares, and triangles. [01:09:40] Now, here this is x1 and this is x2; you're just visualizing the input space, and the output space y is implicit in the shape of each point. So in multi-class classification, our goal is to start from this data and learn a model that can, given a new data point, make a prediction of whether that point is a circle, a square, or a triangle. We're just looking at three classes because it's easy to visualize, but this can work over thousands of
classes. And so what we have is: you have x i's in R n, and the labels y are in {0, 1} to the K, where K is the number of classes. [01:11:00] The label y is what you would call a one-hot vector: a vector which indicates which class the x corresponds to. Each element in the vector corresponds to one of the classes, so this may correspond to the triangle class, this to the circle class, this to the square class, and so on. So the labels are in this one-hot representation, where we have a vector that's filled with zeros except for a one in one of the places. [01:11:41] And the way we're going to think of softmax regression is that each class has its own set of parameters. So we have a theta for each class, and there are K such vectors. [01:12:16] In logistic regression we had just one theta, which would do a binary yes-versus-no; in softmax we have one such vector of theta per class. You could also optionally represent them as a matrix, an n-by-K matrix, where each column is the theta of one class. So softmax regression is a generalization of logistic regression where you have a set of parameters per class, and we're going to do something similar to what we did before. [01:13:28] So, corresponding to each class of parameters, there exists a line. There is this line which represents, say, theta triangle transpose x equals zero, and anything to the left will have theta triangle transpose x greater than zero, and over here it will be less than zero. [01:14:13] Similarly, there is also this line, which corresponds to theta square transpose x equals zero;
anything below it will give a value greater than zero, and anything above will be less than zero. Similarly, you have another one: this corresponds to theta circle transpose x equals zero, and in this half-plane it will be greater than zero, and to the left it will be less than zero. [01:14:47] So we have a different set of parameters per class, which hopefully satisfies this property, and now our goal is to take these parameters and see what happens when we feed in a new example. [01:15:11] So given an example x, over here we have the classes: the circle class, the triangle class, the square class. And over here we plot theta class transpose x for each class, so we may get something that looks like this. Let's say for a new point x over here, we would have theta square transpose x be positive, and maybe for the others we may have some negative values, something like this. [01:16:11] This space is also called the logit space. These are real numbers; this is not a value between 0 and 1, this is between minus infinity and plus infinity. And our goal is to get a probability distribution over the classes. In order to do that, we perform a few steps. [01:16:36] First we exponentiate the logits, which gives us e to the theta class transpose x, and this will make everything positive: for the squares, the triangles, and the circles. [01:17:01] Now we've got a set of positive numbers, and next we normalize them. By normalize I mean divide everything by the sum of all of them. So here we have e to the theta class transpose x, divided by the sum over i in {triangle, square, circle} of e to the theta i transpose x. So once we do this operation, we now get a probability distribution.
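These two steps, exponentiate and then normalize, are exactly the softmax function. A minimal sketch, where the parameter matrix and the example point are made up:

```python
import numpy as np

def softmax(logits):
    # Exponentiate (everything becomes positive), then normalize so the
    # values sum to one. Shifting by the max doesn't change the result
    # but avoids overflow for large logits.
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

# Hypothetical per-class parameters: an n-by-K matrix, one column per class
# (here n = 2 features, K = 3 classes: triangle, square, circle).
Theta = np.array([[1.0, -1.0,  0.0],
                  [0.5,  2.0, -1.5]])

x = np.array([0.8, -0.4])          # a new example
logits = Theta.T @ x               # theta_class^T x for each class: real numbers
probs = softmax(logits)            # a probability distribution over the classes

print(probs, probs.sum())          # entries are positive and sum to 1
```

Note that softmax is monotonic in the logits, so the class with the largest theta class transpose x also gets the largest probability.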
The sum of the heights will add up to one. [01:17:55] So if we are given a new point x and we run it through this pipeline, we get a probability output over the classes: for which class is that example most likely to belong? Let's call this whole thing p hat of y for the given x. This is like our hypothesis: the output of the hypothesis function will be this probability distribution. In the other cases, the output of the hypothesis function was generally a scalar, or a single probability; in this case it's outputting a probability distribution over all the classes. [01:18:38] And now the true y would look something like this. Let's say the point over there was a triangle, for whatever reason. If it was a triangle, then the p of y, which is also the label, can be thought of as a probability distribution which is one over the
correct class and zero elsewhere. [01:19:10] So this p of y is essentially representing the one-hot representation as a probability distribution. Now the goal, or the learning approach, is to in a way minimize the distance between these two distributions: this is one distribution, that is another distribution, and we want to change this distribution to look like that distribution. Technically, the term for that is: minimize the cross-entropy between the two distributions. [01:19:55] So the cross-entropy between p and p hat is equal to minus the sum, over y in {triangle, square, circle}, of p of y times log of p hat of y. I don't think we will have time to go over the interpretation of cross-entropy, but here we see that p of y will be 1 for just one of the classes and 0 for the others. So let's say in this example p of y says it's a triangle; then this will essentially boil down to minus log of p hat of triangle.
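That cross-entropy computation is short enough to write out directly; a sketch with made-up numbers, where the true class is taken to be the triangle:

```python
import numpy as np

def cross_entropy(p, p_hat):
    # CE(p, p_hat) = -sum over y of p(y) * log p_hat(y).
    # When p is one-hot, only the correct class survives the sum,
    # so this boils down to -log p_hat(correct class).
    return -np.sum(p * np.log(p_hat))

# Suppose the class order is [triangle, square, circle] and the true label
# is "triangle", encoded as the one-hot distribution p.
p = np.array([1.0, 0.0, 0.0])
p_hat = np.array([0.7, 0.2, 0.1])      # hypothetical model output (softmax probs)

loss = cross_entropy(p, p_hat)
print(loss)                            # equals -log(0.7), about 0.357
```

Pushing p hat of the correct class toward 1 drives this loss toward 0, which is what gradient descent on the parameters does.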
[01:20:57] And we saw that this hypothesis, p hat, is essentially the softmax output. [01:21:24] And on this, you treat this as the loss and do gradient descent with respect to the parameters. [01:21:42] Yeah, with that, any questions on softmax? [01:21:53] Okay, so we will break for today then. Thanks.

================================================================================
LECTURE 005
================================================================================
Lecture 5 - GDA & Naive Bayes | Stanford CS229: Machine Learning Andrew Ng (Autumn 2018)
Source: https://www.youtube.com/watch?v=nt63k3bfXS0
---
Transcript

[00:00:04] Hey, morning everyone. Welcome back. [00:00:07] So last week you heard about logistic regression and generalized linear models, and it turns out all of the learning algorithms we've been learning about so far are called discriminative learning algorithms, which is one big bucket of learning algorithms. Today what I'd like to do is share with you how generative learning algorithms work; in particular, you'll learn about Gaussian discriminant analysis. So by the end of
the day, you'll know how to implement this. And it turns out that, compared to say logistic regression for classification, GDA is actually a simpler and maybe more computationally efficient algorithm to implement in some cases. [00:00:50] It sometimes works better if you have very small datasets, with some caveats, and we'll talk about the comparison between generative learning algorithms, which is the new class of algorithms you'll hear about today, versus the discriminative learning algorithms. And then we'll talk about Naive Bayes and how you could use that to build a spam filter, for example. [00:01:12] Okay, so we'll use binary classification as the motivating example for today. If you have a dataset that looks like this, with two classes, then what a discriminative learning algorithm like logistic regression would do is use gradient descent to search for a line that separates the positive and negative examples. [00:01:34] So if you randomly initialize the parameters, maybe it starts with some decision boundary like that, and over the course of gradient descent the line migrates, or evolves, until you get maybe a line like that, which separates the positive and negative examples. Logistic regression is really searching for a line, searching for a decision boundary, that separates the positive and negative examples. [00:02:00] So if this was the malignant tumors and benign tumors example, that's what logistic regression would do. [00:02:10] Now, there's a different class of algorithms which isn't searching for this separation, which isn't trying to maximize the likelihood the way you saw last week. Here's an alternative, called a generative learning algorithm: rather than looking at the two classes and trying to find a separation, the algorithm is going to look at the classes one at a time. [00:02:30] First we'll look at all of the malignant tumors, in the cancer example, and try to build a model for what malignant tumors look like. You might say, oh, it looks like all the malignant tumors roughly live in that ellipse. And then you look at all the benign tumors in isolation and say, oh, it looks like all the benign tumors roughly live in that ellipse. [00:02:58] Then at classification time, if there's a new patient in your office with those features, it would look at this new patient, compare them to the malignant tumor model, compare them to the benign tumor model, and then say, in this case: oh, this one looks a lot more like the benign tumors I had previously seen, so I'm going to classify that as a benign tumor.
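This "model each class on its own, then see which model a new example matches better" idea can be sketched in a toy one-dimensional version. All numbers here are invented: each class-conditional distribution is modeled as a single Gaussian, and each class is weighted by how often it appears in the training set (its prior probability):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy 1-D data: "benign" features near 1.0, "malignant" features near 3.0.
benign = rng.normal(1.0, 0.5, size=300)
malignant = rng.normal(3.0, 0.5, size=100)

# Model each class in isolation: fit a Gaussian to that class alone...
models = {
    0: (benign.mean(), benign.std()),        # y = 0: benign
    1: (malignant.mean(), malignant.std()),  # y = 1: malignant
}
# ...and record how common each class is (its prior probability).
prior = {0: 300 / 400, 1: 100 / 400}

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def predict(x):
    # Compare the new example against each class model, weighted by the
    # prior, and pick whichever class scores higher.
    scores = {y: gaussian_pdf(x, *models[y]) * prior[y] for y in (0, 1)}
    return max(scores, key=scores.get)

print(predict(0.9), predict(3.2))   # prints: 0 1
```

A feature near 1.0 matches the benign model far better, so it is classified benign; a feature near 3.0 matches the malignant model better.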
[00:03:25] okay so um rather than uh looking at both [00:03:28] um rather than uh looking at both classes simultaneous and searching for a [00:03:30] classes simultaneous and searching for a way to separate them a generative [00:03:33] way to separate them a generative learning algorithm uh instead builds a [00:03:35] learning algorithm uh instead builds a model of what each of the classes looks [00:03:38] model of what each of the classes looks like kind of almost in isolation with [00:03:40] like kind of almost in isolation with some details we'll learn about later and [00:03:42] some details we'll learn about later and then at test time uh it evaluates a new [00:03:45] then at test time uh it evaluates a new example against the benign model [00:03:47] example against the benign model evaluates against the malignant model [00:03:49] evaluates against the malignant model and tries to see which of the two models [00:03:51] and tries to see which of the two models it matches more closely against so let's [00:03:55] it matches more closely against so let's formalize this um a discriminative [00:04:00] formalize this um a discriminative learning [00:04:01] learning algorithm [00:04:03] algorithm learns P of [00:04:07] Y given X right [00:04:11] Y given X right um [00:04:13] um or uh or or it learns [00:04:21] um right some [00:04:24] um right some mapping from X to Y directly you know [00:04:27] mapping from X to Y directly you know learn or you can learn I think on and [00:04:29] learn or you can learn I think on and brief talked about the perception Al we [00:04:31] brief talked about the perception Al we talk about support vect machines later [00:04:33] talk about support vect machines later um we learns the function mapping from X [00:04:35] um we learns the function mapping from X to the labels directly so that's a [00:04:37] to the labels directly so that's a discriminative learning algorithm you're [00:04:38] discriminative learning algorithm you're trying to 
discriminate between positive and negative classes. [00:04:41] In contrast, a generative learning algorithm learns P(x|y). So this says: what are the features like, given the class? Right? So instead of P(y|x), we're going to learn P(x|y). In other words, given that the tumor is malignant, what are the features likely going to be like? Or given that the tumor is benign, what are the features x going to be like? Okay. [00:05:26] And then a generative learning algorithm will also learn P(y). This is also called the class prior, the prior probability, I guess, right? It's called the class prior. It's just: when a patient walks into your office, before you've even examined them, before you've even seen them, what are the odds that their tumor is malignant versus benign, right, before you see any features? [00:05:50] Okay, and so using Bayes' rule, if
you can build a model for P(x|y) and for P(y), um, if you can calculate numbers for both of these quantities, then using Bayes' rule, when you have a new test example with features x, you can calculate the chance of y being equal to one as P(y=1|x) = P(x|y=1) P(y=1) / P(x), right, where P(x) in the denominator is given by P(x|y=1) P(y=1) + P(x|y=0) P(y=0). Okay. [00:06:46] Um, and so if you've learned this term, P(x|y), then you can plug that in here, and if you've also learned this term, P(y), you can plug that in here, right? Um, and P(x) goes in the denominator. Okay, so if you've learned both of those terms, in the red square and in the orange square, you could plug them into all of those terms and therefore use Bayes' rule to calculate P(y=1|x). So given a new patient with features x, you could use this formula to calculate what's the chance that the tumor is malignant, if you've estimated, you know, these two quantities in the red and in the orange circles. Okay.
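As a quick sketch of that Bayes' rule computation (my own illustration, not code from the lecture; the one-dimensional class-conditional densities below are made up):

```python
import numpy as np

def posterior_y1(x, p_x_given_y, p_y1):
    """P(y=1 | x) by Bayes' rule: p(x|y=1) P(y=1) / p(x),
    where p(x) = p(x|y=1) P(y=1) + p(x|y=0) P(y=0)."""
    num = p_x_given_y(x, 1) * p_y1
    p_x = num + p_x_given_y(x, 0) * (1.0 - p_y1)
    return num / p_x

# Toy 1-D class-conditional densities: unit-variance Gaussians at +2 and -2.
def density(x, y):
    mu = 2.0 if y == 1 else -2.0
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

p = posterior_y1(0.0, density, 0.5)  # x equidistant from both means -> 0.5
```

With a 50/50 prior and x exactly between the two class means, the posterior comes out to one half, as you'd expect; move x toward one mean and the posterior follows.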
So that's the framework we'll use to build generative learning algorithms, and in fact today you'll see two examples of generative learning algorithms: one for continuous-valued features, which you can use for things like the tumor classification, and one for discrete features, which you can use for building, like, an email spam filter, right? Or, I don't know, if you want to download Twitter things and see how positive or negative the sentiment on Twitter is, or something, right? So we'll have a natural language processing example later. [00:08:10] So, um, let's talk about Gaussian discriminant analysis, GDA. [00:08:34] Uh, let's develop this model assuming that the features x are continuous-valued, and when we develop generative learning algorithms I'm going to use x in R^n. So, you know, I'm going to drop
the x0 = 1 convention, right? So we're not going to need that extra x0 = 1, so x is now in R^n rather than R^(n+1). And the key assumption in Gaussian discriminant analysis is we're going to assume that P(x|y) is distributed Gaussian, right? In other words, conditioned on the tumor being malignant, the distribution of the features is Gaussian; you know, the features are like the size of the tumor, the cell adhesion, whatever features you use to measure a tumor. And conditioned on it being benign, the distribution is also Gaussian. [00:09:33] So, um, how many of you are familiar with a multivariate Gaussian? Raise your hand if you are. Like half of you? One third? No, two fifths. Okay, cool, all right. Oh, how many of you are familiar with a univariate, like a single-dimensional Gaussian? Okay, cool, almost everyone. All right, cool. So let me go through what is a multivariate Gaussian
distribution. So the Gaussian is this familiar bell-shaped curve, and a multivariate Gaussian is the generalization of this familiar bell-shaped curve over a one-dimensional random variable to multiple random variables at the same time, to vector-valued random variables rather than univariate random variables. So, um, if Z is distributed Gaussian with some mean vector mu and some covariance matrix Sigma, so if Z is in R^n, then mu would be in R^n as well, and Sigma, the covariance matrix, would be n by n. So if Z is two-dimensional, mu is two-dimensional and Sigma is two-by-two. And the expected value of Z is equal to, um, the mean, and the covariance of Z, well, if you're familiar with multivariate covariances, this is the formula, right: Cov(Z) = E[(Z - mu)(Z - mu)^T], which simplifies to E[Z Z^T] - mu mu^T, as shown in the lecture notes. [00:11:01] Sorry. And, uh, following a sometimes semi-standard convention, I'm sometimes going to omit the square brackets, so instead of
writing the expected value of Z, meaning the mean of Z, sometimes I'll just write it as EZ, right, and omit the square brackets to simplify the notation a little bit. Okay. Uh, and the derivation from this step to this step is given in the lecture notes. [00:11:26] And so the probability density function for the Gaussian looks like this: p(z) = 1 / ((2 pi)^(n/2) |Sigma|^(1/2)) * exp(-1/2 (z - mu)^T Sigma^(-1) (z - mu)). And this is one of those formulas that, I don't know, when you're implementing these algorithms you use over and over, but what I've seen for a lot of people is, well, very few people start their machine learning having memorized this formula; you just look it up every time you need it. I've used it so many times I seem to have it stored in my brain by now, but most people don't; when you use it enough, you end up memorizing it. [00:12:01] Uh, but let me show you some pictures of what this looks like, since I think that might be more useful. So the multivariate Gaussian density has two parameters, mu and Sigma.
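That density formula can be transcribed almost directly into NumPy. A minimal sketch (my own illustration, not the lecture's code):

```python
import numpy as np

def mvn_pdf(z, mu, Sigma):
    """Multivariate Gaussian density:
    p(z) = exp(-1/2 (z-mu)^T Sigma^{-1} (z-mu)) / ((2 pi)^{n/2} |Sigma|^{1/2}).
    """
    n = mu.shape[0]
    diff = z - mu
    quad = diff @ np.linalg.solve(Sigma, diff)  # (z-mu)^T Sigma^{-1} (z-mu)
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

# Standard 2-D Gaussian (mean zero, identity covariance) evaluated at its peak:
peak = mvn_pdf(np.zeros(2), np.zeros(2), np.eye(2))  # 1 / (2 pi)
```

Using `np.linalg.solve` instead of explicitly inverting Sigma is the usual numerically safer choice; the peak value at the mean is 1/(2 pi) for the 2-D standard Gaussian.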
These control the mean and the variance of this density, okay? So this is a picture of the Gaussian density; this is a two-dimensional Gaussian bump, and for now I've set the mean parameter to zero. So mu is a two-dimensional parameter, it's (0, 0), which is why this Gaussian bump is centered at zero. [00:12:41] And the covariance matrix Sigma is the identity matrix. So, uh, you have this standard, well, this is also called the standard Gaussian distribution, which means mean zero and covariance equal to the identity. [00:12:59] Now I'm going to take the covariance matrix and shrink it, right? So take the covariance matrix and multiply it by a number less than one; that should shrink the variance, reduce the variability of the distribution. If I do that, the density, um, the probability density function, becomes taller. This is a probability density function that always integrates to
one, right? The area under the curve, you know, is one. And so by reducing the covariance from the identity to 0.6 times the identity, it reduces the spread of the Gaussian density, but it also makes it taller as a result, because, you know, the area under the curve must integrate to one. [00:13:36] Now let's make it fatter: let's make the covariance two times the identity. Then you end up with a wider distribution, where the values of, um, I guess the axes here, this would be the Z1 and the Z2 axes, the two dimensions of the Gaussian density, right, it increases the variance of the density. So let's go back to the standard Gaussian, covariance equal to the identity. Now let's try fiddling around with the off-diagonal entries. Um, so right now the off-diagonal entries are zero, right? So in this Gaussian density the off-diagonal elements are 0. Let's increase that to 0.5 and see what happens. So if you do that, then the
Gaussian density, I hope you can see the change, right, goes from this round shape to this slightly narrower thing. Let's increase it further to 0.8; then the density ends up looking like that, um, where now it's more likely that Z1 and Z2 are positively correlated. Okay. [00:14:36] So let's go through all of these plots, um, but now looking at contours of these Gaussian densities instead of these 3D bumps. So, uh, this is the contours of the Gaussian density when the covariance matrix is the identity matrix, and apologies for the aspect ratio: these are supposed to be perfectly round circles, but the aspect ratio makes this look a little bit fatter; this is supposed to be perfectly round circles. [00:14:58] Um, and so, uh, when the covariance matrix is the identity matrix, you know, Z1 and Z2 are uncorrelated, and the contours of the Gaussian bump, of the Gaussian density, look like round circles. And if you
increase the off-diagonal, excuse me, then it looks like that; you increase it further to 0.8, it looks like that, okay, where now most of the probability mass, most of the probability density function, places value on Z1 and Z2 being positively correlated. Okay. [00:15:33] Um, next let's look at, uh, what happens if we set the off-diagonal elements to negative values, right? So, um, actually, what do you think will happen? Let's set the off-diagonals to negative 0.5, right? Oh wow, I'm seeing people making that head gesture, okay, cool, right, great. Right, so you endow the two random variables with negative correlation, so you end up with, um, this type of, uh, probability density function, right? Uh, and in contours it looks like this, okay, where it's now slanted the other way. So now Z1 and Z2 have a negative correlation, and that's the point, okay. All right. [00:16:14] So, so far we've been keeping the mean vector at zero and just varying the covariance matrix.
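A quick empirical check of what those plots show (my own illustration, assuming NumPy; not from the lecture): sample from a 2-D Gaussian and measure the correlation of Z1 and Z2 for positive and negative off-diagonal entries.

```python
import numpy as np

rng = np.random.default_rng(0)

def sampled_corr(off_diag, n=200_000):
    """Draw n samples from N(0, [[1, off_diag], [off_diag, 1]]) and
    return the empirical correlation between Z1 and Z2."""
    Sigma = np.array([[1.0, off_diag], [off_diag, 1.0]])
    z = rng.multivariate_normal(mean=np.zeros(2), cov=Sigma, size=n)
    return np.corrcoef(z[:, 0], z[:, 1])[0, 1]

pos = sampled_corr(0.8)   # contours slanted one way: strong positive correlation
neg = sampled_corr(-0.5)  # slanted the other way: negative correlation
```

The sign of the off-diagonal entry shows up directly as the sign of the empirical correlation, matching the slant of the contour plots.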
[00:16:20] [Student question.] Yeah, uh, yes, every covariance matrix is symmetric, yeah. [00:16:29] [Student question, partially inaudible.] Uh, should we think of the covariance matrix as interesting column vectors that point in interesting directions? Not really, um, let me think. Yeah, no, I think the covariance matrix is always symmetric, and so I would usually not look at single columns of the covariance matrix in isolation. Uh, when we talk about principal components analysis, we'll talk about the eigenvectors of the covariance matrix, which are the principal directions in which it points, but yeah, we'll get to that later. [00:17:11] Oh, yes, so the eigenvectors of the matrix point in the principal axes of the ellipse, uh, that's defined by the contours? Yeah. Cool. Okay. [00:17:19] Um, so this is a standard Gaussian with mean zero, so the Gaussian bump is centered at (0, 0) because mu is (0, 0). Uh, let's move mu around. So I'm going to move, you know, mu to (0, 1.5), so that moves the
position of the Gaussian density. Right, now let's move it to a different location; move it to (-1.5, -1). And so by varying the value of mu, you can also shift the center of the Gaussian density around. Okay, so I hope this gives you a sense of, um, as you vary the parameters, the mean and the covariance matrix of the 2D Gaussian density, the sorts of probability density functions you can get as a result of changing mu and Sigma. Okay. Um, any other questions about this? All right, cool. So [00:18:50] here is the GDA model, right. Um, and, uh, let's see. So, um, remember, for GDA we need to model P(x|y), right, instead of P(y|x). So I'm going to write this separately in two separate equations: P(x|y=0), so what's the, the density of the features if it's a benign tumor? Um, I'm going to assume it's Gaussian, so I'll just write out the formula for the Gaussian. Okay. [00:19:54] Um, and then similarly I'm going to
I'm going to assume that if it's a malignant tumor so [00:20:00] assume that if it's a malignant tumor so if Y is equal to one that the density of [00:20:03] if Y is equal to one that the density of the features is [00:20:05] the features is also [00:20:09] Gan okay and um I want to point out a [00:20:12] Gan okay and um I want to point out a couple things so the parameters of the [00:20:14] couple things so the parameters of the GDA [00:20:17] model are [00:20:19] model are mu0 [00:20:22] mu1 and sigma um and the reasons we're [00:20:26] mu1 and sigma um and the reasons we're going into a little bit we use the same [00:20:28] going into a little bit we use the same Sigma [00:20:29] Sigma for both [00:20:32] for both classes um but we'll use different means [00:20:35] classes um but we'll use different means zero and one okay uh and we can come [00:20:38] zero and one okay uh and we can come back to this later if you want you could [00:20:41] back to this later if you want you could use separate parameters you know Sigma [00:20:43] use separate parameters you know Sigma 0o and sigma one but that's not usually [00:20:45] 0o and sigma one but that's not usually done so we're going to assume that the [00:20:47] done so we're going to assume that the two gaussians for the positive and [00:20:48] two gaussians for the positive and negative classes have the same [00:20:50] negative classes have the same covariance Matrix but they they have [00:20:51] covariance Matrix but they they have different means uh you don't have to [00:20:53] different means uh you don't have to make this assumption but this is the way [00:20:55] make this assumption but this is the way it's most commonly done and we can talk [00:20:57] it's most commonly done and we can talk about the reason why why we tend to do [00:20:59] about the reason why why we tend to do that in a [00:21:00] that in a second um so this is a model for p of Y [00:21:04] second um so this is a model for p of Y given X the 
other thing we need to do is [00:21:07] given X the other thing we need to do is model P of Y uh so Y is just a newly [00:21:11] model P of Y uh so Y is just a newly random variable right it takes on you [00:21:13] random variable right it takes on you know the value zero or one and so I'm [00:21:16] know the value zero or one and so I'm going to write it like this 5 to the y * [00:21:20] going to write it like this 5 to the y * 1 - 5 to the [00:21:24] 1 - 5 to the 1- y okay um and you saw this kind of [00:21:29] 1- y okay um and you saw this kind of notation when we talked about logistic [00:21:32] notation when we talked about logistic regression but all this means is that um [00:21:35] regression but all this means is that um you know probity of Y be equal to one is [00:21:38] you know probity of Y be equal to one is equal to [00:21:39] equal to five right because Y is either zero or [00:21:41] five right because Y is either zero or one and so um this is the way of writing [00:21:45] one and so um this is the way of writing uh PRI yals 1 is equal to five okay and [00:21:49] uh PRI yals 1 is equal to five okay and you saw a similar exponentiation [00:21:51] you saw a similar exponentiation notation when we're talking about um [00:21:53] notation when we're talking about um logistic rection right one week ago last [00:21:57] logistic rection right one week ago last Monday and so the last parameter is five [00:22:01] Monday and so the last parameter is five so this is RN this is also RN this is r [00:22:07] so this is RN this is also RN this is r n byn and that's just a real number [00:22:10] n byn and that's just a real number between zero and [00:22:12] between zero and one [00:22:25] okay so um for for any let's see so if [00:22:30] okay so um for for any let's see so if you can fit mu0 mu1 Sigma and F to your [00:22:33] you can fit mu0 mu1 Sigma and F to your data then these parameters will [00:22:37] data then these parameters will Define p of x given y and p 
And so if at test time you have a new patient walk into your office and you need to compute this, then you can compute, right, these things in the red and the orange boxes; each of these is a number, and by plugging all these numbers into the formula you get a number out for P(y=1|x), and you can then predict, you know, malignant or benign tumor, right? [00:23:03] So let's talk about how to fit the parameters. So you have a training set. Um, as usual, let me write the training set like this: (x^(i), y^(i)) for i = 1 through m, right? This is the usual training set. Um, and what we're going to do in order to fit these parameters is maximize the joint likelihood. And in particular, um, let me define the likelihood of the parameters to be equal to the product from i = 1 through m of P(x^(i), y^(i)), you know, parameterized by, um, the parameters. Okay. [00:24:15] Um, and I'm just going to
drop the parameters here, right, to simplify the notation a little bit. Okay. And the big difference between, um, a generative learning algorithm like this compared to a discriminative learning algorithm is that the cost function you maximize is this joint likelihood, which is P(x, y), whereas for a discriminative learning algorithm we were maximizing, um, this other thing, the product over i of P(y^(i) | x^(i)), right, uh, which is sometimes also called the conditional likelihood. Okay. So the big difference between these two cost functions is that for logistic regression or linear regression or generalized linear models, um, you were trying to choose parameters theta that maximize P(y|x), but for generative learning algorithms we're going to try to choose parameters that maximize P(x and y), or P(x, y), right? Okay. [00:25:42] So, all right. [00:26:04] So if you use, um, maximum likelihood estimation, right, um, so you choose
the parameters phi, mu0, mu1, and Sigma that maximize the log likelihood, right, where this you define as, you know, the log of the likelihood that we defined out there. Um, and so, uh, we'll actually ask you to do this as a problem in the next homework. But the way you maximize this is: um, look at that formula for the likelihood, take logs, take derivatives of this thing, set the derivatives equal to zero, and then solve for the values of the parameters that maximize this whole thing. And I'll just tell you the answer you're supposed to get, uh, but you still have to do the derivation. [00:27:05] All right. Um, the value of phi that maximizes this is, you know, not that surprising. So, so phi is the estimate of the probability of y being equal to one, right? So what's the chance, when the next patient walks into your doctor's office, that they have a malignant tumor? And so the maximum likelihood estimate for phi is, um, just: of all of your training examples, what's the fraction with label y = 1?
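That estimate, the fraction of positive labels, is a one-liner. A minimal sketch (toy labels made up for illustration, assuming NumPy):

```python
import numpy as np

def fit_phi(y):
    """Maximum-likelihood estimate of phi:
    phi = (1/m) * sum_i 1{y^(i) = 1}, the fraction of examples labeled 1."""
    y = np.asarray(y)
    return np.sum(y == 1) / y.shape[0]

y_train = np.array([1, 0, 0, 1, 0, 0, 0, 1])  # 3 positives out of 8 examples
phi_hat = fit_phi(y_train)                    # 0.375
```

The boolean comparison `y == 1` is exactly the indicator function discussed next; summing it and dividing by m gives the coin-toss-bias estimate.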
estimate for phi [00:27:28] is, [00:27:30] um, just: of all of your training examples, what's the fraction with label [00:27:32] y equals 1? Right, so the maximum likelihood [00:27:35] estimate of the, uh, bias of a coin toss is just, [00:27:38] well, the count of the fraction of heads you [00:27:40] got, okay. So this is it. Um, and one other [00:27:43] way to write this is phi = (1/m) * sum from i = 1 [00:27:47] through m of the [00:27:54] indicator 1{y(i) = 1}, okay. [00:28:02] Right, um, let's see — so, did we talk about indicator [00:28:05] notation on Wednesday? [00:28:11] No? Okay. [00:28:13] Oh, so, um, uh, this notation is an [00:28:16] indicator function, uh, where, um, the indicator [00:28:20] 1{y(i) = 1} is, uh, returning zero or one [00:28:23] depending on whether the thing inside is [00:28:25] true, right. So this is indicator notation, [00:28:27] in which the indicator of a true [00:28:30] statement is equal to one and the indicator [00:28:33] of a false statement is equal to zero. So [00:28:35] that's another way of writing [00:28:37] this formula.
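As a quick aside, the indicator notation and the resulting estimate for phi translate directly into code. A minimal sketch — the labels below are made up for illustration, not from the lecture:

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0])   # hypothetical training labels

m = len(y)
# phi = (1/m) * sum over i of 1{y(i) = 1}: the indicator contributes
# a one for each true statement and a zero for each false one
phi = sum(1 if yi == 1 else 0 for yi in y) / m

# equivalently: the fraction of training examples with label y = 1
assert phi == np.mean(y == 1)
```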
[00:28:41] And then the maximum likelihood estimate for mu0 is this — um, I'll just write it out. [00:29:04] Okay. Um, and so — well, actually, if you, uh, [00:29:08] put aside the math for now, what do you [00:29:10] think is a likely estimate of the mean [00:29:12] of all of the, uh, features for the benign [00:29:15] tumors? Right, well, what you do is you [00:29:16] take all the benign tumors in your [00:29:18] training set and just take their average. [00:29:20] That seems like a very reasonable way: [00:29:22] just look at your training set, [00:29:23] look at all of the, um, [00:29:26] benign tumors — all the O's, I guess — and [00:29:29] then just take the mean of these, and [00:29:31] that, you know, seems like a pretty [00:29:32] reasonable way to estimate mu0, right: [00:29:35] look at all the negative examples [00:29:36] and average their features. So this is a [00:29:38] way of writing out that intuition. Um, so [00:29:41] the denominator is the sum from i equals 1 [00:29:43] through m of the indicator 1{y(i) = 0}, and [00:29:47] so the denominator will count up the [00:29:49] number of examples that have benign [00:29:52] tumors, right, because every time y(i) equals [00:29:54] zero
you get an extra one in this sum, um, [00:29:59] uh, and so the denominator ends up being [00:30:02] the total number of benign tumors in [00:30:05] your training set, okay. And the [00:30:09] numerator is, uh, the sum from i = 1 through m of the indicator that it's a [00:30:12] benign tumor, times [00:30:15] x(i). So the effect of that is, um, whenever [00:30:19] a tumor is benign it's one times the [00:30:23] features, and whenever an example is [00:30:26] malignant it's zero times the features, [00:30:29] and so the numerator is summing up all [00:30:31] the feature vectors for [00:30:34] all of the examples that are benign. [00:30:37] Does that make sense? Let me just write this [00:30:39] up: so the numerator is the sum of the feature [00:30:45] vectors [00:30:48] for, um, all the [00:30:52] examples with y equals zero, and the [00:30:55] denominator is the number of examples [00:31:02] with y equals zero, okay. And then if you [00:31:06] take this ratio — if you take this [00:31:08] fraction — then you're summing up all of [00:31:10] the feature vectors for the benign [00:31:11] tumors, divided by the total number of [00:31:13] benign tumors in the training set, and so [00:31:16] that's just the mean of the feature [00:31:17] vectors
of all of the benign [00:31:21] examples, [00:31:24] okay. Um, [00:31:38] and then, right, the maximum likelihood estimate for mu1 — no [00:31:41] surprises — is kind of what you'd [00:31:43] expect: sum up all of the positive [00:31:45] examples and divide by the total number [00:31:47] of positive examples, and get their mean. [00:31:49] So that's the estimate for mu1. Um, and then [00:31:54] I'll just write this out: [00:31:58] if you're familiar with, um, covariance [00:32:01] matrices, this formula might not surprise [00:32:04] you, but if you're less [00:32:07] familiar, [00:32:08] then I guess you can see the details in [00:32:11] the [00:32:20] homework, okay. Don't worry too much about [00:32:22] that — uh, you can unpack the details in [00:32:24] the lecture notes for the homeworks, okay. Um, [00:32:28] but the covariance matrix basically tries to, [00:32:31] you know, fit contours to the ellipse, [00:32:35] right, like we saw — so try to fit the Gaussians [00:32:38] to both of these, with these [00:32:39] corresponding means, where you want one [00:32:40] covariance matrix for both of these, okay. Um, [00:32:45] so the way [00:32:48] I motivated this was, you know, I said,
well, if you want to estimate the mean of [00:32:52] a coin toss, just count the fraction of [00:32:54] coin tosses that came up heads, uh, and [00:32:56] then it seems like for the means mu0 and [00:32:58] mu1 you should just look at these [00:32:59] examples and take the mean, right. So that [00:33:01] was the intuitive explanation of [00:33:02] how you get these formulas. But the [00:33:05] mathematically sound way to get these [00:33:07] formulas is not via this intuitive [00:33:09] argument that I just gave; it is instead to [00:33:11] look at the likelihood, uh, take logs to get [00:33:14] the log likelihood, take derivatives, set [00:33:16] the derivatives equal to zero, solve for all these [00:33:19] values, and prove more formally that [00:33:21] these are the actual values that [00:33:23] maximize this thing, right — by setting [00:33:25] the derivatives to zero and solving. So you can [00:33:27] see that for yourself, um, in the problem [00:33:30] sets. [00:33:32] Okay.
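Putting the four maximum likelihood estimates together, here is a minimal sketch of the fit. The function name `fit_gda` and the toy data are mine, not from the lecture; the phi, mu0, and mu1 formulas are the ones stated above, and the shared-covariance formula (each example centered on its own class mean) is the standard closed form whose derivation the lecture defers to the homework:

```python
import numpy as np

def fit_gda(X, y):
    """Maximum likelihood estimates for GDA with a shared covariance.

    X : (m, n) array of feature vectors; y : (m,) array of 0/1 labels.
    """
    m = X.shape[0]
    phi = np.mean(y == 1)              # fraction of examples with y = 1
    mu0 = X[y == 0].mean(axis=0)       # average of the y = 0 feature vectors
    mu1 = X[y == 1].mean(axis=0)       # average of the y = 1 feature vectors
    # shared covariance: center each example on its own class mean
    mus = np.where((y == 1)[:, None], mu1, mu0)
    Sigma = (X - mus).T @ (X - mus) / m
    return phi, mu0, mu1, Sigma
```

With a balanced toy set of two benign and two malignant examples, `phi` comes out to 0.5 and `mu0`, `mu1` are just the per-class averages, matching the intuition in the lecture.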
[00:33:43] So, all right. Um, [00:33:46] finally, having fit these parameters: um, if you want to make a [00:33:53] prediction — right, so given a new [00:33:56] patient, uh, how do you make a prediction for whether their tumor is [00:34:00] malignant or benign? Um, [00:34:05] so if you want to predict the most [00:34:07] likely class label, uh, you choose the max [00:34:11] over y of P(y given [00:34:16] x), right. Um, and by Bayes' rule this is the max [00:34:20] over y of P(x given y) P(y) divided [00:34:28] by P(x). Okay, now, um, I want to introduce [00:34:32] one more piece of notation, [00:34:35] which is — [00:34:37] uh, actually, how many [00:34:40] of you are familiar with the argmax [00:34:43] notation? Most of you — like, okay, two-thirds, [00:34:47] okay, cool. I'll go over this quickly. So [00:34:50] um, let's do an [00:34:52] example. So, [00:34:55] um, let's see — [00:35:00] all right. So, you know, the min over [00:35:05] z of (z - 5)^2 is equal to zero, because [00:35:11] the smallest possible value of (z - 5)^2 [00:35:13] is zero, right. And the [00:35:17] argmin over z of (z - 5)^2 [00:35:21] is equal to five, okay. So the min is [00:35:25] the smallest possible value attained by [00:35:27] the thing inside, and the argmin is the [00:35:30] value you need to plug in to achieve [00:35:32] that smallest
possible value, right. So, uh, [00:35:35] the prediction you actually want to make — [00:35:36] if you want to output a value for y, you [00:35:38] don't want to output a probability, right; [00:35:40] you want to say, well, what do I think is the [00:35:41] value of y. So you want to choose the [00:35:43] value of y that maximizes this — so [00:35:45] that's the argmax of this, and this would [00:35:47] be either zero or one, right. Um, so that's [00:35:50] equal to the argmax of that, and you notice [00:35:53] that, uh, this denominator is just a [00:35:55] constant, right — P [00:35:58] of x doesn't even have y appear in it; it's [00:36:00] just some positive number — and so this is [00:36:03] equal [00:36:04] to just the argmax over y of P(x given [00:36:09] y) times P(y). So when implementing, um — uh, [00:36:17] when making predictions with [00:36:18] a, uh, generative learning [00:36:21] algorithm, sometimes, to save on [00:36:22] computation, you don't bother to [00:36:24] calculate the denominator if all you [00:36:26] care about is making a prediction. But [00:36:28] if you actually need a probability, then [00:36:30] you have to normalize the [00:36:37] probability.
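The argmax prediction above can be sketched directly: compare the two unnormalized log posteriors and drop the common denominator P(x), exactly as described. The function names and parameter values are illustrative, not from the lecture:

```python
import numpy as np

def gaussian_log_density(x, mu, Sigma):
    """log N(x; mu, Sigma), up to a constant shared by both classes."""
    d = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (d @ np.linalg.solve(Sigma, d) + logdet)

def predict(x, phi, mu0, mu1, Sigma):
    """argmax over y of P(x|y) P(y): the denominator P(x) is never computed."""
    score0 = gaussian_log_density(x, mu0, Sigma) + np.log(1 - phi)
    score1 = gaussian_log_density(x, mu1, Sigma) + np.log(phi)
    return int(score1 > score0)
```

If an actual probability is needed rather than a label, the two class scores would have to be exponentiated and normalized so they sum to one, as the lecture notes.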
[00:36:42] Okay, [00:36:44] so [00:36:47] let's examine what the algorithm is [00:36:56] doing. All right, so let's look at the [00:36:58] same data set and, uh, compare and [00:37:01] contrast what a discriminative learning [00:37:03] algorithm versus a generative learning [00:37:04] algorithm will do on this data [00:37:07] set, right. [00:37:10] Um, here's an example with two features, x1 and x2, [00:37:14] and positive and negative examples. So [00:37:15] let's start with a discriminative [00:37:17] learning algorithm. Uh, let's say you [00:37:19] initialize the parameters randomly — [00:37:22] typically when you run logistic [00:37:23] regression I almost always initialize [00:37:25] the parameters to zero, but it [00:37:27] was more interesting, for [00:37:29] purposes of visualization, to start off with a random [00:37:31] line, I guess. And then if you run one [00:37:33] iteration of gradient descent on the [00:37:35] conditional likelihood, um, one iteration [00:37:38] of logistic regression moves the line there, [00:37:41] then two iterations, three [00:37:42] iterations, um, four iterations, and so on, and
after about 20 iterations it'll [00:37:49] converge to that pretty decent [00:37:51] decision boundary, okay. So that's [00:37:54] logistic regression really searching for a line that [00:37:56] separates the positive and negative [00:37:58] examples. How about the generative [00:38:00] learning algorithm? What it does is the [00:38:03] following: with, uh, Gaussian [00:38:06] discriminant analysis, what it would do is fit [00:38:10] Gaussians to the positive and negative [00:38:12] examples, right. And, and just one [00:38:15] technical detail: um, I described this as [00:38:17] if we look at the two classes separately; [00:38:20] because we use the same covariance matrix [00:38:22] Sigma for the positive and negative [00:38:23] classes, we actually don't quite look at [00:38:25] them totally separately, but we do fit [00:38:27] two Gaussian densities to the positive [00:38:29] and negative examples. Um, and then what [00:38:32] we do is, for each point, try to decide, uh, [00:38:36] what is its class label, using Bayes' rule, [00:38:38] using that formula. And it turns out that [00:38:41] this implies the following decision
boundary, right. So points to the upper [00:38:45] right of this decision boundary — that [00:38:48] straight line I just drew — are [00:38:50] closer to the negative class, so you end up [00:38:52] classifying them as negative examples, [00:38:54] and points to the lower left of that [00:38:56] line you end up classifying as [00:38:58] positive examples. And, um, uh, I've also [00:39:02] drawn in green here the decision [00:39:04] boundary for logistic regression. So, so, [00:39:06] so these two algorithms actually come up [00:39:08] with slightly different decision [00:39:10] boundaries, okay, but the way you arrive [00:39:13] at these two decision boundaries is a [00:39:14] little bit [00:39:17] different. So, [00:39:21] um, all right, let's go back to the — any [00:39:26] questions about this? Yeah? [00:39:41] Oh sure, yes, good question. So, um, why, [00:39:43] why do we use two separate means, mu0 and [00:39:46] mu1, and a single covariance matrix Sigma? Um, it [00:39:50] turns out that, um — uh, well, it turns out [00:39:53] that if you choose to build the model [00:39:55] this way, the decision boundary ends up [00:39:57] being linear, and so for a lot of
problems, if you want a linear decision [00:40:01] boundary — uh, uh, yeah. And it turns out [00:40:04] you could choose to use two separate, um, [00:40:07] covariance matrices, Sigma 0 and Sigma 1, and [00:40:10] that'll actually work okay, right — that [00:40:12] is actually a very reasonable thing to do as [00:40:13] well — but, uh, you roughly double the number of [00:40:15] parameters, and you end up with a [00:40:19] decision boundary that isn't linear [00:40:20] anymore. But it's actually not unreasonable to [00:40:23] do that as [00:40:25] well. Um, [00:40:28] now there's [00:40:52] one very interesting [00:40:58] property, um, about Gaussian discriminant analysis, and it turns out that — [00:41:01] uh, well, let's, let's [00:41:09] compare GDA to logistic [00:41:14] regression. And [00:41:16] um, for a fixed set of [00:41:24] parameters — right, so let's say you've [00:41:27] learned some set of [00:41:29] parameters — um, I'm going to do an [00:41:32] exercise where we're going to [00:41:38] plot P(y = 1 given [00:41:42] x), you know, parameterized by all these [00:41:47] things, right, as a function of [00:41:53] x, okay. Um, so I'm going to do this little [00:41:56] exercise in a second, but what this means is [00:41:59] the following.
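The answer above — a shared covariance matrix makes the decision boundary linear — can be checked numerically: with a single Sigma, the quadratic term in the log-odds cancels between the two classes, leaving something linear in x. A sketch with made-up parameters (the specific numbers are mine, not from the lecture):

```python
import numpy as np

# illustrative 2-D parameters
mu0, mu1 = np.array([0., 0.]), np.array([2., 1.])
Sigma = np.array([[1.0, 0.2], [0.2, 1.0]])   # one shared covariance
phi = 0.5

def log_odds(x, S0, S1):
    """log P(y=1|x) - log P(y=0|x), computed from the two Gaussian densities."""
    def logN(x, mu, S):
        d = x - mu
        _, logdet = np.linalg.slogdet(S)
        return -0.5 * (d @ np.linalg.solve(S, d) + logdet)
    return logN(x, mu1, S1) - logN(x, mu0, S0) + np.log(phi / (1 - phi))

# With S0 = S1 = Sigma, the x^T Sigma^{-1} x terms cancel, so the
# log-odds collapses to the linear form theta^T x + theta0:
theta = np.linalg.solve(Sigma, mu1 - mu0)
theta0 = (-0.5 * (mu1 @ np.linalg.solve(Sigma, mu1)
                  - mu0 @ np.linalg.solve(Sigma, mu0))
          + np.log(phi / (1 - phi)))
x = np.array([0.7, -0.3])
assert np.isclose(log_odds(x, Sigma, Sigma), theta @ x + theta0)
```

With two different covariance matrices the quadratic terms no longer cancel, which is why that variant gives a non-linear (quadratic) boundary at roughly double the parameter count.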
well this formula this is equal to P of x given y equals [00:42:07] of x given y equals one you know which is parameterized by [00:42:11] one you know which is parameterized by right well the various parameters time P [00:42:14] right well the various parameters time P of Y = 1 parameterized by 5 divided by P [00:42:19] of Y = 1 parameterized by 5 divided by P of X which depends on all the paramas I [00:42:21] of X which depends on all the paramas I guess [00:42:28] right so uh by base rule you know this [00:42:32] right so uh by base rule you know this formula is equal to this little thing [00:42:36] formula is equal to this little thing and uh just as we saw earlier I guess [00:42:39] and uh just as we saw earlier I guess right once you have fixed all the [00:42:41] right once you have fixed all the parameters that's just a number you [00:42:43] parameters that's just a number you compute by evaluating a gan [00:42:45] compute by evaluating a gan density [00:42:48] density um this is a b newly probability so [00:42:50] um this is a b newly probability so actually P of yal 1 parameterized by [00:42:52] actually P of yal 1 parameterized by five this is just equal to five is that [00:42:54] five this is just equal to five is that second term and you similarly calculate [00:42:56] second term and you similarly calculate the denominator but so for every value [00:42:58] the denominator but so for every value of x you can compute this ratio and thus [00:43:02] of x you can compute this ratio and thus get a number for the chance of yal to [00:43:04] get a number for the chance of yal to one given [00:43:07] one given X so I'm going go [00:43:11] through one example of uh what function [00:43:15] through one example of uh what function you get for p of yals 1 given X for what [00:43:19] you get for p of yals 1 given X for what function you get for this if you [00:43:20] function you get for this if you actually plot this for um different [00:43:24] actually plot this for 
values of x, okay. [00:43:28] So, [00:43:30] um, let's see — let's say you have just [00:43:32] one feature x, so x is, you know, a real number, [00:43:36] uh, and let's say that you have a few [00:43:40] negative examples there and a few [00:43:43] positive examples [00:43:45] there, right. So a simple data [00:43:49] set, okay. And let's see what Gaussian discriminant [00:43:53] analysis will do on this data set, um, [00:43:56] with just one feature — so that's why all [00:43:58] the data is positioned on a [00:43:59] 1D axis. [00:44:05] So let me map all this data onto an [00:44:12] x-axis — I just took this data and mapped [00:44:14] it down — and, um, if you fit a Gaussian to each [00:44:18] of these two data sets, then you end up [00:44:22] with, you [00:44:23] know, Gaussians as follows, where this bump on the [00:44:26] left is P(x given y = 0) and this bump [00:44:30] on the right is P(x [00:44:33] given y = 1), right. And, and again, [00:44:38] there's a technical detail that we set [00:44:40] the same variance for the two Gaussians, but [00:44:42] you know, you kind of model the Gaussian density — [00:44:44] what does class zero look like, what [00:44:46] does class one look like — with two Gaussian [00:44:48] bumps like this. Oh, and then, because the
data set is split 50/50, you know, P(y = [00:44:54] 1) is 0.5, right — so a one-half prior, [00:44:58] okay. Now let's go through that exercise [00:45:01] I described on the left, of trying to [00:45:03] plot P(y = 1 given x) for different [00:45:08] values of x. So the vertical axis here is [00:45:10] P(y = 1 given x) for different values of x. [00:45:15] So, um, let's pick a point far to the left [00:45:19] here, right. With this model, if, if you [00:45:23] actually calculate this ratio, you find [00:45:25] that, um, if you have a point here, it [00:45:28] almost certainly came from this Gaussian on [00:45:31] the left, right. If, if, if you have an [00:45:33] unlabeled example here, it almost certainly [00:45:35] came from the class-zero Gaussian, because the [00:45:39] chance of the other Gaussian generating an example [00:45:41] all the way to the left is almost zero, [00:45:43] right, and so the chance P(y = [00:45:45] 1 given x) is very small. So for a [00:45:48] point like that, you end up with a value, [00:45:50] you know, very close to [00:45:51] zero, right. Um, let's pick another point — [00:45:55] all right, how about this point, the midpoint?
[00:45:57] Well, if you get an example right [00:45:58] in the midpoint, you, you really have no [00:46:00] idea — you really can't tell: did this come [00:46:02] from the negative or the positive [00:46:03] Gaussian? Can't tell, right. So this is [00:46:05] really 50/50. So, I guess, if this is 0.5, [00:46:09] for that midpoint you would have P(y = [00:46:13] 1 given x) is [00:46:15] 0.5. Um, and then if you go to a point way to [00:46:18] the right — if you get an example way over here — [00:46:20] then you'll be pretty sure this came [00:46:21] from the positive examples, and so, you [00:46:24] know, you get a point like that, [00:46:27] right. Now, it turns out that if you [00:46:30] repeat this exercise, uh, sweeping from [00:46:33] left to right over many, many points on [00:46:35] the x-axis, you find that for points far [00:46:38] to the left, the chance of this coming [00:46:42] from, uh, the y = 1 class is very small, and [00:46:45] as you approach this [00:46:47] midpoint it increases to 0.5, and it [00:46:50] surpasses 0.5, and then beyond a certain [00:46:54] point it becomes very, very close to one, [00:46:58] right. And if you do
this exercise — and [00:46:59] actually, just for every point, you know, [00:47:01] for a dense grid on the x-axis, evaluate [00:47:04] this formula, which will give you a [00:47:06] number between zero and one — it's a [00:47:08] probability — and go ahead and plot, you [00:47:10] know, the values, you get a curve like [00:47:12] this. And it turns out that if you [00:47:14] connect up the dots, um, then this is [00:47:18] exactly a sigmoid function. The shape of [00:47:21] that turns out to be exactly the shape of a [00:47:23] sigmoid function, and you prove this in [00:47:25] the problem sets as well, [00:47:28] right. [00:47:31] Um, [00:47:37] so, [00:47:40] um, both logistic regression and Gaussian [00:47:44] discriminant analysis actually end up using a [00:47:48] sigmoid function to calculate, you know, P [00:47:51] (y = 1 given x) — or, or, or the [00:47:53] outcome ends up being a sigmoid function; [00:47:54] I guess the mechanics is that you actually [00:47:56] use this calculation rather than [00:47:59] computing a sigmoid function, right. But, um, the [00:48:02] specific choice of the parameters they [00:48:04] end up choosing is quite different.
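The claim that the curve traced out is exactly a sigmoid (proved in the problem sets) can be verified numerically in the 1-D setting above: with a shared variance, the Bayes'-rule posterior matches a logistic function of x term for term. The parameter values below are made up for illustration:

```python
import numpy as np

# 1-D sketch: two Gaussians with a shared variance and a 50/50 prior
mu0, mu1, sigma2, phi = -1.0, 1.0, 1.0, 0.5

def posterior(x):
    """p(y = 1 | x) computed the generative way, via Bayes' rule."""
    p_x_y0 = np.exp(-(x - mu0) ** 2 / (2 * sigma2))   # shared constants cancel
    p_x_y1 = np.exp(-(x - mu1) ** 2 / (2 * sigma2))
    return p_x_y1 * phi / (p_x_y1 * phi + p_x_y0 * (1 - phi))

# With a shared variance, the same posterior is exactly logistic in x:
# p(y=1|x) = 1 / (1 + exp(-(theta1 * x + theta0)))
theta1 = (mu1 - mu0) / sigma2
theta0 = (mu0 ** 2 - mu1 ** 2) / (2 * sigma2) + np.log(phi / (1 - phi))

def sigmoid_form(x):
    return 1.0 / (1.0 + np.exp(-(theta1 * x + theta0)))
```

Sweeping a dense grid of x values, `posterior` and `sigmoid_form` agree everywhere, which is the picture the lecture draws on the board: the two approaches produce the same functional form but pick their parameters differently.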
[00:48:06] And you saw, when I was projecting the results [00:48:07] on the display just now in [00:48:10] PowerPoint, uh, that the two algorithms [00:48:12] actually come up with two different [00:48:14] decision boundaries. So, um, let's discuss when a [00:48:18] generative algorithm like GDA is [00:48:20] superior and when a discriminative [00:48:22] algorithm like logistic regression is [00:48:24] superior. [00:48:28] Um, let's [00:48:47] see. All [00:48:49] right, so GDA, Gaussian discriminant [00:48:53] analysis — so the generative approach — [00:48:58] this assumes that x given y = 0 [00:49:02] is Gaussian with mean mu0 and covariance [00:49:06] Sigma; it assumes x given y = 1 is [00:49:09] Gaussian with mean mu1 and covariance Sigma; and y [00:49:13] is [00:49:14] Bernoulli with, [00:49:18] um, parameter phi, right. And what logistic [00:49:22] regression does — [00:49:28] this is the discriminative [00:49:34] algorithm. Oh, some strange wind at the [00:49:38] back — is it? I see, okay, cool. All right, [00:49:43] yeah. [00:49:44] Boy — no, there was just a scary UN report on [00:49:48] global warming over the weekend; I hope [00:49:49] we don't already have storms [00:49:52] here. Okay, it's okay. Did you guys see the [00:49:55] UN report? It's slightly scary,
[00:49:57] actually, with the year it gives for global warming, but hopefully... all right, good, the hurricane has stopped. [00:50:08] Okay, let's see. So what logistic regression assumes is that p(y = 1 | x) is governed by the logistic function, right, so 1 / (1 + e^(-theta^T x)), with some details about x_0 = 1 and so on. [00:50:34] So, in other words, let's assume that p(y = 1 | x) is logistic. [00:50:45] And the argument I just described, plotting p(y = 1 | x) point by point to get the sigmoid curve I drew on the other board: what that illustrates, and it doesn't prove it, you prove it yourself in the homework problem, but what that illustrates is that this set of assumptions implies that p(y = 1 | x) is governed by a logistic function. [00:51:13] But
it turns out that the implication in the opposite direction is not true. So if you assume that p(y = 1 | x) is governed by a logistic function, by this shape, that does not in any way, shape, or form imply that x given y is Gaussian, that x given y = 0 is Gaussian and x given y = 1 is Gaussian. [00:51:38] So what this means is that GDA, the generative learning algorithm in this case, makes a stronger set of assumptions, and logistic regression makes a weaker set of assumptions, because you can prove the logistic regression assumption from the GDA assumptions. [00:52:08] And by the way, what you see in a lot of learning algorithms is that if you make stronger modeling assumptions, and your modeling assumptions are roughly correct, then your model will do better, because you're telling the algorithm more information. So if indeed x given y is Gaussian, then GDA will do better, because you're telling the algorithm that x given y
is Gaussian, and so it can be more efficient. So even if you have a very small data set, if these assumptions are roughly correct, then GDA will do better. And the problem with GDA is that if these assumptions turn out to be wrong, so if x given y is not at all Gaussian, then this might be a very bad set of assumptions to make: you might be trying to fit a Gaussian density to data that is not at all Gaussian, and then GDA would do more poorly. [00:53:03] Okay, so here's one fun fact, here's another example; I'll get to your question in a second. Let's say the following are true: x given y = 1 is Poisson with parameter lambda_1, x given y = 0 is Poisson with parameter lambda_0, and y, as before, is Bernoulli with parameter phi. [00:53:37] It turns out that this set of assumptions also implies that p(y = 1 | x) is logistic, okay, and you can prove this.
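The Poisson fun fact is easy to check numerically; the rates and prior below are made up for illustration:

```python
import numpy as np
from math import lgamma

# Sketch of the fun fact above, with made-up parameters: if
# x|y=1 ~ Poisson(lam1), x|y=0 ~ Poisson(lam0), and y ~ Bernoulli(phi),
# then p(y=1 | x) is a logistic function of the count x.
lam0, lam1, phi = 2.0, 5.0, 0.5

def poisson_pmf(ks, lam):
    """Poisson probabilities lam^k e^{-lam} / k! for an array of counts k."""
    log_fact = np.array([lgamma(k + 1) for k in ks])
    return np.exp(ks * np.log(lam) - lam - log_fact)

ks = np.arange(0, 40)
joint1 = poisson_pmf(ks, lam1) * phi
joint0 = poisson_pmf(ks, lam0) * (1 - phi)
posterior = joint1 / (joint0 + joint1)     # p(y=1 | x=k) by Bayes' rule

# The log-odds are linear in k: k*log(lam1/lam0) + (lam0 - lam1) + log(phi/(1-phi)),
# so the posterior is exactly logistic in the count k.
z = ks * np.log(lam1 / lam0) + (lam0 - lam1) + np.log(phi / (1 - phi))
assert np.allclose(posterior, 1 / (1 + np.exp(-z)))
```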
[00:53:54] And this is actually true for any generalized linear model, actually, where the difference between the two distributions varies only in the natural parameter, in the generalized-linear-model sense, of the exponential family distribution. [00:54:08] And so what this means is that if you don't know whether your data is Gaussian or Poisson, if you're using logistic regression you don't need to worry about it; it will work fine either way. So maybe you're fitting some model to some data, and you don't know: is the data Gaussian, is it Poisson, is it some other exponential family model? Maybe you just don't know. But if you're fitting logistic regression, it'll do fine under all of those scenarios. But if your data was actually Poisson and you assumed it was Gaussian, then your model might do quite poorly. [00:54:45] Okay, so the key high-level
principles to take away from this are: if you make weaker assumptions, as in logistic regression, then your algorithm will be more robust to modeling assumptions, such as accidentally assuming the data is Gaussian when it is not. But on the flip side, if you have a very small data set, then using a model that makes more assumptions will actually allow you to do better, because by making more assumptions you're telling the algorithm more truth about the world, which is, you know: hey, algorithm, the world is Gaussian. And if it is Gaussian, then it will actually do better. [00:55:26] Okay, question at the back, or a few questions. Go ahead. [00:55:38] Oh, uh, yeah: in practice, what fraction of data has the Gaussian property? You know, it's a matter of degree, right; most data in this universe is Gaussian, except for some data, I guess. [00:55:56] So I think it's actually a matter of degree, right: if you plot,
actually, if you take continuous-valued data, now, there are exceptions, you could plot it, and most data that you plot will not really be Gaussian, but a lot of it you can convince yourself is vaguely Gaussian. So I think a lot of it is a matter of degree. I'll actually tell you the way I choose to use these two algorithms. I think the whole world has moved toward using bigger data sets, right, the digital society, which means a lot of data, and so for a lot of problems we have a lot of data, and I would probably use logistic regression, because with more data you can overcome telling the algorithm less about the world. So the algorithm has two sources of knowledge: one source of knowledge is what you told it, the assumptions you told it to make, and the second source of knowledge is what it learns from the data. And in this era of big data, we have a lot of data, you
know, [00:56:49] there is a strong trend toward using logistic regression, which makes fewer assumptions, and just letting the algorithm figure out whatever it wants to figure out from the data. [00:56:56] Now, one practical reason why I still use algorithms like GDA, Gaussian discriminant analysis, is that it's actually quite computationally efficient. There's actually one use case that Landing AI has been working on where we just need to fit a ton of models and don't have the patience to run logistic regression over and over, and it turns out that computing means and covariance matrices is very efficient. So this is a benefit apart from the assumptions type of benefit, which is a general philosophical point about strong versus weak assumptions, a general principle in machine learning that we'll see again later in this course.
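That computational point, that fitting GDA amounts to a class frequency, two means, and one covariance computation, can be sketched as follows (the data here is synthetic, invented purely for illustration):

```python
import numpy as np

# Minimal sketch of why GDA is cheap to fit: the maximum-likelihood estimates
# are closed-form, with no iterative optimization at all.
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[-1.0, 0.0], scale=1.0, size=(200, 2))   # class y=0 samples
X1 = rng.normal(loc=[2.0, 1.0], scale=1.0, size=(300, 2))    # class y=1 samples
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 300)

phi = y.mean()                           # Bernoulli parameter, p(y=1)
mu0 = X[y == 0].mean(axis=0)             # mean of class 0
mu1 = X[y == 1].mean(axis=0)             # mean of class 1
centered = X - np.where((y == 1)[:, None], mu1, mu0)
Sigma = centered.T @ centered / len(y)   # shared covariance matrix

# That's it: one pass over the data, no gradient descent, no Newton's method.
```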
[00:57:34] We'll see it again in other places, but it's very concrete here: the other reason I tend to use GDA these days is less that I think it performs better from an accuracy point of view, but that there's actually a very efficient algorithm; you just compute the means and covariance and you're done, and there's no iterative process needed. So these days, when I use these models, it's motivated more by computation and less by performance. But this general principle is one that we'll come back to again later as we develop more sophisticated learning algorithms. [00:58:03] Yeah? [00:58:13] Oh, right: so what happens if the covariance matrices are different? It turns out that it still ends up being a logistic function, but with a bunch of quadratic terms in the logistic function, so it's not a linear decision boundary anymore. You can end up with a decision boundary, you know, that looks like this, right: positive
and negative examples separated by some other shape of line. You could actually, if you're curious, I encourage you to, you know, fire up Python and NumPy and play around with the parameters and plot this for yourself. [00:58:54] Question? Yeah: is there a recommended statistical test to see if the data is Gaussian? Um, I can tell you what's done in practice. I think in practice, if you have enough data to do a statistical test and gain conviction, you probably have enough data to just use logistic regression. Well, no, that's not really fair; I don't know about very high-dimensional data. I think what happens more often is that people just plot the data, and if it looks clearly non-Gaussian, then that would be a reason to not use GDA. But what happens often is that sometimes you just have a very small training set, and then it's just a matter of judgment, right?
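Going back to the different-covariances question for a moment, the NumPy experiment Ng suggests might look like this sketch (all parameters invented for illustration):

```python
import numpy as np

# Sketch: when the two class covariances differ, the log-odds of p(y=1 | x)
# pick up quadratic terms, so the decision boundary is curved, not a line.
mu0, mu1 = np.array([-1.0, 0.0]), np.array([1.0, 0.0])
S0 = np.eye(2)                                   # covariance for class 0
S1 = np.array([[3.0, 0.0], [0.0, 0.5]])          # a different covariance for class 1

def log_gauss(x, mu, S):
    """Log-density of N(mu, S) at x, dropping the shared 2*pi constant."""
    d = x - mu
    return -0.5 * d @ np.linalg.solve(S, d) - 0.5 * np.log(np.linalg.det(S))

def log_odds(x, S1_=S1):
    """log p(y=1|x) / p(y=0|x) with equal class priors (phi = 1/2)."""
    return log_gauss(x, mu1, S1_) - log_gauss(x, mu0, S0)

# A linear function has zero second differences along any straight line through
# input space; with different covariances the log-odds do not, so the boundary
# (the zero set of the log-odds) cannot be a straight line.
ts = np.linspace(-3.0, 3.0, 7)
vals = np.array([log_odds(np.array([t, t])) for t in ts])
assert not np.allclose(np.diff(vals, 2), 0)      # quadratic terms present

# With a shared covariance (S1_ = S0) the second differences vanish, i.e. the
# familiar linear boundary from the lecture.
shared = np.array([log_odds(np.array([t, t]), S1_=S0) for t in ts])
assert np.allclose(np.diff(shared, 2), 0)
```

Evaluating `log_odds` on a 2-D grid and contouring its zero level set shows the curved boundary directly.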
Like, if you have, I don't know, 50 examples of healthcare records, then you just have to ask some doctors: well, do you think the distribution is relatively Gaussian? And use domain knowledge like that. [00:59:47] I think, by the way, another philosophical point: I think that the machine learning world has, you know, a little bit overhyped big data. And yes, it's true that when you have more data it's great; I love data, and having more data pretty much never hurts, and usually the more data the better. So all that is true, and I think we did a good job telling people that high-level message, that more data almost always helps. But I think a lot of the skill in machine learning these days is getting your algorithms to work even when you don't have a million examples, even when you don't have 100 million examples. So there are lots of
machine learning applications where you just don't have a million examples; you have 100 examples, and then your skill in designing the learning algorithm matters much more. So if you take something like ImageNet, a million images, there are now dozens of teams, maybe hundreds of teams, I don't know, that can get great results when you have a million examples, right. And so the performance difference between teams, you know, there are now dozens of teams that get great performance with a million examples for image classification, like ImageNet. But if you have only 100 examples, then the highly skilled teams will actually do much, much better than the less skilled teams, whereas the performance gap is smaller when you have giant data sets, I think. And I think it is these types of intuitions, you know, what assumptions you use, generative or discriminative, that actually distinguish the highly skilled
teams from the less experienced teams, and drive a lot of the performance differences when you have small data. [01:01:17] Oh, and if someone goes to you and says, oh, you only have 100 examples, you can never do anything, then, I don't know, if it's a competitor saying that, I'll say, great, you know, don't do it, because I can make it work. Well, I don't know. But I think there are a lot of applications where your skill at designing a machine learning system really makes a bigger difference. It makes a difference for big data and for small data, but it's just very clear, when you don't have much data, that the assumptions you code into the algorithm, like is it Gaussian, is it Poisson, that skill allows you to drive much bigger performance than a lower-skill team would be able to. [01:01:53] All right, let me just take a question. Go
[01:02:09] ahead. Oh, sure, so what's the general statement of this? Yes: if x given y = 1 comes from an exponential family distribution, and x given y = 0 comes from an exponential family distribution, the same exponential family distribution, and if they vary only by the natural parameter of the exponential family distribution, then p(y = 1 | x) will be logistic. Yeah, I think this was once a midterm or homework problem to prove, actually. All right, let me actually just take one last question, and then we'll move on. Go ahead. [01:02:44] Oh, does the performance improvement hold even as you increase the number of classes? [01:02:52] Uh, I think so, yes. The generalization of this would be softmax regression, which I didn't talk about, but yes, I think a similar thing holds true for GDA with multiple classes. So far we've only talked about binary classification; what if we have more than two
classes? But, yes, similar things hold true for, like, a GDA with three classes and softmax. Yeah, oh yes, right, you saw softmax the other day. [01:03:20] Cool. And this theme, that when you have less data the algorithm needs to rely more on the assumptions you code in, this is a recurring theme that we'll come back to as well. This is one of the important principles of machine learning: when you have less data, your skill at coding in your knowledge matters much more. This is a theme we'll come back to when we talk about much more complicated learning algorithms as well. [01:03:49] All right, I want a fresh board for this. [01:04:00] So you've seen GDA in the context of continuous-valued features x. The last thing I want to do today is talk about one more generative learning algorithm called naive Bayes, and I'm going to use email spam classification as a motivating example. But this, I guess, is our
first foray into natural language processing, right. Given a piece of text, like a piece of email, can you classify it as spam or not spam? Or, for other examples, actually, several years ago eBay had this problem: if someone's trying to sell something, they write a text description, right, hey, I have a secondhand item, you know, that I'm trying to sell on eBay. How do you take that text description someone wrote and categorize it: is it an electronics thing, are they trying to sell a TV, are they trying to sell clothing? These examples are text classification problems: you have a piece of text, and you want to classify it into one of two categories, spam or non-spam, or into one of maybe thousands of categories if you're trying to take a product description and classify it into one of the classes. [01:05:07] And so the first question we will have is, given an email
classification problem, how do you represent an email as a feature vector? [01:05:27] And so in naive Bayes, what we're going to do is take your email, a piece of email, and first map it to a feature vector x, and we'll do so as follows. First, let's start with the English dictionary and make a list of all the words in the English dictionary, right. So the first word in the English dictionary is "a", the second word is "aardvark", the third word is "aardwolf"; look it up. And then, you know, in email spam a lot of people ask you to buy stuff, so there would be "buy". And then the last word in my dictionary is "zymurgy", which is the technical chemistry term that refers to the fermentation process in brewing. [01:06:23] So again, this is a useful way to think about it, but in practice what you do is not actually look at
[01:06:29] the dictionary, but look at the top 10,000 words in your training set, right. So maybe you have 10,000; it's easier to think about it as if it were a dictionary, but in practice the dictionary has too many words, so the other way to do it is to look through your own email corpus and just find the top 10,000 occurring words, and use that as the feature set. And so, you know, in your emails I guess you're getting a bunch of email from us, or maybe others, about cs229, so cs229 might appear in your dictionary if you're building an email spam filter for yourself, even if it doesn't appear in the official, what is it, like the Oxford dictionary, just yet. One way or another we'll get CS229 in there someday. [01:07:07] All right, and so given an email, what we would like to do is take this piece of text and represent it as this feature vector,
[01:07:21] So one way to do this is to create a binary feature vector that puts a one if a word appears in the email and puts a zero if it doesn't. So if you get an email that, you know, asks you to buy some stuff, and the word "a" appears in the email, you put a one there; they're not trying to sell you an aardvark or an aardwolf, so zeros there; a one for "buy"; and so on. So you take an email and turn it into a binary feature vector. [01:07:56] And so here the feature vector lives in {0, 1}^n, because it's an n-dimensional binary feature vector, where for the purpose of illustration let's say n is 10,000, because you take the top 10,000 words that appear in your email training set as the dictionary that you will use. [01:08:29] So in other words, x_i is the indicator 1{word i appears in the email}; it's either zero or one depending on whether or not word i from this list appears in your email.
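As a concrete sketch of this featurization (the helper names and the tiny corpus here are made up for illustration; the lecture only describes the idea):

```python
import re
from collections import Counter

def build_vocabulary(emails, n=10000):
    """Take the top-n most frequently occurring words across the
    training corpus as the dictionary, as described in the lecture."""
    counts = Counter()
    for text in emails:
        counts.update(re.findall(r"[a-z0-9]+", text.lower()))
    return [word for word, _ in counts.most_common(n)]

def featurize(email, vocabulary):
    """Binary feature vector: x_i = 1 if word i appears in the email."""
    words = set(re.findall(r"[a-z0-9]+", email.lower()))
    return [1 if word in words else 0 for word in vocabulary]

# Tiny illustration; a real corpus would have thousands of emails.
corpus = ["buy cheap meds now", "project meeting at noon", "buy now and save"]
vocab = build_vocabulary(corpus, n=5)
x = featurize("please buy now", vocab)   # ones for "buy" and "now"
```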
[01:08:49] Now, in the Naive Bayes algorithm, we're going to build a generative learning algorithm, and so we want to model P(x | y) as well as P(y). [01:09:13] But there are 2^10,000 possible values of x, because x is a binary vector that is 10,000-dimensional. So if we try to model P(x | y) in the straightforward way, as a multinomial distribution over the 2^10,000 possible outcomes, then you need 2^10,000 parameters, which is a lot; or actually, technically, you need 2^10,000 minus one parameters, because they have to add up to one, so you save one parameter. But modeling this without additional assumptions won't work, because of the excessive number of parameters. [01:10:00] So in the Naive Bayes algorithm, we're going to assume that the x_i's are conditionally independent given y.
[01:10:22] Okay, let me just write out what this means. By the chain rule of probability, P(x_1, ..., x_10,000 | y) is equal to P(x_1 | y) times P(x_2 | x_1, y) times P(x_3 | x_1, x_2, y), and so on, up to P(x_10,000 | x_1, ..., x_9,999, y). [01:10:59] I haven't made any assumptions yet; this is just a statement of fact, always true by the chain rule of probability. [01:11:06] And what we're going to assume, which is what this assumption is, is that this is equal to: the first term, no change, but then P(x_2 | y) times P(x_3 | y), and so on, up to P(x_10,000 | y). [01:11:33] This assumption is called a conditional independence assumption; it's also sometimes called the Naive Bayes assumption. You're assuming that, so long as you know y, the chance of seeing the word "aardvark" in your email does not depend on whether the word "a" appears in your email. And this is one of those assumptions that is definitely not a true assumption.
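Written out with n words (n = 10,000 here), the chain-rule expansion and the Naive Bayes assumption just described are:

```latex
% Chain rule (always true, no assumptions):
P(x_1, \dots, x_n \mid y)
  = P(x_1 \mid y)\, P(x_2 \mid x_1, y) \cdots P(x_n \mid x_1, \dots, x_{n-1}, y)

% Naive Bayes (conditional independence) assumption:
P(x_1, \dots, x_n \mid y) = \prod_{i=1}^{n} P(x_i \mid y)
```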
[01:11:56] This is just not a mathematically true assumption; it's like how sometimes your data isn't perfectly Gaussian, but you assume it's Gaussian and you can kind of get away with it. So this assumption is not true in a mathematical sense, but it may be not so horrible that you can't get away with it. [01:12:15] As an aside, if any of you are familiar with probabilistic graphical models, if you've taken CS228, this assumption is summarized in this picture; and if you haven't taken CS228, this picture won't make sense, but don't worry about it. [01:12:31] The point is that once you know the class label, spam or not spam, whether or not each word appears is independent. So this is called conditional independence. The mechanics of this assumption are really just captured by this equation, and you just use this equation; that's all you need to derive Naive Bayes.
[01:12:52] But the intuition is that if I tell you that this piece of email is spam, then whether the word "buy" appears in it doesn't affect your beliefs about whether the word "mortgage" or "discount" or whatever other spammy words appear. [01:13:07] So just to summarize, this is the product from i = 1 through n of P(x_i | y). [01:13:48] All right, so the parameters of this model: I'm going to write phi subscript "j given y = 1" for the probability that x_j = 1 given y = 1, and phi subscript "j given y = 0" for the same probability given y = 0. [01:14:25] And just to distinguish all these phis from each other, I'm going to call this one phi subscript y. So these parameters say: if y = 1 is spam and y = 0 is non-spam, then for a spam email, what's the chance of word j appearing in the email; for a non-spam email, what's the chance of word j appearing in the email; and then also the class prior: what's the prior probability that the next email you receive in your inbox is spam.
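In symbols, the parameters just introduced are:

```latex
\phi_{j \mid y=1} = P(x_j = 1 \mid y = 1), \qquad
\phi_{j \mid y=0} = P(x_j = 1 \mid y = 0), \qquad
\phi_y = P(y = 1)
```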
[01:15:00] And so, to fit the parameters of this model, you would, similar to Gaussian discriminant analysis, write out the joint likelihood: the joint likelihood of these parameters is the product over your training examples of the probability of each (x, y) pair given these parameters, similar to what we had for Gaussian discriminant analysis. [01:15:38] And the maximum likelihood estimates, if you take this, take logs, take derivatives, set them to zero, and solve for the values that maximize this, you find that the maximum likelihood estimates of the parameters are pretty much what you'd expect: phi_y is just the fraction of spam emails, and phi of "j given y = 1" is, well, I'll write this out in indicator function notation. Oh shoot, sorry. [01:16:45] Okay, so that's the indicator function notation for writing out: look through your training set, find all the spam emails, the examples with y = 1, and count what fraction of them had word j in them.
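In indicator function notation, with m training examples, the maximum likelihood estimates described here come out to:

```latex
\phi_y = \frac{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}}{m}, \qquad
\phi_{j \mid y=1} = \frac{\sum_{i=1}^{m} 1\{x_j^{(i)} = 1,\ y^{(i)} = 1\}}
                         {\sum_{i=1}^{m} 1\{y^{(i)} = 1\}}
```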
[01:16:58] So your estimate of the chance of word j appearing, say your estimated chance of the word "buy" appearing in a spam email, is just: of all the spam emails in your training set, what fraction of them contained the word "buy"; what fraction of them had x_j = 1 for, say, the word "buy"? [01:17:18] Um, and so it turns out that if you implement this algorithm, it will nearly work, I guess. But this is Naive Bayes for email spam classification. And it turns out that with one fix to this algorithm, which we'll talk about on Wednesday, this is actually a not-too-horrible spam classifier. It turns out that if you use logistic regression for spam classification, you do better than this almost all the time, but this is a very efficient algorithm.
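The counting just described can be sketched in a few lines (the toy data and function names are made up for illustration; no Laplace smoothing yet, so zero counts stay zero):

```python
def fit_naive_bayes(X, y):
    """Fit Bernoulli Naive Bayes by counting: the maximum likelihood
    estimates described in the lecture. X is a list of binary feature
    vectors, y a list of 0/1 labels (1 = spam)."""
    n = len(X[0])
    spam = [x for x, label in zip(X, y) if label == 1]
    ham = [x for x, label in zip(X, y) if label == 0]
    phi_y = len(spam) / len(X)                               # P(y = 1)
    phi1 = [sum(x[j] for x in spam) / len(spam) for j in range(n)]
    phi0 = [sum(x[j] for x in ham) / len(ham) for j in range(n)]
    return phi_y, phi1, phi0

# Toy data: four "emails" over a 3-word vocabulary.
X = [[1, 0, 1], [1, 1, 1], [0, 1, 0], [0, 0, 1]]
y = [1, 1, 0, 0]
phi_y, phi1, phi0 = fit_naive_bayes(X, y)
# phi_y = 0.5, phi1 = [1.0, 0.5, 1.0], phi0 = [0.0, 0.5, 0.5]
```

Note there is nothing iterative here: fitting is a single pass of counting.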
[01:17:55] Because estimating these parameters is just counting, and then computing probabilities is just multiplying a bunch of numbers, there's nothing iterative about this: you can fit this model very efficiently, and you can also keep updating the model as you get new data, even as, you know, users hit "mark as spam" or whatever; as you get new data, you can update this model very efficiently. [01:18:16] But it turns out that actually the biggest problem with this algorithm is what happens if you get zeros in some of these equations; we'll come back to that when we talk about Laplace smoothing on Wednesday. [01:18:30] Okay, all right, any quick questions before we wrap up? Okay, good. So now you've learned about generative learning algorithms. We'll come back on Wednesday and learn some more of the fine details of how to make this work.
So let's break; we'll see you on Wednesday.

================================================================================
LECTURE 006
================================================================================
Lecture 6 - Support Vector Machines | Stanford CS229: Machine Learning
Andrew Ng (Autumn 2018)
Source: https://www.youtube.com/watch?v=lDwow4aOrtg
---
Transcript

[00:00:03] All right, hey everyone, good morning, and welcome back. So what I'd like to do today is continue our discussion of Naive Bayes, and in particular describe how to use the Naive Bayes generative learning algorithm to build a spam classifier that will almost work. And so today you'll see how Laplace smoothing is one other idea you need to add to the Naive Bayes algorithm we described on Monday to really make it work for, say, email spam classification or for text classification. And then we'll talk about a different version of Naive Bayes that's even better than the one we've been discussing so far, and a little bit about advice for applying machine learning algorithms.
[00:00:49] This will be useful to you as you get started on your class projects as well; this is strategy for how to choose an algorithm, what to do first, what to do second. And then we'll start with an intro to support vector machines. [00:01:01] So to recap, the Naive Bayes algorithm is a generative learning algorithm in which, given a piece of email or a Twitter message or some piece of text, you go through a dictionary and put in zeros and ones depending on whether different words appear in a particular email, and so this becomes your feature representation for, say, an email that you're trying to classify as spam or non-spam. [00:01:32] So, using the indicator function notation, x_j: I've been trying to use a subscript j, not entirely consistently, to denote indexes into features, and i to index into training examples, but we'll see.
[00:01:50] x_j is an indicator for whether the word j appears in an email. And so to build a generative model for this, we need to model these two terms, P(x | y) and P(y). Gaussian discriminant analysis models these two terms with a Gaussian and a Bernoulli respectively, and Naive Bayes uses a different model: with Naive Bayes in particular, P(x | y) is modeled as a product of the conditional probabilities of the individual features given the class label y. [00:02:22] And so the parameters of the Naive Bayes model are: phi subscript y, the class prior, the chance that y equals one before you've seen any features; phi subscript "j given y = 0", which is the chance of that word appearing in a non-spam email; and phi subscript "j given y = 1", which is the chance of that word appearing in a spam email. [00:02:49] And so you can derive the maximum likelihood estimates.
[00:02:57] You will find that the maximum likelihood estimator for phi_y is just the fraction of training examples that were spam. [00:03:36] And this is just the indicator function notation way of writing: of all of your emails with label y = 0, count what fraction of them had this feature x_j, this word j, appear. [00:03:52] And then finally, at prediction time, you calculate P(y | x); this is computed according to Bayes' rule. [00:04:41] All right, so it turns out this algorithm will almost work, and here's where it breaks down. You know, actually, every year there are some CS229 machine learning students who do a class project, and some people end up submitting it to an academic conference; some CS229 class projects get submitted as conference papers pretty much every year. One of the top machine learning conferences is the conference NIPS; the name stands for Neural Information Processing Systems.
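The prediction step just mentioned, P(y | x) via Bayes' rule under the Naive Bayes factorization, can be sketched as follows (the parameter values here are made up for illustration):

```python
def predict_spam_probability(x, phi_y, phi1, phi0):
    """P(y = 1 | x) by Bayes' rule, with P(x | y) factored as a product
    of per-word Bernoulli terms (the Naive Bayes assumption)."""
    p_x_spam, p_x_ham = 1.0, 1.0
    for xj, p1, p0 in zip(x, phi1, phi0):
        p_x_spam *= p1 if xj else (1 - p1)
        p_x_ham *= p0 if xj else (1 - p0)
    numerator = p_x_spam * phi_y
    return numerator / (numerator + p_x_ham * (1 - phi_y))

# Made-up parameters over a 3-word vocabulary:
phi_y = 0.5
phi1 = [0.8, 0.5, 0.9]   # P(word j appears | spam)
phi0 = [0.1, 0.5, 0.2]   # P(word j appears | non-spam)
p = predict_spam_probability([1, 0, 1], phi_y, phi1, phi0)
```

Note that if any estimated phi is exactly 0, one factor zeroes out the whole product, which is precisely the failure mode discussed next.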
[00:05:18] And let's say that in your dictionary, you know, you have 10,000 words, and let's say that the word "nips" corresponds to word number 6017 in your 10,000-word dictionary. But up until now, presumably, you've not had a lot of emails from your friends asking, "hey, do you want to submit a paper to the NIPS conference or not?" And so if you use your current set of emails to find the maximum likelihood estimates of the parameters, you will probably estimate the probability of seeing this word, given that it's a spam email, as 0: it's 0 over the number of examples that you've labeled as spam email. [00:06:08] So if you train up this model using your personal email, probably none of the emails you've received in the last few months had the word "nips" in it, maybe.
[00:06:20] So if you plug in this formula for the maximum likelihood estimate, the numerator is 0, and so you estimate that this probability is 0; and then similarly, this one is also 0 over, you know, the number of non-spam emails. [00:06:38] That's what this formula says, and statistically it's just a bad idea to say that the chance of something is zero just because you haven't seen it yet. Where this will cause the Naive Bayes algorithm to break down is if you use these as estimates of the parameters: this is your estimated parameter phi subscript "6017 given y = 1", and this is phi subscript "6017 given y = 0". [00:07:11] And then you calculate this probability, which is equal to a product from i = 1 through n, where n is 10,000 if you have 10,000 words, of P(x_i | y). And so suppose you train your spam classifier on the email you've gotten up until today.
[00:07:39] Then after CS229, your project teammates start sending you email saying, hey, you know, we liked the class project; shall we submit this class project to the NIPS conference? The NIPS conference deadlines are, you know, sort of May or June of those years, so you finish the class project this December, work on it some more, pretty diligently, March, April of next year, and then maybe submit to the conference in May or June of 2019, and you start getting emails from your teammates: let's submit our paper to the NIPS conference. Then, when you start to see the word "nips" in your email, maybe in March of next year, this product of probabilities will have a 0 in it, and so this thing that I circled will evaluate to zero, because you're multiplying a lot of numbers, one of which is 0. And in the same way, this is also 0, and this is also 0, because there'll be that one term bringing down the whole product.
[00:08:38] And so what that means is, if you train the spam classifier today using all the data you have in your email inbox so far, and if tomorrow, or two months from now, whatever, the first time you get an email from your teammates that has the word "nips" in it, your spam classifier will estimate this probability as zero over zero plus zero, okay? [00:09:02] Now, apart from the divide-by-zero error, it turns out that this is just a bad idea, right, to estimate the probability of something as zero just because you have not seen it once yet. [00:09:20] So what I want to do is describe to you Laplace smoothing, which is a technique that helps address this problem. [00:09:28] And in order to motivate Laplace smoothing, let me use a different example for now.
[00:09:58] Several years ago, and this is older data now, I was tracking the progress of the Stanford football team. That year, on 9/12, the football team played Wake Forest, and I think these are all the away games we played that year, and we did not win that game. Then we played Oregon State, and we did not win that game either. [00:10:45] And the question is, these are all away games, almost all the away games we played that year, and so if you were, you know, Stanford football's biggest fan, and you followed them to every single out-of-state game and watched all these games, the question is: after this unfortunate streak, when you follow them to their next away game, what's your estimate of the chance of their winning or losing?
so let's say this is the variable x. You would estimate the probability of their winning by counting up the number of wins and dividing that by the number of wins plus the number of losses. And so in this case you'd estimate this as 0 divided by the number of wins, 0, plus the number of losses, 4, which is equal to 0. Okay, that's kind of mean, right? They lost four games, but to say that the chance of their winning is 0, that you're absolutely certain they'll lose, intuitively that's not a good idea. And so what Laplace smoothing does is imagine that we saw one more of each of the possible outcomes than we actually did: add one to the number of wins we actually saw, and also add one to the number of losses. So if you actually saw 0 wins, pretend you saw one; if we saw four losses, pretend you saw one more than you actually saw. And so Laplace smoothing
ends up adding 1 to the numerator and adding 2 to the denominator, and so this estimate ends up being 1/6. [00:12:43] And that's, well, maybe a more reasonable estimate of the chance of their winning or losing the next game. And there's actually a certain set of circumstances under which this is the optimal estimate, so I didn't just make this up. Laplace, you know, that ancient, well-known, very influential mathematician, was actually estimating the chance of the sun rising the next day. And his reasoning was: well, we've seen the sun rise a lot of times, but that doesn't mean we should be absolutely certain the sun will still rise tomorrow. We've seen the sun rise, say, 10,000 times, so we can be really confident the sun will rise again tomorrow, but maybe not absolutely certain.
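The 0/4 versus 1/6 calculation from the football example can be sketched in a few lines (the function names here are mine, not from the lecture):

```python
def max_likelihood(wins, losses):
    """Maximum likelihood estimate of P(win): wins / (wins + losses)."""
    return wins / (wins + losses)

def laplace(wins, losses):
    """Laplace-smoothed estimate: pretend we saw one extra win and one extra loss."""
    return (wins + 1) / (wins + losses + 2)

# Stanford lost all 4 out-of-state games that year:
print(max_likelihood(0, 4))  # 0.0 -- "absolutely certain they lose", too harsh
print(laplace(0, 4))         # 0.16666..., i.e. 1/6, a more reasonable estimate
```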
Maybe something will go wrong; who knows what will happen in this galaxy. And so he derived the optimal way of estimating the chance the sun will rise tomorrow, and this is actually an optimal estimate under, I'll state the assumption but we don't need to worry about it, the assumption that you are Bayesian with a uniform prior on the chance of the sun rising tomorrow. So if the chance of the sun rising tomorrow is uniformly distributed over the unit interval, anywhere from 0 to 1, then after this set of observations of this coin toss, of whether the sun rises, this is actually the Bayesian optimal estimate of the chance of the sun rising tomorrow. Okay, if you didn't understand what I said in the last 30 seconds, don't worry about it; this is taught in Bayesian statistics classes. But mechanically, what
[00:14:15] you should do is take this formula and add 1 to the number of counts you actually saw for each of the possible outcomes. And more generally, if you're estimating probabilities for a k-way random variable, then you estimate the chance of x being i to be

P(x = i) = ( sum_{j=1}^{m} 1{x^(j) = i} ) / m.

[00:15:02] So that's the maximum likelihood estimate, and for Laplace smoothing you'd add 1 to the numerator and add k to the denominator. So for naive Bayes, the way the smoothing modifies your parameter estimates, I'm just going to copy this over: that's the maximum likelihood estimate, and with Laplace smoothing you add 1 to the numerator and add 2 to the denominator, since each feature is binary and so has k = 2 possible values.
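The general k-way rule just described might be sketched like this (a toy helper; the name and outcome encoding are mine):

```python
from collections import Counter

def laplace_estimate(observations, k):
    """Laplace-smoothed estimate of a k-way random variable over outcomes 1..k:
    P(x = i) = (#{j : x_j = i} + 1) / (m + k)."""
    m = len(observations)
    counts = Counter(observations)
    return {i: (counts.get(i, 0) + 1) / (m + k) for i in range(1, k + 1)}

probs = laplace_estimate([1, 1, 2], k=3)
# Every outcome gets nonzero probability, even outcome 3, which was never seen:
print(probs)  # {1: 0.5, 2: 0.333..., 3: 0.1666...}
```

Note that the smoothed probabilities still sum to 1, since we added k to the denominator and 1 to each of the k numerators.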
And this means the estimated probabilities are never exactly zero or exactly one, which takes away that problem of getting zero over zero. And so the naive Bayes algorithm, yeah, it's not a great spam classifier, but it's not terrible either. And one nice thing about this algorithm is it's so simple: estimating the parameters is just counting, which can be done very efficiently, and then classification is just multiplying a bunch of probabilities together, so this is a very computationally efficient algorithm. All right, any questions about this? [00:16:43] [Student points out a typo] Oh sorry, this y? Oh yes, thank you, you're right. Oh, and by the way, I actually was following the football team that year. I love the football team; they're doing much better now than a few years ago. All right. [00:17:19] So in the example so far, the features were binary-valued.
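The "estimate by counting, classify by multiplying probabilities" recipe for the binary-feature model can be sketched as follows. This is a toy Bernoulli naive Bayes with Laplace smoothing; the function names and the two-feature data are illustrative, not from the lecture:

```python
import math

def train(X, y):
    """Fit Bernoulli naive Bayes by counting.
    X: list of binary feature vectors; y: list of 0/1 labels."""
    n = len(X[0])
    params = {}
    for c in (0, 1):
        docs = [x for x, label in zip(X, y) if label == c]
        # Laplace-smoothed P(x_j = 1 | y = c): (count + 1) / (#docs + 2)
        params[c] = [(sum(d[j] for d in docs) + 1) / (len(docs) + 2)
                     for j in range(n)]
    prior1 = sum(y) / len(y)
    return params, prior1

def predict(params, prior1, x):
    """Classify by comparing log P(x|y) P(y) for y = 0 and y = 1."""
    scores = {}
    for c, prior in ((0, 1 - prior1), (1, prior1)):
        log_p = math.log(prior)
        for j, phi in enumerate(params[c]):
            log_p += math.log(phi if x[j] == 1 else 1 - phi)
        scores[c] = log_p
    return max(scores, key=scores.get)

# Toy data: feature 0 ~ word "drugs" present, feature 1 ~ word "meeting" present.
X = [[1, 0], [1, 0], [0, 1], [0, 1]]
y = [1, 1, 0, 0]
params, prior1 = train(X, y)
print(predict(params, prior1, [1, 0]))  # 1 (spam-like)
```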
A quick generalization is when the features are multinomial-valued. Here's one example: we talked about predicting housing prices, right, that was our very first learning example. Let's say you have a classification problem instead, where you're listing a house and you want to estimate the chance that this house will be sold within the next 30 days, so it's a classification problem. If one of the features is the size of the house x, then one way to turn that feature into a discrete one is to choose a few buckets: say, the size is less than 400 square feet, versus 400 to 800, or 800 to 1200, or greater than 1200 square feet. Then you can set the feature x_i to one of four values. So that's how you discretize this continuous-valued feature into a discrete-valued feature. And if you want to apply naive Bayes to this problem, then the probability of x given y, this is just the same as before:
[00:18:52] the product from j = 1 through n of P(x_j given y), where now this can be a multinomial probability. Right, if x_j now takes on one of four values, say, then this can be estimated as a multinomial probability: instead of a Bernoulli distribution over two possible outcomes, this can be a probability mass function over four possible outcomes, if you discretize the size of a house into four values. And if you ever discretize variables, a typical rule of thumb in machine learning is that we often discretize variables into ten values, into 10 buckets; that just often seems to work well. I drew four here so I didn't have to write out 10 buckets, but if you're ever discretizing variables, you know, most people will start off with discretizing things into 10 values. [00:20:02] Right, and so this is how you can apply naive Bayes to other problems as well.
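The bucketing just described might look like this in code (the four thresholds are the lecture's; the general helper and its name are mine):

```python
import bisect

def discretize_size(sq_ft):
    """Map continuous house size to one of the four buckets from the lecture."""
    if sq_ft < 400:
        return 1
    elif sq_ft < 800:
        return 2
    elif sq_ft < 1200:
        return 3
    else:
        return 4

def discretize(value, thresholds):
    """General bucketing (e.g. for the 10-bucket rule of thumb):
    returns a bucket index in 1..len(thresholds)+1."""
    return bisect.bisect_right(thresholds, value) + 1

print(discretize_size(350))                # 1
print(discretize_size(1000))               # 3
print(discretize(1000, [400, 800, 1200]))  # 3, same bucketing via the helper
```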
For example, you can classify whether a house is likely to be sold in the next 30 days. Now, there's a different variation on naive Bayes that I want to describe to you, one that is actually much better for the specific problem of text classification. So the feature representation for x so far was the following: with a dictionary. Let's say you get an email, you know, a very spammy email that says "drugs buy drugs now". (Now, this is meant as an illustrative example; I'm not selling any drugs.) So if you have a dictionary of 10,000 words, then let's say "a" is word 1, I'm making up the positions just to make this example concrete; let's say that the word "buy" is word 800, "drugs" is the word 1600, and let's say "now" is the 6,200th word in your 10,000-word sorted dictionary. Then the representation for x will be, you know, 0, 0, and so on, with a 1 in each of those three positions.
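A sketch of building that 10,000-dimensional binary vector (the word positions are the made-up ones from the example):

```python
# Hypothetical positions in the lecture's 10,000-word sorted dictionary.
VOCAB = {"a": 1, "buy": 800, "drugs": 1600, "now": 6200}
VOCAB_SIZE = 10_000

def bernoulli_features(email, vocab, vocab_size):
    """Multivariate Bernoulli representation: x[k-1] = 1 iff dictionary word k appears."""
    x = [0] * vocab_size
    for word in email.lower().split():
        if word in vocab:
            x[vocab[word] - 1] = 1
    return x

x = bernoulli_features("drugs buy drugs now", VOCAB, VOCAB_SIZE)
print(sum(x))                    # 3 -- "drugs" is marked once despite appearing twice
print(x[799], x[1599], x[6199])  # 1 1 1
```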
Okay, now, one interesting thing about naive Bayes is that it throws away the fact that the word "drugs" appears twice, right? So that's losing a little bit of information. In this feature representation, each feature is either 0 or 1, and that's part of why it throws away the information that the word "drugs" appeared twice and maybe should be given more weight by your classifier. Um, there's a different representation, which is specific to text. And I think text data has the property that it can be very long or very short: you can have a five-word email or a 1,000-word email, and somehow you're taking very short or very long emails and mapping them all to a feature vector in the same way, to a feature vector that's always the same size. So here's a different representation for this email. For that email, "drugs buy drugs now", we're going to represent it as a
four-dimensional feature vector. More generally, this is going to be n-dimensional for an email of length n. So rather than a 10,000-dimensional feature vector, we now have a four-dimensional feature vector, but now each x_j is an index from 1 to 10,000, instead of just being 0 or 1. Okay, and I guess n varies by training example, so n_i is the length of email i: if it's a longer email, the feature vector x will be longer, and if it's a shorter email, this feature vector will be shorter. Okay, so let's give names to the algorithms we're going to develop. These are really very confusing, frankly horrible names, but this is what the community calls them. The model we've talked about so far is sometimes called the multivariate Bernoulli event model: Bernoulli means coin tosses, and multivariate means there are ten thousand Bernoulli random variables in this model.
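The variable-length representation just described can be sketched as follows (same made-up word positions as before):

```python
# Hypothetical positions in a sorted 10,000-word dictionary, as in the example.
VOCAB = {"a": 1, "buy": 800, "drugs": 1600, "now": 6200}

def multinomial_features(email, vocab):
    """Multinomial event model representation: one dictionary index per word,
    so the vector length n equals the email length."""
    return [vocab[w] for w in email.lower().split() if w in vocab]

x = multinomial_features("drugs buy drugs now", VOCAB)
print(x)       # [1600, 800, 1600, 6200] -- a four-dimensional vector
print(len(x))  # n = 4; a longer email gives a longer vector
```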
The word "event" in "event model" comes from statistics. And the new representation we're going to talk about is called the multinomial event model. These two names are frankly quite confusing, but these are the names; I think one of my friends, Andrew McCallum, as far as I know, wrote the paper that named these two algorithms, and these are the names people seem to use. [00:25:13] And so with this new model, we're going to build a generative model, because with a generative model we model P(x, y), which can be factored as P(x given y) times P(y). And using the naive Bayes assumption, we're going to assume that P(x given y) is the product from j = 1 through n of P(x_j given y), and then times P(y), that second term. Now, one of the reasons these two models frankly are actually very confusing to the
machine learning community is that this is exactly the equation that, you know, you saw on Monday, when we described naive Bayes for the first time; this factorization of P(x given y) into a product of probabilities is exactly the same, so this equation looks cosmetically identical. But with this new model, the second model, the confusingly named multinomial event model, the definition of x_j and the definition of n are very different, right? So instead of a product from 1 through 10,000, there's a product from 1 through the number of words in your email, and this is now a multinomial probability rather than a binary or Bernoulli probability. And it turns out that with this model, the parameters are: same as before, phi_y = P(y = 1), and also the other parameters of this model, phi_{k|y=0}, which is the chance of x_j = k given y = 0. Right, and
then, just to make sure you understand the notation, see if this makes sense: this probability is the chance of word ____ being ____, given y = 0. So what goes in those two blanks? Actually, what goes in the second blank? Let's see. [00:28:00] [Student answers] Yes, it says the chance of, say, the third word in the email being the word "drugs", or of the second word in the email being "buy", or whatever. And one part of what this model implicitly assumes, and why this is tricky, is that we assume this probability doesn't depend on j: that for every position in the email, the chance of the first word being "drugs" is the same as the chance of the second word, or the mth
word, being "drugs", which is why j doesn't actually appear on the left-hand side, right? Any questions about this? And so, given a new email, a test email, the way you would calculate this probability is by, you know, plugging these estimated parameters into this formula. [00:29:27] Oh, and then, I wrote down phi_y, and similarly you define the parameters both with y = 1 and with y = 0. And then for the maximum likelihood estimates of the parameters, I'll just write out one of them: your estimate of the chance of a given word, really any word in any position, being word k. What's the chance of some word in a non-spam email being the word "drugs", say? The chance of that is equal to

phi_{k|y=0} = ( sum_{i=1}^{m} 1{y^(i) = 0} * sum_{j=1}^{n_i} 1{x_j^(i) = k} ) / ( sum_{i=1}^{m} 1{y^(i) = 0} * n_i )

where this uses indicator function notation. Those sums look complex, so let me say in a second what this actually means. So the
denominator: [00:30:39] if you figure out what the English meaning of this complicated formula is, it basically says, look at all the words in all of your non-spam emails, all the emails with y = 0. And of all of those words, what fraction of those words is the word "drugs"? And that's your estimate of the chance of the word "drugs" appearing in any given position in a non-spam email. And so in the end, the denominator is a sum over your training set of the indicator that the email is not spam, times the number of words in that email; so the denominator ends up being the total number of words in all of your non-spam emails in your training set. And the numerator is a sum over your training set, from i equals 1 through m, of the indicator that y^(i) = 0, so you count only the terms for non-spam emails, and for
each non-spam email, j goes from 1 through n_i, over the words in that email, counting how many of those words are word k. Right, and so if in your training set you have, you know, a hundred thousand words in your non-spam emails, and 200 of them are the word "drugs", it occurs, you know, 200 times, then this ratio would be 200 over 100,000. [00:31:59] Oh, and then lastly, to implement Laplace smoothing with this, you would add 1 to the numerator as usual, and then, let's see, actually, what would you add to the denominator? [00:32:30] [Student: "k?"] Wait, but what is k? Not k, right; k is a variable here, it indexes into the words. [Student: "10,000?"] Oh, I see why you said k: I think I overloaded the notation. When defining Laplace smoothing, I used k as the number of possible outcomes, but here k is an index. Yeah, right, so you add 1 to the numerator and add the number of possible outcomes to the
denominator, which in this case is 10,000. So this is the probability of x being equal to the value k, where k ranges from 1 to 10,000, if you have a dictionary, a vocabulary list, of 10,000 words that you're modeling; and so the number of possible values for x is 10,000, and you'd add 10,000 to the denominator. [00:33:46] [Student: what do you do with words that aren't in the dictionary?] Oh, what do you do with words that aren't in the dictionary? So there are two approaches to that. One is to just throw it away: just ignore it, disregard it. That's one. The second approach is to take the rare words and map them to a special token, which traditionally is denoted UNK, for "unknown word". So if from your training set you decide to take just the top 10,000 words as your dictionary, then everything that's outside the top 10,000 words you can map to, you know, an unknown-word token, a special symbol.
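A sketch of the smoothed estimate of phi_{k|y=0} with an UNK token, using a tiny stand-in vocabulary (all names and the two toy emails here are illustrative, not from the lecture):

```python
from collections import Counter

UNK = 0                                    # special index for out-of-vocabulary words
VOCAB = {"buy": 1, "drugs": 2, "now": 3}   # tiny stand-in for the 10,000-word dictionary
K = len(VOCAB) + 1                         # number of possible outcomes, including UNK

def to_indices(email):
    """Multinomial event model representation, mapping rare words to UNK."""
    return [VOCAB.get(w, UNK) for w in email.lower().split()]

def estimate_phi(emails, labels, target_y=0):
    """Laplace-smoothed phi_{k|y=target}:
    (count of word k across class emails + 1) / (total words in class + K)."""
    counts, total = Counter(), 0
    for email, y in zip(emails, labels):
        if y == target_y:
            idx = to_indices(email)
            counts.update(idx)
            total += len(idx)
    return {k: (counts.get(k, 0) + 1) / (total + K) for k in range(K)}

emails = ["meeting now", "buy drugs now"]
labels = [0, 1]
phi0 = estimate_phi(emails, labels, target_y=0)
# The one non-spam email has 2 words: "meeting" -> UNK, "now" -> 3.
# phi0[2] ("drugs") = (0 + 1) / (2 + 4) = 1/6: nonzero despite never appearing.
```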
[00:34:19] Oh, why did I write that there? Oh, this is indicator function notation. So an indicator function, this notation, takes a statement that's either true or false and tells me whether it holds: the indicator of "2 equals 1 plus 1" is 1, because that statement is true, and the indicator of a false statement is 0. [00:34:58] Here it tells me whether y is zero, and since y is 0 or 1, the indicator of "y equals 0" is the same as "not y", which is 1 minus y.

[00:35:16] All right, so both event models, including the details of the maximum likelihood estimates, are written out in more detail in the lecture notes. Um, so, you know, when would you use the naive Bayes algorithm? It turns out naive Bayes is actually not very competitive with other learning algorithms: for most problems you'll find that logistic regression will do better in terms of delivering higher accuracy. But the advantages of naive Bayes are, first, it's computationally very efficient, and second, it's relatively quick to implement.
[00:35:57] It also doesn't require an iterative gradient descent procedure, and the number of lines of code needed to implement naive Bayes is relatively small. So if you're facing a problem where your goal is to implement something quick and dirty, then naive Bayes may be a reasonable choice.

[00:36:14] And I think, you know, as you work on your class projects, I think some of you, probably a minority, will try to invent a new machine learning algorithm and write a research paper, and I think, you know, inventing new machine learning algorithms is a great thing to do; it helps lots of people on a lot of different applications. But I think the majority of the class projects in 229 will try to apply a learning algorithm to a project that you care about: maybe apply it to a research project you're working on somewhere at Stanford, or apply it to a fun application you want to build, or apply it to a business application,
[00:36:53] for some of you taking this on SCPD, taking this remotely. And if your goal is not to invent a brand new learning algorithm but to take existing algorithms and apply them, then the rule of thumb I'd suggest to you is this: when you get started on a machine learning project, start by implementing something quick and dirty. [00:37:11] Instead of implementing the most complicated possible learning algorithm, start by implementing something quickly, train the algorithm, look at how it performs, and then use that to debug the algorithm and keep iterating. All right? So I think, you know, here at Stanford we're very good at coming up with very, very complicated algorithms. But if your goal is to make something work for an application, if your priority isn't inventing a new algorithm and publishing a paper on a new technical contribution, if your main goal is, say, you're working on an application on
[00:37:45] understanding news better, or improving the environment, or estimating prices, or whatever, and your primary objective is just to make an algorithm work, then rather than building a very complicated algorithm from the outset, I would recommend implementing something quickly, so that you can then better understand how it's performing, and then do error analysis, which we'll talk about later, and use that to drive your development.

[00:38:10] You know, one analogy I sometimes make is this: if you're writing a new computer program with 10,000 lines of code, one approach is to write all 10,000 lines of code first and then try compiling it for the first time, and that's clearly a bad idea. Instead, you know, you should write small modules, unit test them, and build up the program incrementally, rather than write
[00:38:43] 10,000 lines of code and only then see what syntax errors the compiler gives you for the first time. And I think it's similar for machine learning: instead of building a very complicated algorithm from the get-go, build a simpler algorithm, test it, and then use that, see what it's doing well and what it's doing wrong, to improve from there. You often end up getting to a better-performing algorithm faster.

[00:39:08] So here's one example. This is actually something I used to work on, on, you know, anti-spam; students worked on spam classification many years ago. And it turns out that when you start out on a new application problem, it's hard to know what the hardest part of the problem is. So if you want to build an anti-spam classifier, there are lots of things you could work on. For example, spammers would deliberately
[00:39:38] misspell words. You know, take "mortgage", right? "Refinance your mortgage" or whatever; instead of writing the word "mortgage", the spammers would write "m0rtgage" with a zero, or instead of the "a" maybe a slash. But all of us as people have no trouble reading this as the word "mortgage", whereas this would trip up a spam filter: it might map the word to an unknown-word token just because the filter hasn't seen it before, and that lets this word slip by the spam filter. [00:40:12] So that's one idea for improving spam detection, and students have actually written papers on mapping these back to words, so that the spam filter can see the words the way that humans see them. All right, so that's one idea.

[00:40:27] Another idea: a lot of spam emails spoof email headers. You know, spammers often try to hide where the email truly came from by spoofing the email header, the
[00:40:44] email address, the "From" information. And another thing you might do is try to fetch the URLs that are referred to in the email and then analyze the webpages that you get back. So there are a lot of things that you could do to improve a spam filter, and any one of these topics could easily be three months or six months of research. [00:41:05] But when you're building, say, a new spam filter for the first time, how do you actually know which of these is the best investment of your time? So my advice, to those of you who work on a project where your primary goal is just to get the system to work, is to not somewhat arbitrarily dive in and spend six months on improving this, or spend, you know, six months on trying to analyze email headers, but to just implement a more basic algorithm, implement something quick and dirty, and then look at the examples that your learning algorithm is still
[00:41:38] misclassifying. If, after you've implemented a quick and dirty algorithm, you find that your anti-spam algorithm is misclassifying a lot of examples with these deliberately misspelled words, it's only then that you have evidence that it's worth spending a bunch of time solving the deliberately-misspelled-words problem. But if you implement a spam filter and you see that it's not misclassifying a lot of examples with these misspelled words, then I would say don't bother: go work on something else instead, or at least treat that as a lower priority.

[00:42:07] So one of the uses of GDA, Gaussian discriminant analysis, as well as naive Bayes, is this: they're not going to be the most accurate algorithms. If you want the highest accuracy, there are other algorithms, like logistic regression, or support vector machines, or neural networks, which we'll
talk about later, and which will almost always give you higher classification accuracy than these algorithms. [00:42:30] But the advantage of Gaussian discriminant analysis and naive Bayes is that they're very quick to train. There's no iteration: naive Bayes is just counting, and GDA is just computing means and covariances. Right, so they're very computationally efficient, and they're also simple to implement, so they can help you implement that quick and dirty thing that helps you get going more quickly.

[00:42:54] And so I think for your projects as well, I would advise most of you, as you start working on your project, don't spend weeks designing exactly what you're going to do. If you have an idea for an applied project, instead get the dataset and apply something simple first: start with logistic regression, not a neural network,
[00:43:18] not something more complicated; or start with naive Bayes; and then see how that performs, and then go from there. Okay?

[00:43:27] All right, so that's it for naive Bayes and generative learning algorithms. The next thing I want to do is move on to a different type of classifier, which is the support vector machine. Let me check if there are any questions about this first.

[00:44:08] Oh wait, oh sorry. Oh, can you use logistic regression with discrete variables? [00:44:20] Oh, I see. Yeah, right, yes. So one of the weaknesses of the naive Bayes algorithm is that it treats all the words as, you know, completely separate from each other. So the words "one" and "two" are quite similar, and likewise the words "mother" and "father" are quite similar, but with this feature representation it doesn't know the relationships between these words. So in machine learning there are other ways of representing words: there's this technique called word
[00:44:57] embeddings, in which we choose a feature representation that encodes the fact that the words "one" and "two" are quite similar to each other, or the words "mother" and "father" are quite similar to each other, or, you know, the words, whatever, "London" and "Tokyo" are quite similar to each other because they're both city names. And so this is a technique that I was not planning to teach here, but it is taught in CS230, so you can read up on word embeddings or look at some of the videos or lessons from CS230 if you want to. [00:45:29] So the word embeddings technique, this is a technique from neural networks, will reduce the number of training examples you need to learn a good text classifier, because it comes in with more knowledge built in.

[00:45:51] By the way, do we cover this in the other classes? No? Okay, no, we don't cover that here. Actually, CS224N, I think, also covers
[00:46:13] this. Yeah, the NLP class, sure.

Okay, so, support vector machines. SVMs. [00:46:39] Um, let's see: consider a classification problem where the dataset looks like this, and so you want an algorithm to find, you know, a nonlinear decision boundary, right? So the support vector machine will be an algorithm to help us find potentially very, very nonlinear decision boundaries like this. [00:47:08] Now, one way to build a classifier like this would be to use logistic regression. But if this is x1 and this is x2, then logistic regression will fit a straight-line decision boundary to the data. So one way to apply logistic regression to a dataset like this would be to take your feature vector x1, x2 and map it to a higher-dimensional feature vector with, you know, x1, x2, x1 squared, x2 squared, x1 times x2, maybe x1 cubed, x2 cubed, and so on, and have a new feature vector,
[00:47:43] which we'll call phi of x, that has these higher-dimensional features. Now, it turns out that if you do this and then apply logistic regression to this augmented feature vector, then logistic regression can learn nonlinear decision boundaries. With these added features, logistic regression can actually learn a decision boundary that has, say, the shape of an ellipse. [00:48:07] But manually choosing these features is a little bit of a pain, right? You know, I actually don't know what type of set of features could get you a decision boundary you like, something more complex than just an ellipse. What we will see with support vector machines is that we will be able to derive an algorithm that can take, say, input features x1, x2, map them to a much higher-dimensional set of features, and then apply a linear classifier.
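As a sketch of that first approach, explicit features plus plain logistic regression, here is a toy version. The dataset, learning rate, and iteration count are all made up for illustration; the true boundary is the circle x1 squared plus x2 squared equals 1, which is linear in the phi-space below.

```python
import math
import random

def phi(x1, x2):
    # Augmented feature vector: the squared and cross terms are what let a
    # linear classifier in phi-space trace an ellipse in (x1, x2)-space.
    return [1.0, x1, x2, x1 * x1, x2 * x2, x1 * x2]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy dataset (made up): label 1 inside the unit circle, 0 outside.
random.seed(0)
points = [(random.uniform(-2, 2), random.uniform(-2, 2)) for _ in range(300)]
X = [phi(x1, x2) for x1, x2 in points]
y = [1.0 if x1 * x1 + x2 * x2 < 1.0 else 0.0 for x1, x2 in points]

# Plain batch gradient ascent on the logistic log-likelihood.
theta = [0.0] * 6
for _ in range(1500):
    grad = [0.0] * 6
    for features, label in zip(X, y):
        error = label - sigmoid(sum(t * f for t, f in zip(theta, features)))
        for j in range(6):
            grad[j] += error * features[j]
    theta = [t + 0.1 * g / len(X) for t, g in zip(theta, grad)]

def predict(features):
    return sigmoid(sum(t * f for t, f in zip(theta, features))) > 0.5

accuracy = sum(predict(f) == (label == 1.0) for f, label in zip(X, y)) / len(y)
print(accuracy)  # high, even though the boundary in (x1, x2) is a circle
```

The classifier itself is still linear in theta; all of the nonlinearity lives in the hand-built phi, which is exactly the pain point that kernels will remove.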
[00:48:42] It's similar to logistic regression but different in the details, and it allows you to learn very nonlinear decision boundaries. And I think, you know, one of the reasons support vector machines are used today is that the SVM is a relatively turnkey algorithm, and what I mean by that is it doesn't have too many parameters to fiddle with. [00:49:05] Even for logistic regression or for linear regression, you know, you might have to tune the gradient descent parameter, tune the learning rate, sorry, change the learning rate alpha, and that's just another thing to fiddle with: you try a few values and hope you didn't mess up how you set that value. [00:49:22] Whereas a support vector machine today has very robust, very mature software packages that you can just download to train a support vector machine on, you know, on a problem, and you just run it, and
[00:49:34] the algorithm will kind of converge without you having to worry too much about the details. So I think, in the grand scheme of things today, I would say support vector machines are not as effective as neural networks for many problems, but one redeeming property of support vector machines is that they're turnkey: you kind of just turn the key and it works, and there aren't as many parameters, like the learning rate and other things, that you have to fiddle with.

[00:50:09] So the roadmap is that we're going to develop the following set of ideas. We'll talk about the optimal margin classifier today, and we'll start with the separable case. What that means is that we're going to start off with datasets that we assume look like this and that are linearly separable. And so the optimal margin classifier is the basic building block of a support vector machine, and we'll first derive an algorithm
[00:50:47] that'll have some similarities to logistic regression, but that allows us to scale in an important way, to find a linear classifier for training sets like this that we assume, for now, can be linearly separated. So we'll do that today. And then what you'll see on Wednesday, excuse me, next Monday, what you'll see next Monday is an idea called kernels. [00:51:13] And the kernel idea is one of the most powerful ideas in machine learning. It's this: how do you take a feature vector x, maybe in R2, and map it to a much higher-dimensional set of features? In our example there, that was R5, right? And then train an algorithm on this higher-dimensional set of features. And the cool thing about kernels is that this higher-dimensional set of features may not be R5: it might be R^100,000, or it might even be infinite-dimensional.
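A tiny numeric check of why that can even be possible, with made-up vectors: for the quadratic kernel K(x, z) = (x . z)^2, the value computed in the original R^3 equals the inner product of explicit feature vectors in R^9 holding all pairwise products x_i x_j, so the higher-dimensional inner product never has to be formed.

```python
def quadratic_kernel(x, z):
    # K(x, z) = (x . z)^2, computed entirely in the original space.
    return sum(a * b for a, b in zip(x, z)) ** 2

def phi(x):
    # The explicit feature map this kernel corresponds to: all pairwise
    # products x_i * x_j. For x in R^n, phi(x) lives in R^(n*n).
    return [a * b for a in x for b in x]

x = [1.0, 2.0, 3.0]
z = [4.0, 0.5, -1.0]

print(quadratic_kernel(x, z))                      # 4.0
print(sum(a * b for a, b in zip(phi(x), phi(z))))  # also 4.0
```

The kernel costs n multiplications, while the explicit route needs n squared features per vector; for a Gaussian kernel the corresponding phi is infinite-dimensional, which is exactly the point being made here.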
[00:51:52] And so with the kernel formulation, we can really take, you know, the original set of features that you were given, for the houses you're trying to sell, you know, or the medical conditions you're trying to predict, and map this two-dimensional feature vector space into maybe an infinite-dimensional set of features. And what this does is it relieves us from a lot of the burden of manually picking features, right? Like, do you want to have the square root of x1, or maybe x1 times x2 to the power of two-thirds? You just don't have to fiddle with these features too much, because the kernels will allow you to choose from an infinitely large set of features. [00:52:26] Okay, and then finally we'll talk about the inseparable case. So I'm going to do the first part today and the rest next Monday.

[00:52:55] And by the way, you know, the machine learning world has become a little... it's funny, I think: if you read the news, the media talks a lot about machine learning, and the media just talks about, you know, neural networks all the time, right? And you hear about neural
networks and deep learning, which we'll get to later in this class. But if you look at what actually happens in practice in machine learning, the set of algorithms actually used in practice is much wider than neural networks and deep learning. So we do not live in a neural-networks-only world; we actually use many, many tools in machine learning. It's just that deep learning attracts the attention of the media in a way that's quite disproportionate to what I find useful. You know, I love them, but they're not the only thing in the world. [00:53:44] And so, yeah, late last night I was talking to an engineer about factor analysis, which you'll learn about later in CS229, right, an unsupervised learning algorithm, and there's an application that one of my teams is working on in manufacturing where we're going to use factor analysis or something
very similar to it, which is totally not a neural network technique, right? So there are all these other techniques, including support vector machines, that I think you can use and that are important. [00:54:11] All right, so let's start developing the optimal margin classifier. [00:54:31] So first let me define the functional margin, which, informally, is this: the functional margin of a classifier is how confidently and accurately you classify an example. So here's what I mean. We're going to look at binary classification, and we're going to start by motivating this with logistic regression. So there's a classifier h_theta(x) equal to the logistic function applied to theta transpose x, and if you turn this into a binary classifier, if you have this algorithm predict not a probability but predict 0 or 1, then what
the classifier will do is predict 1 if theta transpose x is greater than or equal to 0, and predict 0 otherwise, because theta transpose x greater than 0 means that g(theta transpose x) is greater than 0.5. (You can make it greater-than or greater-than-or-equal; if it's exactly 0.5 it doesn't really matter what you do.) And so you predict 1 if theta transpose x is greater than or equal to 0, meaning that the estimated probability of the class being 1 is greater than 50/50, and if theta transpose x is less than 0 then you predict that the class is 0. Okay, so this is what happens if you have logistic regression output 1 or 0 rather than output a probability. [00:56:09] So in other words, this means that if y(i) is equal to 1, then what we hope is that theta transpose x(i) is much greater than 0, and this double greater-than sign
means much greater, right? Because if the true label is 1, then if the algorithm is doing well, hopefully theta transpose x will be very positive, so that the output probability is very, very close to 1. And if indeed theta transpose x is much greater than zero, then g(theta transpose x) will be very close to one, which means it's giving a very accurate and confident prediction that the class is one. And if y is equal to zero, then what we want, or what we hope, is that theta transpose x(i) is much less than zero, because if this is true then the algorithm is doing very well on this example. [00:57:39] So the functional margin, which we'll define in a second, captures this idea: if the classifier has a large functional margin, it means that these two statements are true.
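As a quick numerical sketch of those two statements (the parameter values below are made up for illustration, not from the lecture):

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(theta, x):
    """Thresholded logistic regression: predict 1 iff theta^T x >= 0."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1 if z >= 0 else 0

# theta^T x much greater than 0  ->  g(theta^T x) very close to 1
# theta^T x much less than 0     ->  g(theta^T x) very close to 0
print(sigmoid(6.0))   # confident positive: probability near 1
print(sigmoid(-6.0))  # confident negative: probability near 0
```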
Separately, there's a different thing we'll define in a second, the so-called geometric margin, and that's the following. For now, let's assume the data is linearly separable. [00:58:16] So let's say that's the data set. Now, that seems like a pretty good decision boundary for separating the positive and negative examples. [00:58:36] And that's another decision boundary, in red, that also separates the positive and negative examples, but somehow the green line looks much better than the red line. So why is that? Well, the red line comes really close to a few of the training examples, whereas the green line, you know, has a much bigger separation, a much bigger distance from the positive and negative examples. So even though the red line and the green line both, you know, perfectly separate the positive and negative examples, the green line has a much bigger separation, which is called the geometric margin: a much bigger geometric margin, meaning a physical separation
from the training examples even as it separates them, okay? [00:59:27] And so what I'd like to do in, I guess, the next 20 minutes is formalize the definition of the functional margin, formalize the definition of the geometric margin, and then pose the optimal margin classifier, which is basically an algorithm that tries to maximize the geometric margin. So what the rudimentary SVM does, also called the optimal margin classifier, is pose an optimization problem to try to find the green line to classify these examples. [01:00:08] So now, in order to develop SVMs, I'm going to change the notation a little bit, because these algorithms have different properties, and using slightly different notation makes the math easier. So when developing SVMs, we're going to use minus 1 and plus 1 to denote the class labels. [01:00:33] And we're going to have the output, so
rather than having the hypothesis output a probability like you saw in logistic regression, the support vector machine will output either minus 1 or plus 1. And so g(z) becomes minus 1 or 1: output 1 if z is greater than or equal to 0, and minus 1 otherwise. So instead of a smooth transition from 0 to 1, we have a hard transition, an abrupt transition, from negative 1 to plus 1. [01:01:35] And finally, where previously we had, for logistic regression, h_theta(x) = g(theta transpose x), where x was in R^(n+1) with x0 = 1, for the SVM we will have, let me just write this out: the parameters of the SVM will be w and b, and the hypothesis applied to x will be h_{w,b}(x) = g(w transpose x + b). And we're dropping the x0 = 1 convention, so we separate out w and b as follows. This is the standard notation used to develop support vector machines.
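A minimal sketch of this notation change in code (the particular numbers are invented, chosen only to exercise both sides of the threshold):

```python
def g(z):
    """Hard threshold used by the SVM: +1 if z >= 0, else -1."""
    return 1 if z >= 0 else -1

def h(w, b, x):
    """SVM hypothesis h_{w,b}(x) = g(w^T x + b); b plays the role of
    theta_0 and w plays the role of (theta_1, ..., theta_n)."""
    return g(sum(wi * xi for wi, xi in zip(w, x)) + b)

# hypothetical parameters and inputs
w, b = [2.0, -1.0], 0.5
print(h(w, b, [1.0, 1.0]))   # w^T x + b = 1.5  -> +1
print(h(w, b, [0.0, 3.0]))   # w^T x + b = -2.5 -> -1
```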
One way to think about this is: if the parameters were, you know, theta_0, theta_1, theta_2, theta_3, then theta_0 is the new b and (theta_1, theta_2, theta_3) is the new w. So you just separate out the theta_0, which previously multiplied x0 = 1, and so on. And so this term here becomes the sum from i equals 1 through n of w_i x_i, plus b, which takes the place of theta_0 times x0. [01:03:35] All right, so let me formalize the definition of the functional margin. [01:03:49] So the parameters w and b define a linear classifier, right, you know what the form is, I just wrote it down. The parameters w and b define a hyperplane; really it defines a line, or in higher dimensions it'd be a plane or a hyperplane, a straight boundary separating the positive and negative examples. And so we're going to say the functional margin of a hyperplane defined by (w, b) with respect to one training example is written as gamma-hat(i), and hyperplane just means a straight line, right, but in high dimensions, so this is a linear classifier. So it's just,
you know, the functional margin of this classifier with respect to one training example, which we're going to define as gamma-hat(i) = y(i) (w transpose x(i) + b). And so if you compare this with the equations we had up there: if y equals one we hope for that, and if y equals zero we hope for that, so really what we hope for is for the classifier to achieve a large functional margin, right? And so if y(i) equals 1, then what we want, or what we hope for, is that w transpose x(i) + b is much greater than 0, and if y(i) is equal to minus 1, then what we hope is that this is much smaller than zero. And if you kind of combine these two statements, if you take y(i) and multiply it with w transpose x(i) + b, then these two statements together are basically saying that you hope that gamma-hat(i) is much greater than 0, because y(i), you know, is plus 1 or minus 1. And so if y is equal to 1, you want this to be very, very large, and if y is negative 1, you want this to be a
very, very large negative number. And so either way, it's just saying that you hope gamma-hat(i) will be very large. [01:06:38] And as an aside, one property of this as well is that so long as gamma-hat(i) is greater than 0, that means that either w transpose x(i) + b is bigger than 0 or it is less than 0, depending on the sign of the label, and it means that the algorithm gets this one example correct, right? And if gamma-hat(i) is much greater than 0, then, you know, the analogy in the logistic regression case is: being a little bit above 0 here corresponds to the predicted probability being a little bit above or below 0.5, so it at least gets it right, and being much greater than 0 or much less than 0 corresponds to the probability output in the logistic regression case being very close to one or very close to zero.
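That aside is easy to check directly; here's a sketch of the per-example functional margin with a made-up classifier and made-up points:

```python
def functional_margin(w, b, x, y):
    """gamma_hat(i) = y(i) * (w^T x(i) + b), with labels y in {-1, +1}."""
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b)

w, b = [1.0, 1.0], -1.0  # hypothetical linear classifier

# Correctly classified examples have a positive functional margin,
# whether the label is +1 or -1; a misclassified one has a negative margin.
print(functional_margin(w, b, [2.0, 2.0], 1))    # 3.0: correct and confident
print(functional_margin(w, b, [0.0, 0.0], -1))   # 1.0: correct
print(functional_margin(w, b, [0.0, 0.0], 1))    # -1.0: misclassified
```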
[01:08:00] So the next definition: I'm going to define the functional margin with respect to the training set to be gamma-hat equals the min over i of gamma-hat(i), where i ranges over your training examples, okay? So this is a worst-case notion. With this definition of the functional margin, on the left we defined the functional margin with respect to a single training example, which is how you are doing on that one training example, and we define the functional margin with respect to the entire training set as how you are doing on the worst example in your training set. This is a little bit of a brittle notion, but for now, for today, we're assuming that the training set is linearly separable. So I'm going to assume that the training set, you know, looks like this and is separable by a straight line; we'll relax this later. But because we're assuming just for today that the training set is linearly separable.
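That worst-case definition is a one-liner; here's a sketch on a small invented training set:

```python
def functional_margin(w, b, x, y):
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b)

def training_set_functional_margin(w, b, X, Y):
    """gamma_hat = min_i gamma_hat(i): the margin of the worst example."""
    return min(functional_margin(w, b, x, y) for x, y in zip(X, Y))

# tiny, linearly separable toy data (made up)
X = [[2.0, 2.0], [3.0, 1.0], [0.0, 0.0], [-1.0, 0.0]]
Y = [1, 1, -1, -1]
w, b = [1.0, 1.0], -1.0

# positive result means every example is classified correctly
print(training_set_functional_margin(w, b, X, Y))  # 1.0, from [0, 0]
```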
We'll use this kind of worst-case notion and define the functional margin to be the functional margin of the worst training example. [01:09:08] Now, one thing about the definition of the functional margin is that it's actually really easy to cheat and increase the functional margin, right? One thing you can do, if you look at this formula, is take w and multiply it by two, and take b and multiply it by two; then everything here just multiplies by two, and you've doubled the functional margin, right? But you haven't actually changed anything meaningful, okay? So one way to cheat on the functional margin is just by scaling the parameters by two, or instead of two, maybe you multiply all your parameters by ten, and then you've actually increased the functional margin on your training examples 10x, but this doesn't actually change the decision boundary, right? It doesn't actually change any
classification just to multiply all of your parameters by a factor of ten. [01:10:05] So one thing you could do would be to normalize the length of your parameters. So for example, hypothetically, you could impose the constraint that the norm of w is equal to one. Another way to do that: we could take w and b and replace them with w divided by the norm of w and b divided by the norm of w; that is, divide your parameters by the magnitude, by the Euclidean length, of the parameter vector w. And this doesn't change any classification; it's just rescaling the parameters, but it prevents, you know, this way of cheating on the functional margin, okay? And in fact, more generally, you could actually scale w and b by any other value you want and it doesn't matter; you could choose to replace them with w over 17 and b over 17, or any other value, right, and the classification stays the same, okay?
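Both points, the cheat and the fix, can be verified numerically (toy numbers, not from the lecture):

```python
def functional_margin(w, b, x, y):
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b)

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

def norm(w):
    return sum(wi * wi for wi in w) ** 0.5

w, b = [1.0, 1.0], -1.0
x, y = [2.0, 2.0], 1

# The cheat: scaling (w, b) by 10 inflates the functional margin tenfold...
w10, b10 = [10 * wi for wi in w], 10 * b
print(functional_margin(w, b, x, y), functional_margin(w10, b10, x, y))

# ...but the prediction, i.e. the decision boundary, is unchanged.
print(predict(w, b, x) == predict(w10, b10, x))

# The fix: after dividing (w, b) by ||w||, the margin no longer depends
# on the scale you started from.
n, n10 = norm(w), norm(w10)
print(functional_margin([wi / n for wi in w], b / n, x, y))
print(functional_margin([wi / n10 for wi in w10], b10 / n10, x, y))
```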
So we'll come back and use this property. [01:11:26] All right, so having defined the functional margin, let's define the geometric margin, and you'll see in a second how the geometric margin and the functional margin relate to each other. Um, so let's define the geometric margin with respect to a single example. So let's see, let's say you have a classifier, right? So given parameters w and b, that defines a linear classifier, and the equation w transpose x + b = 0 defines the equation of a straight line. So the axes here are x1 and x2, and in this half of the plane you have w transpose x + b greater than 0, and in this half of the plane you have w transpose x + b less than 0, and in between is the straight line given by this equation w transpose x + b = 0, right? And so, given parameters w and b, the upper right is where your classifier will predict y
equals 1, and the lower left is where it'll predict y equal to negative 1, okay? Now let's say you have one training example here, so that's a training example (x(i), y(i)), and let's say it's a positive example, okay? And so your classifier is classifying this example correctly, right, because it's in the upper-right half-plane, the half-plane where w transpose x + b is greater than 0, and so in this upper-right region your classifier is predicting +1, whereas in this lower-left region it'd be predicting h(x) equals negative 1. And that's why this straight line, where it switches from predicting negative to positive, is the decision boundary. [01:13:31] So what we're going to do is define this distance to be the geometric margin of this training example; that Euclidean distance is what we're defining to be the geometric margin. So let me just write down what that is. [01:14:03] So the geometric margin, you know, of the
classifier, of the hyperplane defined by (w, b), with respect to one example (x(i), y(i)): this is going to be gamma(i) equals (w transpose x(i) + b) divided by the norm of w. And let's see, I'm not proving why this is the case; the proof is given in the lecture notes, and the lecture notes show why this is the right formula for measuring the Euclidean distance that I just drew in the picture up there, okay? So I'm not proving this here, but this turns out to be the way you compute the Euclidean distance between an example and the decision boundary, okay? Um, and this is for the positive example. I guess more generally we're going to define the geometric margin to be equal to gamma(i) = y(i) (w transpose x(i) + b) divided by the norm of w, and this definition applies to positive examples and to negative examples. And so the relationship between the geometric margin and the functional margin is that the geometric margin equals the functional margin divided by the norm of w.
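Here's that relationship as code; the weights and the example point are invented, chosen so the distance comes out round:

```python
def norm(w):
    return sum(wi * wi for wi in w) ** 0.5

def geometric_margin(w, b, x, y):
    """gamma(i) = y(i) * (w^T x(i) + b) / ||w||: the functional margin
    divided by ||w||, i.e. the signed Euclidean distance from x(i)
    to the hyperplane w^T x + b = 0."""
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm(w)

# ||w|| = 5, and this point sits 25/5 = 5 units from the boundary
print(geometric_margin([3.0, 4.0], 0.0, [3.0, 4.0], 1))  # 5.0

# Unlike the functional margin, it is invariant to rescaling (w, b):
print(geometric_margin([6.0, 8.0], 0.0, [3.0, 4.0], 1))  # still 5.0
```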
[01:15:52] Finally, the geometric margin with respect to the training set: we're again going to use this worst-case notion. Look through all your training examples and pick the worst possible training example, and that is your geometric margin on the training set. Oh, and so I hope the notation is clear, right: gamma-hat was the functional margin and gamma is the geometric margin. [01:17:06] And so what the optimal margin classifier does is choose the parameters w and b to maximize the geometric margin, okay? So in other words, this is the optimal margin classifier; it's the baby SVM, the SVM for linearly separable data, at least for today. So the optimal margin classifier would choose that straight line, because that straight line maximizes the distance, or maximizes the geometric margin, to all of these examples. Now, how do you pose this mathematically?
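In symbols, the problem being posed here (and, skipping the rewriting steps that are deferred to the lecture notes, the standard convex form it is eventually turned into) can be written as:

```latex
% maximize the worst-case geometric margin
\max_{\gamma,\, w,\, b} \ \gamma
\quad \text{s.t.} \quad
\frac{y^{(i)}\left(w^{\top} x^{(i)} + b\right)}{\lVert w \rVert} \ \ge\ \gamma,
\quad i = 1, \dots, m

% equivalent convex reformulation (after exploiting the rescaling freedom)
\min_{w,\, b} \ \tfrac{1}{2}\,\lVert w \rVert^{2}
\quad \text{s.t.} \quad
y^{(i)}\left(w^{\top} x^{(i)} + b\right) \ \ge\ 1,
\quad i = 1, \dots, m
```

The second form is the one solved in practice, since it's a quadratic program with linear constraints.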
[01:18:07] But I'll just describe the beginning step and the last step, and leave the in-between steps to the lecture notes. [01:18:14] It turns out that one way to pose this problem is: maximize, over gamma, w, and b, the value of gamma — so you want to maximize the geometric margin — subject to [01:18:43] the constraint that every training example must have geometric margin greater than or equal to gamma, right? So you want gamma to be as big as possible, subject to every single training example having at least that margin. [01:18:59] This causes you to maximize the worst-case geometric margin. And it turns out that, in this form, this isn't a convex optimization problem, so it's difficult to solve — you can't just, you know, run gradient descent on it, there's no off-the-shelf software for it, and so on. [01:19:13] But it turns out that by a few steps of rewriting you can reformulate this problem into an equivalent problem.
[01:19:21] The equivalent problem is to minimize the norm of w subject to a constraint on the functional margin. [01:19:34] And so — I hope this problem makes sense, right? This problem is just, you know: solve for w and b to make sure that every example has geometric margin greater than or equal to gamma, and you want gamma to be as big as possible. [01:19:46] So that's a way to write down the optimization problem that says: maximize the geometric margin. And what we show in the lecture notes is that through a few steps you can rewrite this optimization problem into the following equivalent form, which is to minimize the norm of w subject to this. [01:20:05] And maybe one piece of intuition to take away is that, you know, the smaller w is, the bigger the margin — the less of a normalizing division effect you have, right? But the details are given in the lecture notes. [01:20:20] Okay, this turns out to be a convex optimization problem.
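To see numerically that the two formulations agree, here's a small Python sketch on toy made-up data (a crude angle search stands in for the convex solver — not how you'd solve a real SVM): form 1 maximizes the geometric margin directly, while form 2 rescales (w, b) so the worst functional margin is 1 and reads the margin off as 1/||w||.

```python
import math

# Toy linearly separable 2-D data (hypothetical, not from the lecture)
X = [(2.0, 2.0), (3.0, 1.0), (0.0, 0.0), (-1.0, 1.0)]
Y = [1, 1, -1, -1]

def geom_margin(w, b):
    # worst-case geometric margin of the line w . x + b = 0 on (X, Y)
    return min(y * (w[0] * x[0] + w[1] * x[1] + b)
               for x, y in zip(X, Y)) / math.hypot(*w)

def best_b(w):
    # for a fixed direction w, the margin-maximizing intercept centers the
    # boundary between the closest positive and negative projections
    pos = min(w[0] * x[0] + w[1] * x[1] for x, y in zip(X, Y) if y == 1)
    neg = max(w[0] * x[0] + w[1] * x[1] for x, y in zip(X, Y) if y == -1)
    return -(pos + neg) / 2.0

# Form 1: maximize the geometric margin directly (search over unit directions)
gamma_star, w, b = max(
    (geom_margin((math.cos(t), math.sin(t)), best_b((math.cos(t), math.sin(t)))),
     (math.cos(t), math.sin(t)),
     best_b((math.cos(t), math.sin(t))))
    for t in (i * 2 * math.pi / 3600 for i in range(3600))
)

# Form 2: rescale (w, b) so the worst functional margin is 1; then the
# geometric margin is exactly 1 / ||w||, which is what the convex problem
# "minimize ||w||^2 subject to y_i (w . x_i + b) >= 1" maximizes.
fm = min(y * (w[0] * x[0] + w[1] * x[1] + b) for x, y in zip(X, Y))
w_scaled = (w[0] / fm, w[1] / fm)
print(gamma_star, 1.0 / math.hypot(*w_scaled))  # both ≈ 1.41421 (= sqrt(2))
```

The point of the reformulation is exactly this last line: once the functional margin is pinned to 1, maximizing the margin and minimizing ||w|| are the same problem, and the latter is convex.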
[01:20:24] And if you optimize this, then you will have the optimal margin classifier. There are very good numerical optimization packages to solve this optimization problem, and if you give this a data set then, you know — assuming your data is separable; we'll fix that assumption when we reconvene next week — you have the optimal margin classifier, which will be the baby SVM. And when we add kernels to it, then you have the full support vector machine. [01:20:46] All right, let's break for the day. See you guys.

================================================================================ LECTURE 007 ================================================================================ Lecture 7 - Kernels | Stanford CS229: Machine Learning Andrew Ng (Autumn 2018) Source: https://www.youtube.com/watch?v=8NYoQiRANpg --- Transcript

[00:00:03] All right, good morning, let's get started. So today you'll see the support vector machine algorithm, and this is one of my favorite algorithms because it's a very turnkey solution to classification problems. [00:00:24] So in particular, I'll talk a bit more about the optimization problem you have to solve in the support vector machine, then talk
[00:00:32] about something called the representer theorem, and this will be a key idea for how we'll work in potentially very high-dimensional — like 100,000-dimensional, or a billion-dimensional, or 100-billion-dimensional, or even infinite-dimensional — feature spaces. [00:00:46] It will teach you how to represent feature vectors, and how to represent parameters that may be, you know, a hundred billion-dimensional or a hundred trillion-dimensional or infinite-dimensional. [00:01:00] And based on this we'll derive kernels, which are the mechanism for working in these incredibly high-dimensional feature spaces, and then hopefully, time permitting, wrap up with a few examples of concrete implementations of these ideas. [00:01:16] So to recap: last Wednesday we started to talk about the optimal margin classifier, which said that given a data set that looks like this, you want to find the decision boundary with the
greatest possible [00:01:30] geometric margin, right? So the geometric margin can be calculated by this formula — and this is just the derivation in the lecture notes, you know, measuring the distance to the nearest point — and for now let's assume the data can be separated by a straight line. [00:01:49] And so gamma^(i) — this is, sort of, the geometry derivation in the lecture notes — this is the formula for computing the distance from the example (x^(i), y^(i)) to the decision boundary governed by the parameters w and b. And gamma is the worst-case geometric margin, right: [00:02:13] of all of your m training examples, which one has the worst possible geometric margin? And so the optimal margin classifier will try to make this as big as possible. [00:02:26] And by the way, what you'll see later on is that the support vector machine is basically this algorithm — the optimal margin classifier —
[00:02:32] plus kernels, meaning we'll take this idea and apply it in a hundred-billion-dimensional feature space; that's the support vector machine, okay? [00:02:43] Um, so one thing I didn't have time to talk about on Wednesday was the derivation of this optimization problem — where does this optimization objective come from? So let me just go over that very briefly. [00:03:00] The way we motivated these definitions was to say that, given a training set, you want to find the decision boundary, parametrized by w and b, that maximizes the geometric margin, right? [00:03:12] And so, again, your classifier outputs g(w^T x + b), and so you want to find the parameters w and b — they define the decision boundary where your classification switches from positive to negative — that maximize the geometric margin. [00:03:30] And so one way to pose this as an optimization problem is, let's see, to try to find the biggest
[00:03:39] possible value of gamma, subject to [00:03:53] the constraint that the geometric margin must be greater than or equal to gamma, right? So in this optimization problem, the parameters you get to fiddle with are gamma, w, and b, and if you solve this optimization problem then you are finding the values of w and b that define the straight line that is the decision boundary. [00:04:17] So this constraint says that every example, right — [00:04:25] every example has geometric margin greater than or equal to gamma; that's what it's saying — and you want to set gamma as big as possible, which means that you're maximizing the worst-case geometric margin. [00:04:42] Does this make sense? So the only way to make gamma, say, 17 or 20 or whatever, is if every training example has geometric margin bigger than 17, right? [00:04:55] And so this optimization problem is trying to find w and b to drive gamma up as big as
possible, and have every [00:05:02] example have geometric margin even bigger than gamma. So this optimization problem causes you to find w and b with as big a geometric margin as possible — as big a worst-case geometric margin as possible — okay? [00:05:21] So does this make sense? Actually, raise your hand if this makes sense. Oh, okay, well, many of you. All right, let me say this in a slightly different way. [00:05:33] Um, let's see: if a few of your training examples' geometric margins are, you know, 17, 2, and 5, [00:05:43] right, then the geometric margin in this case is the worst-case value, 2, right? And so if you are solving an optimization problem where I want — where I want the min over i of gamma^(i) to be as big as possible — [00:06:04] one way to enforce this is to say that gamma^(i) must be greater than or equal to gamma for every possible value of i.
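That 17, 2, 5 example can be checked in a few lines: "maximize gamma subject to gamma^(i) >= gamma for all i" is feasible exactly up to the smallest per-example margin, so the largest feasible gamma is the worst-case value, 2.

```python
margins = [17.0, 2.0, 5.0]  # per-example geometric margins from the lecture

def feasible(gamma):
    # the constraint set: every example's margin must be at least gamma
    return all(g >= gamma for g in margins)

# the largest feasible gamma (searched on a coarse grid) is min(margins)
candidates = [g / 100.0 for g in range(0, 2001)]
best = max(g for g in candidates if feasible(g))
print(best)  # → 2.0
```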
[00:06:09] And then I'm going to lift gamma up as much as possible, right? Because the only way to lift gamma up, subject to this, is if every value of gamma^(i) is bigger than it. And so lifting gamma up — maximizing gamma — effectively maximizes the worst-case example's geometric margin, which is how we've defined this optimization, okay? [00:06:37] And then the last step, to turn this problem into the one on the left, is this interesting observation that you might remember from when we talked about the functional margin — which is the numerator here — that, you know, with the functional margin you can scale w and b by any number and the decision boundary stays the same. [00:06:59] And so, you know, if your classifier is — so this is g(w^T x + b), right — so if w [00:07:12] was the vector (2, 1), let's say that's the classifier, right, then you can take w and b and
[00:07:28] multiply them by any number you want — I can multiply this by 10 — and the point is, it's the same straight line, right? [00:07:38] So if I take — let's see, with this w = (2, 1) and b = -2, this actually defines a decision boundary that looks like that: if this is x1 and this is x2, then this is the equation of the straight line where w^T x + b = 0 — right, the intercepts are 1 and 2. You can verify for yourself: if you plug in this point, then w^T x + b equals 0, and if you plug in that point, again w^T x + b equals 0. [00:08:16] And so that's the decision boundary, [00:08:17] where the SVM will predict positive everywhere up here and predict negative everywhere to the lower left. And this straight line, you know, stays the same no matter how you multiply these parameters by any constant, okay? [00:08:38] And so, to simplify this, notice that you can choose anything you want for the norm of w, right — just by scaling this by a factor of ten
[00:08:47] you can increase it, or scale by a factor of one over ten and you can decrease it. So you have the flexibility to scale the parameters w and b, you know, up or down by any fixed constant without changing the decision boundary. [00:08:58] And so the trick to simplify this equation into that one is: choose to scale w so that the norm of w is equal to 1 over gamma. Because if you do that, then this optimization objective becomes [00:09:30] maximize 1 over the norm of w, subject to the constraint, [00:09:44] right — if you substitute norm of w equals 1 over gamma, then that cancels out — and so you end up with this optimization problem: instead of maximizing 1 over the norm of w, you minimize one half the norm of w squared, subject to this, okay? [00:10:13] And so that's — I know I did this relatively quickly; again, as usual, the full derivation is written in the lecture notes, but hopefully this gives you a flavor for why, if you solve this optimization problem, minimizing over w and b, you are solving for the parameters w and b that give you the optimal margin classifier, okay?
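A quick Python sketch of this scaling invariance (a made-up test point, with w = (2, 1), b = -2 as on the board): rescaling (w, b) by any positive constant c leaves the prediction and the geometric margin unchanged, while the functional margin scales by c — which is exactly the freedom the 1/gamma normalization exploits.

```python
import math

def predict(w, b, x):
    # classifier: the sign of w . x + b
    s = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if s >= 0 else -1

def functional_margin(w, b, x, y):
    return y * (sum(wj * xj for wj, xj in zip(w, x)) + b)

def geometric_margin(w, b, x, y):
    return functional_margin(w, b, x, y) / math.hypot(*w)

w, b = [2.0, 1.0], -2.0          # the (2, 1), b = -2 classifier from the board
x, y = [2.0, 1.0], 1             # a made-up positive example
for c in [1.0, 10.0, 0.1]:       # rescale (w, b) by any constant c > 0
    wc, bc = [c * wj for wj in w], c * b
    print(predict(wc, bc, x),                # prediction: unchanged
          functional_margin(wc, bc, x, y),   # scales by c
          geometric_margin(wc, bc, x, y))    # invariant
```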
[00:10:35] Now, the optimal margin classifier — we've been deriving this algorithm as if, you know, the features x^(i) — let's see — we've been deriving this algorithm as if the features x^(i) are some reasonable-dimensional feature vector, x in R^2, or R^100 or something. [00:10:59] Um, what we will talk about later is the case where the features x^(i) become, you know, 100-trillion-dimensional, right, or infinite-dimensional. [00:11:15] And what we will assume is that w can be represented as a sum — as a linear combination — of the training examples. [00:11:31] So, um, in order to derive the support vector machine, we're going to make an additional restriction that the parameters w can be expressed as a linear combination of the training examples, right? And it turns out that when x^(i) is, you know, 100-trillion-dimensional, doing this will
let us [00:11:48] derive algorithms that work even in these hundred-trillion- or infinite-dimensional feature spaces. [00:11:54] Now, I'm describing this just as an assumption; it turns out that there's a theorem called the representer theorem that shows that you can make this assumption without losing any performance. [00:12:06] The proof of the representer theorem is quite complicated and I don't want to do it in this class; the proof of why you can make this assumption is written in the lecture notes, and it's a pretty long and involved proof, involving a primal-dual optimization, so I won't present the whole proof here — but let me give you a flavor for why this is a reasonable assumption to make, okay? [00:12:25] Oh, and just to make things complicated, later on we'll actually do this, right — so, because y^(i) is always plus or minus one, by convention we're going
[00:12:34] to assume that w can be written as w = sum over i of alpha_i y^(i) x^(i), all right? So in this, the y^(i) is plus or minus 1, right, and this makes some of the math come out a little bit easier. But this is still saying that w can be represented as a linear combination of the training examples, okay? [00:12:54] So, um, let me just describe, less formally, why this is a reasonable assumption — and it's actually not just an assumption; the representer theorem proves that, you know, this is just true at the optimal value of w — but let me convey a couple of intuitions for why this is a reasonable thing to do. I see, yes. [00:13:19] So, um, maybe here's intuition number one, and I'm going to refer to logistic regression. [00:13:30] Right — suppose that you run logistic regression with gradient descent, say stochastic gradient descent. Then you would initialize the parameters to be equal to zero at first, and then for each iteration of stochastic gradient descent, right, you would
update [00:13:47] theta: theta gets updated as theta minus a learning rate times, you know, times x — that is, theta := theta - alpha (h_theta(x^(i)) - y^(i)) x^(i) — okay. [00:14:03] And sorry, here alpha is a learning rate; let me not overload the notation — this alpha has nothing to do with that alpha. [00:14:07] But so this is saying that on every iteration you're updating the parameters theta by adding or subtracting some constant times some training example. And so — kind of a proof by induction, right — if theta starts off at zero, and on every iteration of gradient descent you're adding a multiple of some training example, then no matter how many iterations you run gradient descent, theta is still a linear combination of your training examples, okay? [00:14:39] And again — I keep saying theta; that was really theta_0, theta_1, up to theta_n, right, whereas here we have a b and then w_1 down to w_n. [00:14:50] I know this pen's really bad — if you like — all right, yeah, let me throw these away so they
don't keep [00:15:00] haunting us in the future, okay. Right — so I did this with theta rather than with w, but it turns out, if you work through the algebra, this is a little proof by induction that, you know, as you run logistic regression, after every iteration the parameters theta — or the parameters w — are always a linear combination of the training examples. [00:15:22] And this is also true if you use batch gradient descent: if you use batch gradient descent, then the update rule is this. [00:15:38] And so it turns out you can derive gradient descent for the support vector machine learning algorithm as well — you can derive gradient descent to optimize w subject to this — and you can give a proof by induction, you know, that no matter how many iterations you run of gradient descent, w will always be a linear combination of the training examples. [00:15:54] So that's one intuition for how you might see that assuming w is a linear combination of the training examples is a reasonable assumption.
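This induction is easy to watch happening in code. The sketch below runs stochastic gradient ascent for logistic regression on made-up data (bias term omitted for brevity), while tracking the coefficient beta_i that each training example contributes to theta; at the end, theta equals sum_i beta_i x^(i) to within floating-point error:

```python
import math

# Toy data (hypothetical); labels are 0/1 as in logistic regression
X = [[1.0, 2.0], [2.0, 0.5], [-1.0, -1.0], [-2.0, 1.0]]
Y = [1, 1, 0, 0]
alpha = 0.1                      # learning rate

theta = [0.0, 0.0]               # parameters, initialized to zero
beta = [0.0] * len(X)            # beta_i: coefficient of example i in theta

def h(theta, x):                 # sigmoid hypothesis h_theta(x)
    z = sum(t * xj for t, xj in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

for epoch in range(50):
    for i, (x, y) in enumerate(zip(X, Y)):
        g = alpha * (y - h(theta, x))               # scalar step for this example
        theta = [t + g * xj for t, xj in zip(theta, x)]  # theta += g * x^(i)
        beta[i] += g                                # theta stays in span of the x^(i)

# Reconstruct theta from the tracked coefficients: they match
recon = [sum(beta[i] * X[i][j] for i in range(len(X))) for j in range(2)]
print(theta, recon)
```

Every update adds a multiple of some x^(i), so by induction theta never leaves the span of the training examples — the same argument the lecture makes for w in the SVM.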
[00:16:14] I want to present a second set of intuitions, and this one will be easier if you're good at visualizing high-dimensional spaces, I guess. But let me just give intuition number two, which is — um, let's see. [00:16:35] So, first of all, let's take our example from just now, right? Let's say that the classifier uses this w = (2, 1) and b = -2, right — so this is w and this is b. [00:16:48] Then it turns out that the decision boundary is this, where this intercept is 1 and this is 2. And it turns out that the vector w is always at 90 degrees to the decision boundary, right? This is a fact of, I guess, geometry — or, well, the algebra, right: the vector w, (2, 1) — so the vector w, you know, is sort of two to the right and then one up — [00:17:21] all right, the vector w is always 90 degrees to the decision boundary.
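A quick check of the perpendicularity claim for the board example w = (2, 1), b = -2: take the two intercept points on the line 2 x1 + x2 - 2 = 0, and the direction vector between them has zero dot product with w.

```python
# Two points on the decision boundary 2*x1 + x2 - 2 = 0 (its intercepts)
p, q = (1.0, 0.0), (0.0, 2.0)
w = (2.0, 1.0)
direction = (q[0] - p[0], q[1] - p[1])   # vector along the decision boundary
dot = w[0] * direction[0] + w[1] * direction[1]
print(dot)  # → 0.0, so w is perpendicular to the boundary
```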
where you predict positive from where you predict negative, okay? [00:17:31] And so it turns out that, to take a simple example, let's say you have two training examples, a positive example and a negative example. Then, and this is the linear algebra way of saying this, the vector w lies in the span of the training examples. And the way to picture this is that w sets the direction of the decision boundary, and as you vary b, the position changes: setting different values of b will move the decision boundary back and forth like this, while w pins down the direction. [00:18:24] And just one last example of why this might be true. So, we're going to be working in very, very high-dimensional feature spaces. For this example, let's say you have three features x1, x2, x3, and then later we'll get to where this is like a hundred trillion, right?
and let's [00:18:48] say, for the sake of illustration, that all of your examples lie in the plane of x1 and x2, so let's say x3 is equal to 0. [00:19:00] Okay, so let's say for all of your training examples x3 equals 0. Then the decision boundary, you know, will be some sort of vertical plane that looks like this, right? So this is going to be the plane specified by w transpose x plus b equals 0, where now w and x are three-dimensional. And so the vector w should have w3 equals 0, right? If one of the features is always 0, always fixed, then w3 should be equal to 0, and that's another way of saying that the vector w should be representable in the span of just the two features, and in this case in the span of the training examples, okay? [00:19:57] All right, I'm not sure if either intuition one or intuition two convinced you; I think hopefully that's good enough.
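The two facts in this intuition, that w is at 90 degrees to the decision boundary, and that w picks up no component along a direction the training data never uses, can be checked numerically. This is a small illustrative sketch in numpy with made-up points, not anything from the course materials:

```python
import numpy as np

# Sketch (not course code) using the lecture's example w = [2, 1], b = -2.
w = np.array([2.0, 1.0])
b = -2.0

# Two points on the decision boundary {x : w.x + b = 0}:
p1 = np.array([1.0, 0.0])   # 2*1 + 1*0 - 2 = 0
p2 = np.array([0.0, 2.0])   # 2*0 + 1*2 - 2 = 0

# A direction along the boundary; its dot product with w is 0,
# so w really is perpendicular to the boundary.
d = p2 - p1
print(w @ d)   # 0.0

# Span intuition: if every (hypothetical) training example has x3 = 0,
# then any w = sum_i alpha_i y_i x_i built from them also has w3 = 0.
X = np.array([[1.0, 2.0, 0.0],
              [3.0, 1.0, 0.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.7, 0.4])
w3d = (alpha * y) @ X
print(w3d)   # third component is 0.0
```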
[00:20:05] But the second intuition will be easier if you're used to thinking about very high-dimensional feature spaces. And again, the formal proof of this result, which is called the representer theorem, is given in the lecture notes, but it's, well, actually among the more complicated ones; the full formal derivation of this result is definitely at the high end in terms of complexity. [00:20:42] So let's assume that w can be written as follows. So the optimization problem was this: you want to solve for w and b so that the norm of w squared is as small as possible, subject to y^(i) times (w transpose x^(i) plus b) being greater than or equal to 1 [00:21:06] for every value of i. So let's see, the norm of w squared, this is just equal to w transpose w, and so if you plug in this definition of w, you know, into these equations, you have as the optimization objective min of 1/2, sum from i equals 1 through m, and so on; so this is w transpose w,
[00:22:00] which is equal to, I guess, sum over i, sum over j, of alpha i alpha j y^(i) y^(j), and then x^(i) transpose x^(j). [00:22:16] And so this is an inner product between x^(i) and x^(j), and I'm going to write it using this notation: so ⟨x, z⟩ equals x transpose z is the inner product between two vectors. This is maybe an alternative notation for writing inner products, and when we derive kernels you'll see that expressing your algorithm in terms of inner products between the features x is the key practical step needed to derive kernels. And we'll use this slightly different open-angle-bracket, close-angle-bracket notation to denote the inner product between two different feature vectors. [00:23:03] So that is the optimization objective. Oh, and then this constraint, it becomes something else; I guess this becomes, what is it, y^(i) times w
transpose x^(i) plus b, greater than or equal to one. [00:23:31] And again this simplifies if you just multiply this out. [00:23:48] So, just to make sure the mapping is clear (all these pens are, like, dying), all right: that becomes this, and this becomes that, okay? [00:24:17] And the key property we're going to use is that if you look at these two equations, in terms of how we've posed the optimization problem, the only place that the feature vectors appear is in this inner product. And it turns out, when we talk about the kernel trick, when we talk about the application of kernels, that if you can compute this very efficiently, that's when you can get away with manipulating even infinite-dimensional feature vectors. We'll get to this in a second, but the reason we want to write the whole algorithm in terms of inner products is that there will be important cases where the feature vectors are a hundred trillion dimensional, but you can
compute it, [00:25:01] or in fact even infinite dimensional, and you can compute the inner product very efficiently without needing to loop over, you know, the 100 trillion elements in an array, right? And we'll see exactly how to do that very shortly. [00:25:32] So, all right, it turns out that we've now expressed the whole optimization algorithm in terms of these parameters alpha, defined here, and b. So instead of the parameters theta, the parameters we now need to optimize over are the alphas. It turns out that, by convention, in the way that you'll see support vector machines referred to, you know, in research papers or in textbooks, there's a further simplification of that optimization problem: you can simplify it to this. The derivation to go from that to this is again relatively complicated, but it turns out you can further simplify the optimization problem I wrote there to this.
[00:26:23] And again, you can copy this down if you want, but it's also written in the lecture notes. By convention, this slightly simplified version of the optimization problem is called the dual optimization problem. [00:26:39] The way to simplify that optimization problem to this one is actually by using convex optimization theory, and again the derivation is written in the lecture notes, but I don't want to do that here. If you want, think of it as doing a bunch more algebra to simplify that problem to this one; candidly, along the way it's a little more complicated than that, but the full derivation is given in the lecture notes. [00:27:05] And so, finally, you know, the way you train, or rather the way you make a prediction: you solve for the alpha i's, and maybe for b, right? So you solve this optimization problem, or that optimization problem, for the alpha i's.
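To make the "only inner products" point concrete, here is a minimal numpy sketch (hypothetical variable names, not the official lecture-notes code): once w = sum_i alpha_i y_i x_i, the 1/2 norm-of-w-squared term can be evaluated purely from the Gram matrix of inner products between training examples.

```python
import numpy as np

# Minimal sketch: with w = sum_i alpha_i y_i x_i, the quantity
# (1/2)||w||^2 equals (1/2) sum_i sum_j alpha_i alpha_j y_i y_j <x_i, x_j>,
# so the data enters only through the Gram matrix of inner products.
def half_norm_sq(alpha, y, X):
    G = X @ X.T          # G[i, j] = <x_i, x_j>
    v = alpha * y
    return 0.5 * v @ G @ v

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))   # 5 made-up examples, 3 features
y = np.array([1.0, -1.0, 1.0, -1.0, 1.0])
alpha = rng.random(5)             # placeholder alphas, not dual solutions

w = (alpha * y) @ X               # the assumed representation of w
print(np.isclose(half_norm_sq(alpha, y, X), 0.5 * (w @ w)))   # True
```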
[00:27:36] And then, to make a prediction, you need to compute h_{w,b}(x) for a new test example, which is g of w transpose x plus b. But because of the definition of w, this is equal to g of (sum over i of alpha i y^(i) x^(i)) transpose x, plus b, because this is w, and so that's equal to g of sum over i of alpha i y^(i) times the inner product between x^(i) and x, plus b. [00:28:21] And so, once again, once you have stored the alphas in your computer's memory, you can make predictions using just inner products, right? So the entire algorithm, both the optimization objective you need during training as well as how you make predictions, is expressed only in terms of inner products. [00:28:46] So we're now ready to apply kernels. Sometimes in machine learning people call this the kernel trick, and let me just give you the recipe for what this means. [00:29:02] Step one is to write your algorithm in terms of ⟨x^(i), x^(j)⟩, in terms of inner products.
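The prediction rule just described, h(x) = g(sum_i alpha_i y_i ⟨x_i, x⟩ + b), can be sketched as follows; this is illustrative numpy with placeholder values for alpha and b rather than actual solutions of the dual problem:

```python
import numpy as np

# Sketch of prediction using only inner products with the training
# examples; g thresholds at zero. alpha and b are placeholders here.
def predict(x_new, alpha, y, X_train, b):
    score = (alpha * y) @ (X_train @ x_new) + b   # only inner products
    return 1 if score >= 0 else -1

X_train = np.array([[2.0, 0.0],
                    [0.0, 2.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])
b = 0.0

# Equivalent to sign(w.x + b) with w = sum_i alpha_i y_i x_i:
w = (alpha * y) @ X_train
x = np.array([3.0, 1.0])
print(predict(x, alpha, y, X_train, b), np.sign(w @ x + b))
```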
[00:29:18] And instead of carrying the superscripts, you know, x^(i), x^(j), I'm sometimes going to write the inner product between x and z, where x and z are supposed to be proxies for two different training examples, example i and example j; it just simplifies the notation a little bit. [00:29:31] Step two is to let there be some mapping from your original input features x to some higher-dimensional set of features phi of x. And so one example would be: let's say you're trying to predict whether a particular house will be sold in the next month, so maybe x in this case is the size of the house, right? Maybe x is the size of a house, and so you could take this 1D feature and expand it to a high-dimensional feature vector with x, x squared, x cubed, x to the fourth, right? So this would be one way of defining a high-dimensional feature mapping.
[00:30:31] Well, another one could be: if you have two features x1 and x2, corresponding to the size of the house and the number of bedrooms, you can map this to a different phi of x, which maybe is x1, x2, x1 times x2, x1 squared x2, x1 x2 squared, and so on; kind of a polynomial set of features, or maybe some other set of features as well, okay? [00:30:54] And what we'll be able to do is work with feature mappings phi of x where the original input x may be 1D or 2D or whatever, and phi of x could be, you know, a hundred thousand dimensional or infinite dimensional, but we'll be able to do this very efficiently; yes, even infinite dimensional. Okay, so we'll get to some concrete examples of this later, but I want to give you the overall recipe. [00:31:28] And then step three is to find a way to compute K of x comma z equals phi of x transpose phi of z. So this is called a kernel function.
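The feature mappings and the kernel function from steps two and three can be sketched like this (function names such as phi_1d are illustrative; the kernel here just forms phi explicitly, which is exactly the expensive computation the clever tricks discussed later avoid):

```python
import numpy as np

# Sketch of the example feature mappings: a 1-D house size expanded
# into polynomial features, and 2-D (size, bedrooms) into cross terms.
def phi_1d(x):
    return np.array([x, x**2, x**3, x**4])

def phi_2d(x1, x2):
    return np.array([x1, x2, x1 * x2, x1**2 * x2, x1 * x2**2])

# Step three's kernel K(x, z) = phi(x)^T phi(z), computed naively here
# by materializing phi; efficient kernels avoid building phi at all.
def kernel_1d(x, z):
    return phi_1d(x) @ phi_1d(z)

print(phi_1d(2.0))          # powers of 2: 2, 4, 8, 16
print(kernel_1d(1.0, 2.0))  # 2 + 4 + 8 + 16 = 30.0
```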
[00:31:49] And what we're going to do is, we'll see that there are clever tricks so that you can compute the inner product between x and z even when phi of x and phi of z are incredibly high dimensional, right? We'll see an example of this very, very soon. [00:32:01] And then step four is to replace ⟨x, z⟩ in the algorithm with K(x, z), okay? Because if you can do this, then what you're doing is running the whole learning algorithm on this high-dimensional set of features. And the problem with, you know, swapping out x for phi of x is that it can be very computationally expensive if you're working with hundred-thousand-dimensional feature vectors, right? I mean, even by today's standards, a hundred thousand, yeah, it's not the biggest I've seen, I've actually seen bigger, but by today's standards a hundred thousand features is actually quite a lot, and even if it's, say, just a thousand, this is a large number of
[00:32:59] features, I guess. And the problem with using these is that it's quite computationally expensive to carry around these hundred-thousand-dimensional, or imagine hundred-million-dimensional, feature vectors or whatever; but that's what you would do if you were to swap in phi of x for x in the naive, straightforward way. What we'll see is that if you can compute K of x comma z, then, because you've written your whole algorithm just in terms of inner products, you don't ever need to explicitly compute phi of x; you can always just compute these kernels. [00:33:48] We'll get to that later; I won't go over some of the kernels that will be talked about on Wednesday. [00:33:56] Yeah, so, I think the no free lunch theorem is a fascinating theorem, or concept, but I think it's been, I don't know, less useful actually, because I think we have inductive biases that turn out to be useful.
there's a famous theorem in [00:34:10] there's a there's a famous theorem in learning theory called no free lunch was [00:34:12] learning theory called no free lunch was like 20 years ago dad basically says [00:34:14] like 20 years ago dad basically says that in the worst case learning [00:34:16] that in the worst case learning algorithms do not work for any learning [00:34:19] algorithms do not work for any learning algorithm I can come up with some data [00:34:20] algorithm I can come up with some data distribution so that your learning [00:34:21] distribution so that your learning algorithm stops that that's roughly the [00:34:23] algorithm stops that that's roughly the no free lunch to ever improved about [00:34:24] no free lunch to ever improved about like 20 years ago but it turns out most [00:34:26] like 20 years ago but it turns out most of the world most the time the universe [00:34:27] of the world most the time the universe is not that hostile to all that so so [00:34:30] is not that hostile to all that so so yeah that's the learning I was turned [00:34:32] yeah that's the learning I was turned out okay all right let's go through one [00:34:41] out okay all right let's go through one example of kernels so for this example [00:34:44] example of kernels so for this example let's say that your original input [00:34:46] let's say that your original input features was three dimensional X 1 X 2 X [00:34:49] features was three dimensional X 1 X 2 X 3 and let's say I'm gonna choose the [00:34:52] 3 and let's say I'm gonna choose the feature mapping Phi of X to be all so [00:34:56] feature mapping Phi of X to be all so pairwise monomial terms so I'm gonna [00:34:59] pairwise monomial terms so I'm gonna choose X 1 times X 1 X 1 X 2 X 1 X 3 X 2 [00:35:05] choose X 1 times X 1 X 1 X 2 X 1 X 3 X 2 X 1 okay and there are a couple [00:35:16] X 1 okay and there are a couple duplicates that X 1 X 3 is equal to X 3 [00:35:18] duplicates that X 1 X 3 is equal to X 3 X 1 
[00:35:20] But I'm going to write it out this way anyway. And so notice that if x is in R^n, then phi of x is in R^{n squared}, right? So the three-dimensional features become nine-dimensional, and I'm using small numbers for illustration; in practice, think of x as a thousand-dimensional, and so this is now a million; well, think of this as maybe ten thousand, and this is now like a hundred million, okay? So n-squared features, this is much bigger. And then, similarly, phi of z is going to be z1 z1, z1 z2, and so on. [00:36:10] So we've gone from n features, like 10,000 features, to n-squared features, in this case a hundred million features. Um, so because there are n-squared elements, you would need order n-squared time to compute phi of x, or to compute phi of x transpose phi of z explicitly, right? Say we want to compute the inner product between phi of x and phi of z, and we do it explicitly in the obvious way; that'll take n-squared
time, [00:36:54] just to compute all of these products and then, you know, add them all up, right? And it's actually n-squared over 2, because a lot of these things are duplicated, but that's order n squared. [00:37:15] But let's see if we can find a better way to do that. So what we want is to write out the kernel K of x comma z, so this phi of x transpose phi of z, and what I'm going to prove is that this can be computed as x transpose z, squared. And the cool thing is that, remember, x is n-dimensional and z is n-dimensional, so x transpose z squared is an order-n-time computation, right? Because taking x transpose z, you know, that's just an inner product of two n-dimensional vectors, and then you take that number (x transpose z is a real number) and you just square it. So that's an order-n-time computation. [00:38:08] And so let me just prove
that x transpose z squared is equal to this. [00:38:12] Well, let me prove this step, right? So x transpose z, squared, that's equal to, right, so this is x transpose z, and then times, this is also x transpose z; so this formula says x transpose z squared is x transpose z times itself. And then, if I rearrange the sums, this is equal to sum from i equals 1 through n, sum from j equals 1 through n, of x_i z_i x_j z_j, and this in turn is, you know, sum over i, sum over j, of x_i x_j times z_i z_j. [00:39:15] And so what this is doing is marching through all possible pairs of i and j and multiplying x_i x_j with the corresponding z_i z_j and adding that all up. But of course, if you were to compute phi of x transpose phi of z, what you would do is take this and multiply it with that and then add it to the sum, then take this and multiply it with that and add it to the sum, and so on, until you end up taking this and multiplying it with that and adding
it to your sum, right? [00:39:55] So that's why this formula is just, you know, marching down these two lists and multiplying, multiplying, multiplying, and adding it all up, [00:40:12] which is exactly phi of x transpose phi of z, okay? So this proves that you've turned what was previously an order-n-squared-time calculation into an order-n-time calculation, which means that if n was 10,000, then instead of needing to manipulate hundred-million-dimensional vectors to come up with these results (is my phone buzzing? It's really loud, okay), instead of needing to manipulate hundred-million-dimensional vectors, you could do so while manipulating only 10,000-dimensional vectors. [00:40:58] Now, a few other examples of kernels. [00:41:11] It turns out that if you choose this kernel, so let's see, we had K of x comma z equals x transpose z, squared, and we now add a plus c there, where c is a constant; so c is just some fixed real number.
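The identity just proved, that the inner product of phi(x) and phi(z) equals (x transpose z) squared for the pairwise-product map, is easy to confirm numerically; this is a sketch with made-up vectors:

```python
import numpy as np

# Check that the O(n^2) explicit computation and the O(n) kernel
# K(x, z) = (x^T z)^2 agree, for phi(x) = all pairwise products.
def phi(x):
    return np.outer(x, x).ravel()   # n^2-dimensional

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])

slow = phi(x) @ phi(z)   # explicit inner product in R^{n^2}
fast = (x @ z) ** 2      # just an n-dimensional dot product, squared
print(slow, fast)        # both 20.25
```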
That corresponds to modifying your features as follows: instead of just these, you know, pairwise terms - pairs of these things - if you add plus C there, it corresponds to adding x1, x2, x3 to your set of features. [00:41:48] Technically there's actually a weighting on this: it's actually root 2C x1, root 2C x2, root 2C x3, and then a constant C here as well - you can prove this yourself. And it turns out that if this is your new definition for Phi of X, and you make the same change to Phi of Z - you know, it's root 2C Z1 and so on - then if you take the inner product of these, it can be computed as this, right? [00:42:14] And so the role of the constant C is that it trades off the relative weighting between the second-degree terms, the xi xj, compared to the first-degree terms like x1 or x2 or x3. [00:42:32] Other examples: if you choose this to the power of d, notice that this is still an order n time computation.
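The weighted feature map just described - pairwise terms x_i x_j, first-degree terms scaled by root 2C, plus a constant C - can be verified the same way (illustrative sketch):

```python
import numpy as np

def phi_c(x, c):
    # Feature map for K(x, z) = (x^T z + c)^2: all pairwise products
    # x_i x_j, the first-degree terms scaled by sqrt(2c), plus a
    # constant feature c (the constants' product contributes c^2).
    return np.concatenate([np.outer(x, x).ravel(),
                           np.sqrt(2 * c) * x,
                           [c]])

rng = np.random.default_rng(0)
x, z, c = rng.standard_normal(4), rng.standard_normal(4), 2.5

assert np.isclose(phi_c(x, c) @ phi_c(z, c), (x @ z + c) ** 2)
```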
Right - X transpose Z takes order n time; you add a number to it; then you take this to the power of d; so you can compute this in order n time. But this corresponds now to a Phi of X whose number of terms turns out to be n plus d choose d - but it doesn't matter - it turns out this contains all features of monomials up to order d. [00:43:17] By which I mean: if, let's say, d is equal to five, right, then Phi of X contains all the features of the form x1 x2 x5 x17 x29 - right, this is a fifth-degree thing - or x1 x2 squared x3 x18 - this is also a fifth-order monomial. [00:43:46] And so if you choose this as your kernel, this corresponds to constructing Phi of X to contain all of these features, and there are exponentially many of them, right - all of these features, in any order. Although these
are called monomials - basically all the polynomial terms, all the monomial terms, up to a fifth-order monomial term. And it turns out there are n plus d choose d of them, which is roughly n plus d to the power of d - very roughly - so this is a very, very large number of features, but your computation doesn't blow up exponentially even as d increases. [00:44:26] So what the support vector machine is, is taking the optimal margin classifier that we derived earlier and applying the kernel trick to it, which we had already derived. So: the optimal margin classifier [00:44:49] plus the kernel trick, right - that is the support vector machine. [00:44:58] And so if you choose some of these kernels, for example, then you could run an SVM in these very, very high-dimensional feature spaces - in these, you know, hundred-trillion-dimensional feature spaces - but your computational time scales only linearly in order n, the dimension of your input features X.
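The n plus d choose d count grows enormously even for modest d, while evaluating the kernel stays order n. A quick check with Python's exact binomial (illustrative):

```python
from math import comb

n = 10_000  # input dimension
for d in (2, 3, 5):
    # Number of monomials of degree <= d in n variables.
    print(f"d={d}: {comb(n + d, d):,} features")

# Evaluating K(x, z) = (x^T z + c)^d itself still costs only
# O(n) arithmetic, no matter how large this feature count gets.
```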
Rather than as a function of this hundred-trillion-dimensional feature space, in which you're actually building a linear classifier. [00:45:25] So, um, why is this a good idea? Let me just now show a quick video to give you an intuition for what this is doing. Let's see - okay, I think the projector takes a while to warm up. Any questions while we're waiting? [00:46:00] [Student question about whether each kernel function corresponds to one feature mapping] Yes - so, up to trivial differences, right: if you have a feature mapping where the features that come out are permuted or something, then the kernel function stays the same, so there are trivial transformations like that. But if you have a totally different feature mapping, you would expect to need a totally different kernel function. [00:46:38] So - let's see - oh, cool. So I want to give you a visual picture. [00:46:52] I've wiped this. [00:47:09] All
right, this is a YouTube video that Kian Katanforoosh, who teaches with us, found - so I don't know who originally made it - but there's a nice visualization of what a support vector machine is doing. [00:47:22] So let's say you have a learning algorithm where you're trying to separate the blue dots from the red dots, right? So the blue and the red dots can't be separated by a straight line, but you put them on the plane and you use a feature mapping Phi to throw these points into a much higher-dimensional space - so it's now throwing these points into a three-dimensional space. In this three-dimensional space you can then find W - so W is now three-dimensional - because you apply the optimal margin classifier in this three-dimensional space, and that separates the blue dots and the red dots. [00:47:58] And if you now, you know, examine what this is doing back in the original
space, then your linear classifier actually defines that elliptical decision boundary there, right? [00:48:11] So you're taking the data - all right, so: taking the data, mapping it to a much higher-dimensional feature space (three dimensions in this visualization, but in practice it can be a hundred trillion dimensions), and then finding a linear decision boundary in that hundred-trillion-dimensional space - which is going to be a hyperplane, like a, you know, straight line or a plane - and then, when you look at what you just did in the original feature space, you find a very nonlinear decision boundary. So this is why. [00:48:44] And again, you know, you can only visualize relatively low-dimensional feature spaces, even on a display like that, but you find that if you use an SVM kernel, [00:49:06] all right, you can learn very nonlinear decision boundaries like that - but that is a linear decision boundary in a very high-dimensional space.
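The lift in the video can be reproduced with one common choice of phi - appending the squared radius as a third coordinate (an illustrative sketch; the video's exact mapping may differ):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 200
theta = rng.uniform(0, 2 * np.pi, m)

# Blue dots inside radius 1, red dots in a ring of radius 2 to 3:
# no straight line in the 2D plane separates them.
blue = (rng.uniform(0, 1, m) * np.array([np.cos(theta), np.sin(theta)])).T
red = (rng.uniform(2, 3, m) * np.array([np.cos(theta), np.sin(theta)])).T

def lift(p):
    # phi(x1, x2) = (x1, x2, x1^2 + x2^2): append the squared radius.
    return np.c_[p, (p ** 2).sum(axis=1)]

# In 3D the flat plane x3 = 2 separates the classes; projected back
# down to 2D, that plane is the (nonlinear) circle x1^2 + x2^2 = 2.
assert lift(blue)[:, 2].max() < 2 < lift(red)[:, 2].min()
```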
And when you project it back down to, you know, 2D, you end up with a very nonlinear decision boundary. Okay, all right. [00:49:36] Oh, sure, yes - [Student question: in the high-dimensional space represented by the feature mapping Phi of X, does the data always have to be linearly separable?] So far I've been pretending that it does; I'm coming back to fix that assumption later today. [00:49:54] So, um, now: how do you make kernels, right? So here's some intuition you might have about kernels. If X and Z are similar - you know, if these two examples X and Z are close to each other, or similar to each other - then K of X, Z, which is the inner product between Phi of X and Phi of Z, right, presumably this should be large. And conversely, if X and Z are dissimilar, then K of X, Z, you know, maybe this should be smaller, right? Because, uh, the inner product of two very similar vectors that are pointing in
the same direction should be large, and the inner product of two dissimilar vectors should be small, right? So this is one guiding principle behind, you know, a lot of the kernels you see: if this is Phi of X and this is Phi of Z, the inner product is large; but if they kind of point off in random directions, the inner product will be small, right? That's how vector inner products work. [00:51:19] And so, well, let me just pull a function out of thin air, which is: K of X, Z equals e to the negative norm of X minus Z, squared, over 2 sigma squared, right? [00:51:38] So this is one example - if you think of kernels as a similarity measure of sorts - let's just make up another similarity-measure function. And this does have the property that if X and Z are very close to each other, then this would be e to the 0, which is about 1.
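This Gaussian function is easy to sanity-check as a similarity measure (illustrative sketch):

```python
import numpy as np

def k_gauss(x, z, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

x = np.array([1.0, 2.0])
assert np.isclose(k_gauss(x, x), 1.0)   # identical inputs: e^0 = 1
assert k_gauss(x, x + 10.0) < 1e-20     # far-apart inputs: nearly 0
```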
But if X and Z are very far apart, then this would be small, right? So this function actually satisfies those criteria, and the question is: is it okay to use this as a kernel function? [00:52:15] So it turns out that a function like that, K of X, Z - you can use it as a kernel function only if there exists some Phi such that K of X, Z equals Phi of X transpose Phi of Z, right? [00:52:41] So we derived the whole algorithm assuming this to be true, and it turns out that if you plug in a kernel function for which, you know, this isn't true, then all of the derivation we wrote down breaks down, and the optimization problem, you know, can have very strange solutions, right, that don't correspond to good classification - the whole thing just falls apart. [00:53:00] And so this puts some constraints on what kernel functions we could choose. For example, one thing it must satisfy is K of X comma X, which is Phi of X
transpose Phi of X: this had better be greater than or equal to 0, right? Because the inner product of a vector with itself had better be non-negative. So if K of X, X is ever less than 0, then this is not a valid kernel function, okay? Um, [00:53:28] more generally, there's a theorem that tells you when something is a valid kernel. Let me just outline that proof very briefly, which is: let x1 up to xd, you know, be any d points. And - let's see - okay, sorry about the overloading of notation: so K represents a kernel function, and I'm going to use K to represent a kernel matrix as well. [00:54:10] Sometimes it's also called the Gram matrix, but I'll call it the kernel matrix. So K ij is equal to the kernel function applied to two of those points, xi and xj, right? So you have d points; so you just apply the kernel function to every pair of those points and put them in a matrix - in a big d-by-d matrix like
that. [00:54:37] So it turns out that, given any vector z - I think you've seen something similar to this in problem set one - but given any vector z, z transpose K z, which is: sum over i, sum over j, of z_i K_ij z_j, right? If K is a valid kernel function - so if there is some feature mapping Phi - then this should equal sum over i, sum over j, of z_i Phi(x^(i)) transpose Phi(x^(j)) z_j. [00:55:32] And by a couple of other steps - let's see, this Phi(x^(i)) transpose Phi(x^(j)), I'm going to expand out that inner product - so it's sum over k of element k of Phi(x^(i)) times element k of Phi(x^(j)), with the z_i and z_j multiplied in, and then rearranging sums - actually, sorry, I'm running out of whiteboard. [00:56:14] Rearranging sums, okay: sum over k of sum over i, sum over j, of z_i Phi(x^(i))_k Phi(x^(j))_k z_j, which is sum over k of, in parentheses, sum over i of z_i Phi(x^(i))_k, squared - and therefore this must be greater than or equal to 0.
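Written out cleanly, the chain of equalities in this argument is:

```latex
\begin{aligned}
z^{\top} K z
&= \sum_i \sum_j z_i\, K_{ij}\, z_j
 = \sum_i \sum_j z_i\, \phi(x^{(i)})^{\top} \phi(x^{(j)})\, z_j \\
&= \sum_i \sum_j \sum_k z_i\, \phi(x^{(i)})_k\, \phi(x^{(j)})_k\, z_j
 = \sum_k \Big( \sum_i z_i\, \phi(x^{(i)})_k \Big)^{2} \;\ge\; 0 .
\end{aligned}
```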
And so this proves that the matrix K - [00:57:00] the kernel matrix K - is positive semi-definite, okay? And so, more generally, it turns out that this is also a sufficient condition for a function K to be a valid kernel function. So let me just write this out - this is called Mercer's theorem. [00:57:32] K is a valid kernel function - i.e., there exists Phi such that K of X, Z equals Phi of X transpose Phi of Z - if and only if, for any d points x1 up to xd, [00:58:11] the corresponding kernel matrix is positive semi-definite; so let's write this: K, greater than or equal to, 0. And I proved just one direction of - one direction of this implication, right? This proof outline here shows that if it is a valid kernel function, then the kernel matrix is positive semi-definite. So the algebra we did just now proves that direction of the proof.
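Mercer's theorem suggests a practical numerical test: build the kernel matrix on a handful of points and check that its eigenvalues are non-negative. An illustrative sketch (the "bad" similarity function is just one example of a non-kernel):

```python
import numpy as np

rng = np.random.default_rng(0)
pts = rng.standard_normal((30, 5))

def gram(k, pts):
    # Kernel (Gram) matrix: K_ij = k(x_i, x_j) over all pairs of points.
    return np.array([[k(a, b) for b in pts] for a in pts])

gauss = lambda x, z: np.exp(-np.sum((x - z) ** 2) / 2)
assert np.linalg.eigvalsh(gram(gauss, pts)).min() > -1e-8  # PSD: passes

# Negative distance satisfies K(x, x) = 0 but is NOT a valid kernel:
# its kernel matrix has trace 0, so it must have a negative eigenvalue.
bad = lambda x, z: -np.linalg.norm(x - z)
assert np.linalg.eigvalsh(gram(bad, pts)).min() < -1e-8
```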
improve the reverse dimension but [00:58:49] did improve the reverse dimension but this turns out to be a if and only if [00:58:50] this turns out to be a if and only if condition and so this gives maybe one [00:58:53] condition and so this gives maybe one test for whether or not something is a [00:58:56] test for whether or not something is a valid kernel function okay and it turns [00:59:01] valid kernel function okay and it turns out that the kernel I wrote up there [00:59:04] out that the kernel I wrote up there that one K of X C it turns out this is a [00:59:15] that one K of X C it turns out this is a valid kernel this is called the Gaussian [00:59:16] valid kernel this is called the Gaussian kernel this is a probably the most [00:59:22] kernel this is a probably the most widely use kernel [00:59:24] widely use kernel well actually did what [00:59:36] well actually the most widely-used [00:59:38] well actually the most widely-used kernels is maybe the linear kernel which [00:59:44] kernels is maybe the linear kernel which just uses K of X Z equals X transpose Z [00:59:50] just uses K of X Z equals X transpose Z and so this is using you know Phi of x [00:59:53] and so this is using you know Phi of x equals x right so no no no high [00:59:55] equals x right so no no no high dimensional features so sometimes you [00:59:57] dimensional features so sometimes you call it the linear kernel it just means [00:59:58] call it the linear kernel it just means you're not using a high dimensional [01:00:00] you're not using a high dimensional feature mapping or the future mapping is [01:00:01] feature mapping or the future mapping is just equal to the original features this [01:00:04] just equal to the original features this is this is actually pretty commonly used [01:00:06] is this is actually pretty commonly used kernel function you're not taking [01:00:09] kernel function you're not taking advantage of kernels in other words but [01:00:11] advantage of kernels in other 
words. But after the linear kernel, the Gaussian [01:00:13] kernel is probably the most widely used kernel - the one I wrote out there - and this corresponds to a feature space that is infinite-dimensional, right? [01:00:30] And this Gaussian kernel function actually corresponds to using all monomial features: so if you have, you know, x1, and also x1 x2, and x1 squared x2, and then x1 squared x5 to the 10th, and so on, up to, you know, x1 to the 10,000 and x2 to the 17th - right, whatever. [01:00:50] So this kernel corresponds to using all these polynomial features, without end, going to arbitrarily high dimensions - but giving a smaller weighting to the very, very high-dimensional ones, which is why it works. [01:01:15] Great. So, uh, for now - and then toward the end, I'll give some other examples of kernels. [01:01:20] So it turns out that the kernel trick is more general than the support vector machine. It was really popularized by the support vector machine.
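The "all monomials with decaying weights" description of the Gaussian kernel can be seen from a Taylor expansion: since ||x - z||^2 = ||x||^2 + ||z||^2 - 2 x^T z, the kernel factors as e^(-||x||^2 / 2 sigma^2) e^(-||z||^2 / 2 sigma^2) times the series sum over k of (x^T z / sigma^2)^k / k!, where each (x^T z)^k term is a degree-k polynomial kernel damped by 1/k!. A numerical check (illustrative):

```python
import numpy as np
from math import factorial

x = np.array([0.3, -0.5])
z = np.array([0.2, 0.4])
sigma = 1.0

exact = np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

# The series of exp(x^T z / sigma^2): a factorially damped sum of
# polynomial kernels (x^T z)^k, i.e. of monomial feature products.
series = sum((x @ z / sigma ** 2) ** k / factorial(k) for k in range(20))
approx = (np.exp(-x @ x / (2 * sigma ** 2))
          * np.exp(-z @ z / (2 * sigma ** 2)) * series)

assert np.isclose(exact, approx)
```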
You know, researchers - I guess Vapnik and Cortes - found that applying this kernel trick to the support vector machine makes for a very effective learning algorithm. [01:01:42] But the kernel trick is actually more general: if you have any learning algorithm that you can write in terms of inner products like this, then you can apply the kernel trick to it. And so you'll apply this to a different learning algorithm in the programming assignments as well. [01:01:58] And the way to apply the kernel trick is: take a learning algorithm, write the whole thing in terms of inner products, and then replace them with K of X, Z, for some appropriately chosen kernel function K. [01:02:12] And all of the discriminative learning algorithms we've learned so far can be written in this way, so they can apply the kernel trick: so linear regression, logistic regression,
everything in the generalized linear model family, the perceptron algorithm - all of those algorithms, you can actually apply the kernel trick to. Which means that you can apply linear regression in an infinite-dimensional feature space if you wish. [01:02:38] And later in this class we'll talk about principal components analysis, which some of you may have heard of; and when I talk about principal components analysis, it turns out that's yet another algorithm that can be written only in terms of inner products, and so there's an algorithm called kernel PCA - kernel principal components analysis. If you don't know what PCA is, don't worry about it; we'll get to it later. [01:02:54] But a lot of algorithms can be married with the kernel trick to implicitly apply the algorithm even in an infinite-dimensional feature space, without needing your computer to have infinite amounts of memory or to use an infinite amount of computation.
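As one concrete instance of kernelizing another algorithm, here is a minimal kernelized perceptron sketch (an illustration of the idea, not the version from the programming assignments): the weight vector is kept implicitly as a sum of alpha_i y_i phi(x_i), so training and prediction only ever evaluate the kernel.

```python
import numpy as np

def kernel_perceptron(X, y, k, epochs=10):
    # w is represented implicitly as sum_i alpha_i * y_i * phi(x_i),
    # so w . phi(x) = sum_i alpha_i * y_i * k(x_i, x): kernels only.
    m = len(X)
    alpha = np.zeros(m)
    K = np.array([[k(a, b) for b in X] for a in X])
    for _ in range(epochs):
        for i in range(m):
            if y[i] * np.sum(alpha * y * K[:, i]) <= 0:
                alpha[i] += 1  # mistake: fold this example into w
    return alpha

# XOR is not linearly separable in 2D, but the degree-2 polynomial
# kernel (whose phi includes the cross-term x1*x2) handles it.
X = np.array([[1., 1.], [-1., -1.], [1., -1.], [-1., 1.]])
y = np.array([1., 1., -1., -1.])
poly2 = lambda a, b: (a @ b + 1.0) ** 2
alpha = kernel_perceptron(X, y, poly2)
preds = np.sign([np.sum(alpha * y * [poly2(a, x) for a in X]) for x in X])
assert np.all(preds == y)
```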
But actually, the single place where this is most powerfully applied is the support vector machine. In practice, I guess, the kernel trick is applied all the time in support vector machines, and not that often in other algorithms. [01:03:28] All right. [01:03:37] Are there any questions? [01:03:53] All right, so the last two things I want to do today: um, one is fix the assumption that we had made that the data is linearly separable. [01:04:04] So, you know, sometimes you don't want your learning algorithm to have zero errors on the training set, right? You know, so when you take this low-dimensional data and map it to a very high-dimensional feature space, the data does become much more separable. But it turns out that if your data set is noisy, right - if your data looks like this - you maybe want it to find a decision boundary like that, and you don't want it to try so hard to separate every little example, right, as to
find a really complicated decision boundary like that. [01:04:59] Right — so sometimes, whether in the low-dimensional space or in the high-dimensional space, you don't actually want the algorithm to separate out your data perfectly. And sometimes, even in a high-dimensional feature space, your data may not be linearly separable, and you don't want the algorithm to have zero error on the training set. [01:05:20] And so there's an algorithm called the l1 norm soft margin SVM, which is a modification to the basic algorithm. The basic algorithm was: minimize this over w and b, subject to these constraints. [01:05:53] And what the l1 norm soft margin does is the following. Remember, this is the functional margin — if you divide it by the norm of w, it becomes the geometric margin — so this optimization problem was saying: let's make sure each example has functional margin greater than or equal to 1.
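The board contents aren't captured by the auto-transcript; in the standard CS229 notation, the two margins and the basic (hard-margin) optimal margin problem being referred to are:

```latex
% functional margin of example i, and its geometric counterpart
\hat{\gamma}^{(i)} = y^{(i)}\big(w^\top x^{(i)} + b\big),
\qquad
\gamma^{(i)} = \frac{\hat{\gamma}^{(i)}}{\|w\|}

% basic (hard-margin) optimal margin classifier
\min_{w,\,b}\ \tfrac{1}{2}\|w\|^2
\quad\text{s.t.}\quad y^{(i)}\big(w^\top x^{(i)} + b\big) \ge 1,
\quad i = 1,\dots,m
```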
[01:06:24] In the l1 norm soft margin SVM we're going to relax this: we're going to say that this needs to be bigger than 1 − ξi — that's the Greek letter xi — and then we're going to modify the cost function as follows, where these ξi are greater than or equal to 0. [01:06:48] So remember, if the functional margin is greater than or equal to zero, it means the algorithm has classified that example correctly: so long as this quantity is greater than zero, y(i) and this term have the same sign, either both positive or both negative — that's what it means for a product of two things to be greater than zero; both have to have the same sign. [01:07:16] So if this is bigger than zero, it means it has classified that example correctly, and the SVM is asking for it to not just classify correctly, but to classify correctly with a functional margin of at least 1.
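Written out, the relaxed problem — the l1 norm soft margin SVM — in standard CS229 notation (reconstructed, since the board isn't transcribed):

```latex
\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i
\quad\text{s.t.}\quad
y^{(i)}\big(w^\top x^{(i)} + b\big) \ge 1 - \xi_i,
\qquad \xi_i \ge 0,\quad i = 1,\dots,m
```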
[01:07:29] And if you allow ξi to be positive, then that's relaxing that constraint. But you don't want the ξi's to be too big, which is why you add to the optimization cost function a cost for making ξi too great. And so you optimize this as a function of w, b, and the ξi's. [01:08:00] If you draw a picture, it turns out that in this example, with that being the optimal decision boundary, these three examples will be equidistant from the straight line — because if they weren't, you could fiddle with the straight line to improve the margin a little bit more. It turns out these three examples have functional margin exactly equal to 1, that example over there has functional margin equal to 2, and the further-away examples have even bigger functional margins, and what
this optimization objective is saying is that that's okay. [01:08:39] Everything out here has functional margin at least 1; if an example here has functional margin a little bit less than 1, then by setting ξi to 0.5, say, the optimization is letting me get away with a functional margin a little less than 1. [01:09:05] And one other reason why you might want to use the l1 norm soft margin SVM is the following. Say you have a data set that looks like this. [01:09:20] It seems like that would be a pretty good decision boundary — we've got a lot of examples, so there's a lot of evidence — but if you have just one outlier, say over here, then technically the data set is still linearly separable. [01:09:44] If you really want to separate this data set — sorry, I seem to be killing these pens — if you want to separate out this
data set, you can actually choose that decision boundary. [01:10:01] But the basic optimal margin classifier will allow the presence of one training example to cause this dramatic swing in the position of the decision boundary, because the original optimal margin classifier optimizes for the worst-case margin. The concept of optimizing for the worst-case margin allows one example — by being the worst-case training example — to have a huge impact on your decision boundary. [01:10:27] The l1 norm soft margin SVM allows the SVM to keep the decision boundary closer to the blue line even with this one outlier, and so it makes it much more robust to outliers. [01:10:45] And then, if you go through the representer theorem derivation — representing w as a function of the alphas and so on — it turns out that the problem then simplifies to the following.
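The dual problem being written out — identical to the hard-margin dual except for the extra upper bound on the alphas discussed below — is, in standard CS229 notation:

```latex
\max_{\alpha}\ \sum_{i=1}^{m}\alpha_i
  - \tfrac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}
      y^{(i)} y^{(j)}\,\alpha_i\,\alpha_j\,
      \big\langle x^{(i)}, x^{(j)} \big\rangle
\quad\text{s.t.}\quad
0 \le \alpha_i \le C,
\qquad \sum_{i=1}^{m} \alpha_i\, y^{(i)} = 0
```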
[01:11:08] So what I'm writing out here, after the whole representer theorem calculation, the derivation — this is just what we had previously; I've not changed anything so far, this is exactly what we had. [01:11:40] And it turns out that the only change is that we end up with an additional condition on the alphas. If you go through that simplification, now that you've changed the algorithm with this extra term, then in the new form — this is called the dual form of the optimization problem — the only change is that you end up with this additional condition: the constraint that each alpha is between 0 and C. [01:12:15] And it turns out that today there are very good software packages for just solving that for you. I think once upon a time, when we were doing machine learning, you needed to worry about whether your code for inverting matrices was good enough.
[01:12:28] When code for inverting matrices was less mature, that was one more thing you had to think about; but today, linear algebra packages have gotten good enough that when you invert a matrix, it just inverts the matrix, and you don't have to worry too much about it. [01:12:43] So in the early days of SVMs, solving this problem was really hard and you did worry about your optimization packages; today there are very good numerical optimization packages that just solve this problem for you, and you can call them without worrying about the details too much. [01:12:55] All right — so this is the l1 norm soft margin SVM. This parameter C is something you need to choose; we'll talk on Wednesday about how to choose it, but it trades off how much you want to insist on getting the training examples right versus saying it's okay to misclassify an example once in a while.
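That trade-off is easy to see in a toy sketch. At the optimum the slack is ξi = max(0, 1 − y(i)(wᵀx(i) + b)), so the soft-margin primal is equivalent to minimizing ½‖w‖² + C times a sum of hinge losses, which plain subgradient descent can handle. Everything below — the data, the function name, the step size, the iteration count — is made up for illustration (real SVM solvers work on the dual), but it shows a small C shrugging off a single outlier:

```python
import numpy as np

def train_soft_margin(X, y, C=0.1, lr=0.01, iters=2000):
    """Minimize 0.5*||w||^2 + C * sum_i max(0, 1 - y_i*(w.x_i + b)),
    the hinge-loss form of the l1 soft margin, by subgradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(iters):
        margins = y * (X @ w + b)
        active = margins < 1.0  # examples violating the functional margin
        grad_w = w - C * (y[active][:, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# six clean, well-separated points plus one mislabeled outlier (made-up data)
X = np.array([[ 2.0,  0.0], [ 3.0, 1.0], [ 2.5, -0.5],
              [-2.0,  0.0], [-3.0, 1.0], [-2.5, -0.5],
              [ 4.0,  0.0]])
y = np.array([1, 1, 1, -1, -1, -1, -1])

w, b = train_soft_margin(X, y, C=0.1)
# with a small C the boundary stays near x1 = 0, so the six clean
# points are still classified correctly despite the outlier at (4, 0)
clean_preds = np.sign(X[:6] @ w + b)
```

With a much larger C, the same sketch insists on the outlier and swings the boundary — which is exactly the behavior the soft margin is designed to damp.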
[01:13:21] On Wednesday we'll discuss bias and variance, and how to choose a parameter like C. [01:13:32] All right, so the last thing I would like you to see today is really just a few examples of SVM kernels. It turns out the SVM with a polynomial kernel works quite well — this is K(x, z) = (x transpose z) to the power d; that's called the polynomial kernel — and this one is called the Gaussian kernel, and of these the most widely used one is the Gaussian kernel. [01:14:05] And it turns out — I guess in the early days of SVMs, one of the proof points was this: the machine learning field was doing a lot of work on handwritten digit classification. A digit is a matrix of pixels with values that are 0 or 1, or maybe grayscale values.
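In the usual notation these two kernels are K(x, z) = (xᵀz + c)^d (polynomial; c = 0 gives the plain form) and K(x, z) = exp(−‖x − z‖² / (2σ²)) (Gaussian, also called RBF). A minimal numpy sketch — the function names and default constants are mine, not from the lecture:

```python
import numpy as np

def polynomial_kernel(x, z, d=3, c=1.0):
    """K(x, z) = (x^T z + c)^d -- the polynomial kernel.
    With c = 0 this is the plain (x^T z)^d form."""
    return (np.dot(x, z) + c) ** d

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2)) -- the Gaussian (RBF)
    kernel, which corresponds to an infinite-dimensional feature map."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))
```

Either function can be plugged in wherever the inner product ⟨x, z⟩ appears in the dual, which is the whole point of the kernel trick.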
[01:14:25] Say you take the list of pixel intensity values and list them out — so there's 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, and so on, all the pixel intensity values — then this can be your feature x, and if you feed it to an SVM using either of these kernels, it'll do not too badly at handwritten digit classification. [01:14:50] There's a classic data set called MNIST, which is a classic benchmark in the history of machine learning, and it was a very surprising result many years ago that a support vector machine with a kernel does very well on handwritten digit classification. [01:15:08] In the past several years we've found that deep learning algorithms, mostly convolutional neural networks, do even better than the SVM; but for some time SVMs were the best algorithm, and they're very easy to use — turnkey, without a lot of parameters to fiddle with — so that's the one very nice
property about them. [01:15:26] But more generally, a lot of the most innovative work in SVMs has been in the design of kernels. So here's one example: let's say you want a protein sequence classifier. [01:15:52] Protein sequences are made up of amino acids — a lot of our bodies are made of proteins, and proteins are just sequences of amino acids, and there are 20 amino acids. But in order to simplify the description, and really not worry too much about the biology — I hope the biologists don't get mad at me — I'm going to pretend there are 26 amino acids, because there are 26 letters in the alphabet. [01:16:12] So I'm going to use the letters A through Z to denote amino acids, even though I know there are really only 20; it's just easier to talk about with 26 letters. And so a protein is a sequence of letters, because the protein in your body is made up of the
sequence of amino acids and amino [01:16:36] the sequence of amino acids and amino acids can be very variable 9 something [01:16:39] acids can be very variable 9 something very very long so if you're very short [01:16:40] very very long so if you're very short so the question is how do you represent [01:16:46] the feature X [01:16:50] so it turns out and so the goal is to be [01:16:53] so it turns out and so the goal is to be the input X and make a prediction about [01:16:57] the input X and make a prediction about this particular protein like what is the [01:16:59] this particular protein like what is the function of this protein right and so [01:17:02] function of this protein right and so well here's one way to design a feature [01:17:04] well here's one way to design a feature vector which is uh I'm going to list out [01:17:07] vector which is uh I'm going to list out all combinations of four amino acids you [01:17:14] all combinations of four amino acids you can tell this will take a while right go [01:17:17] can tell this will take a while right go down to a a a Z and then a a B a and so [01:17:23] down to a a a Z and then a a B a and so on and eventually you know there'll be a [01:17:26] on and eventually you know there'll be a be a JT TST a down to zzzzz right and [01:17:33] be a JT TST a down to zzzzz right and then I'm going to construct five x [01:17:37] according to the number of times I see [01:17:39] according to the number of times I see the sequence in the amino acid so for [01:17:42] the sequence in the amino acid so for example being a JT appears twice so I'm [01:17:47] example being a JT appears twice so I'm gonna put two there you know TST a [01:17:52] gonna put two there you know TST a whatever right a PS ones so I'm for the [01:17:56] whatever right a PS ones so I'm for the one there and there are no a is no ABS [01:17:58] one there and there are no a is no ABS no you see okay so this is a 20 to the [01:18:04] no you see okay so this is a 20 
[01:18:04] So this is a 26-to-the-4th-dimensional feature vector — with the real 20 amino acids, 20 to the 4th is 160,000 — so it's very high-dimensional and quite expensive to compute. [01:18:19] But it turns out that, using dynamic programming, given two amino acid sequences you can compute φ(x) transpose φ(z) — that is, K(x, z) — directly; there's a dynamic programming algorithm for doing this, and the details aren't important for our purposes. If any of you have taken an advanced algorithms course and learned about the Knuth-Morris-Pratt algorithm, this is quite similar to that. (Don Knuth was a Stanford professor — a professor emeritus here.) [01:18:52] And using this, you actually get a pretty decent algorithm for taking a sequence of, say, amino acids and training
training a supervised learning algorithm to make [01:19:06] a supervised learning algorithm to make a clock binary classification on [01:19:08] a clock binary classification on University premises so as your PI [01:19:11] University premises so as your PI support vector machines one of the [01:19:12] support vector machines one of the things you see is that depending on the [01:19:14] things you see is that depending on the input data you have there can be [01:19:16] input data you have there can be innovative kernels to use in order to [01:19:19] innovative kernels to use in order to measure the similarity of two amino acid [01:19:22] measure the similarity of two amino acid sequences or the similarity of two of [01:19:24] sequences or the similarity of two of whatever else and then to use that to [01:19:27] whatever else and then to use that to buy the classifier even on very strange [01:19:30] buy the classifier even on very strange shaped object which you know do not come [01:19:32] shaped object which you know do not come as a feature okay so and I think [01:19:39] as a feature okay so and I think actually another example or if the input [01:19:41] actually another example or if the input X is a histogram you know maybe you have [01:19:43] X is a histogram you know maybe you have two different countries your histograms [01:19:45] two different countries your histograms of people's demographic because it turns [01:19:47] of people's demographic because it turns out that there is a kernel that taking [01:19:50] out that there is a kernel that taking the min of the two histograms and then [01:19:51] the min of the two histograms and then summing up to compute a kernel function [01:19:53] summing up to compute a kernel function that inputs two histograms it measures [01:19:55] that inputs two histograms it measures how similar they are so there many [01:19:56] how similar they are so there many different kernel functions for many [01:19:58] different kernel 
[01:19:58] So there are many different kernel functions for the many different unique types of inputs you might want to process. Okay — so that's it for SVMs, a very useful algorithm. What we'll do on Wednesday is continue with more advice on how to use all of these learning algorithms; we'll talk about bias and variance to give you more advice on how to actually apply them. So that's great, and I look forward to seeing you on Wednesday.
================================================================================
LECTURE 008
================================================================================
Lecture 8 - Data Splits, Models & Cross-Validation | Stanford CS229: Machine Learning (Autumn 2018)
Source: https://www.youtube.com/watch?v=rjbkWSTjHzM
---
Transcript
[00:00:03] Hey guys, let's get started. So over the last several weeks you've learned a lot about many different learning algorithms: from linear regression — so that's regression — to generalized linear models, to generative algorithms like GDA and Naive Bayes, and most recently support vector machines. [00:00:22] What I'd like to do today is to start talking about advice for applying learning
algorithms — a foundational bit of the theory behind how to make good decisions about what to do, how to actually apply these algorithms. [00:00:39] So today I want to discuss bias and variance. It turns out — you know, I've built quite a lot of machine learning systems — it turns out that bias and variance is one of those concepts that's easy to understand but hard to master. You know how lots of board games, or sometimes smartphone games, say they're "easy to learn, hard to master," or something like that? [00:01:01] Bias and variance is exactly one of those things. I've had PhD students who worked with me for several years, then graduated and worked in industry for a couple of years after that, and they actually tell me that when they took machine learning at Stanford they learned bias and variance, but as they progressed over many years, their
understanding of bias and variance continued to deepen. [00:01:24] So I'm going to try to accelerate your learning of bias and variance, because I find that people who understand these concepts are much more efficient in how they develop learning algorithms and make them really work. So let's talk about this today — and it'll be a recurring theme that comes up a few times in the next several weeks as well. [00:01:46] Then we'll discuss regularization and talk about how to reduce variance in learning algorithms, talk about train/dev/test sets, and then also talk about model selection and cross-validation algorithms. [00:02:04] Oh, and a few reminders for today: problem set one is due tonight, 11:59 p.m., and if you are not yet ready to submit it today, late submissions are accepted until Saturday evening — Saturday, 11:59 p.m. —
with the details of late submissions governed by the late-day policy written on the course website. So definitely do submit your homework on time today; if for some reason you're not able to, there's the late submission option — which we don't encourage anyone to take advantage of, but it is written on the course website. [00:02:40] And problem set two will be released shortly — actually, I think it was already posted online — and it's due two weeks from now. [00:02:56] What I'm going to do today is talk about the conceptual aspects of this, and if you want to see even more of the math behind these concepts, this Friday's discussion section will be covering some of the mathematical aspects of learning theory, such as error decomposition, uniform convergence, and VC dimension. [00:03:16] You know, one interesting thing I've learned watching the evolution of machine learning over many years is that
[00:03:20] machine learning as a discipline has [00:03:22] become less mathematical over the years. [00:03:25] So I remember when, you know, machine [00:03:30] learning people used to worry about [00:03:32] computing the normal equations, where [00:03:34] theta equals (X transpose X) inverse times X transpose y, and [00:03:35] how numerically stable your [00:03:37] numerical solver for solving the normal [00:03:39] equations is when inverting a matrix and [00:03:41] solving linear equations. But because [00:03:44] numerical linear algebra has made [00:03:46] tremendous strides, now we just call [00:03:48] a linear algebra routine [00:03:50] to invert the matrix or solve the linear [00:03:51] equations, and do not worry about what [00:03:54] is numerically stable or not. But once [00:03:56] upon a time a lot of my friends in [00:03:58] machine learning were reading textbooks [00:04:00] on numerical optimization to figure out [00:04:02] whether your, you know, formula for inverting a [00:04:05] matrix or for solving a system of [00:04:07] equations was numerically stable. And so [00:04:09] one of the trends I have
seen is that, I [00:04:12] think, you know, three or four years ago, [00:04:15] to understand bias and variance there was [00:04:17] a certain mathematical theory that was [00:04:19] crucial to understand, [00:04:20] and so I used to teach that in CS 229. [00:04:23] But, we're constantly trying [00:04:26] to improve this class, right, and I [00:04:28] decided that that mathematical theory is [00:04:31] actually less crucial today if your main [00:04:33] goal is to make these algorithms work. So we [00:04:35] still teach it, but we're doing it in the [00:04:37] Friday discussion section, and that means [00:04:38] more time for the main lecture here to [00:04:41] talk more about the conceptual things I [00:04:42] think will help you build learning [00:04:44] algorithms, as well as for the newer [00:04:45] topics; we'll talk about [00:04:47] decision trees, random [00:04:49] forests, and neural networks. So, okay, [00:04:54] let's dive into bias and variance. Um, let's [00:04:58] say you have this data set, right? I'm [00:05:08] gonna draw the same data set three times. [00:05:17]
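As an aside on that earlier point about just calling a linear algebra routine: the contrast between the textbook normal equations and a library least-squares solver might be sketched like this (a toy example with made-up data, not from the lecture):

```python
import numpy as np

# Hypothetical design matrix and targets for a tiny least-squares problem.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(0, 10, 50)])  # intercept + one feature
y = 3.0 + 2.0 * X[:, 1] + rng.normal(0, 0.5, 50)

# Textbook normal equations: theta = (X^T X)^{-1} X^T y.
# Forming the explicit inverse is the numerically fragile route.
theta_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# The modern route: hand the problem to a linear algebra routine.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta_normal)  # both should land close to the true [3, 2] here
print(theta_lstsq)
```

On a well-conditioned problem like this toy one the two agree to machine precision; the point of the routine is that it stays stable on badly conditioned problems where the explicit inverse does not.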
um, let's say you have a housing price prediction problem where this is [00:05:26] the size of the house and this is the price [00:05:27] of the house. Um, you know, it looks like [00:05:31] if you fit a straight line to this data, [00:05:34] maybe it's not too bad, right? But it [00:05:40] looks like this data set seems to go [00:05:42] up and then curve downward a little bit, [00:05:44] right? And so maybe this is a slightly [00:05:47] better model. So this is if you fit a linear function, [00:05:52] theta 0 plus theta 1 x. But if you fit a [00:05:57] quadratic model, maybe that's actually [00:05:59] visually a little bit better. Or you [00:06:07] could actually fit a high-order [00:06:08] polynomial: this is one, two, three, four, [00:06:10] five, six examples, so if you fit a [00:06:12] fifth-order polynomial, say up to theta 5 x [00:06:20] to the fifth, then, you know, you can [00:06:24] actually fit a function that passes [00:06:25] through all the points perfectly, but [00:06:28] that doesn't seem like a great model for [00:06:31] this data. [00:06:32] And so, um, to name these phenomena: the [00:06:37] function,
assuming, you know, the one in [00:06:39] the middle, is what we'd like; fitting a [00:06:43] quadratic function is maybe pretty [00:06:45] good, so that's called just right. [00:06:46] However, this example on the left [00:06:52] underfits the data, as it's not [00:06:59] capturing the trend that is maybe somewhat [00:07:03] evident in the data, and we say this [00:07:05] algorithm has high bias. And the term [00:07:09] bias, you know, the term bias has [00:07:13] actually multiple meanings in the [00:07:14] English language. We as a society want to [00:07:17] avoid, you know, racial bias and gender [00:07:20] bias and discrimination against people's [00:07:22] orientation and things like that. So the [00:07:25] term bias in machine learning has a [00:07:26] completely separate meaning, and it just [00:07:29] means [00:07:31] that, um, this learning algorithm had very [00:07:35] strong preconceptions that the data could be [00:07:38] fit by a linear function. This algorithm [00:07:40] has a very strong bias, a very strong [00:07:42] preconception, that the relationship [00:07:44] between price and
the size of the house [00:07:45] is linear, and this bias turns out not to [00:07:48] be true. Okay, so this is a different [00:07:50] sense of bias [00:07:52] than the other type of undesirable bias [00:07:54] that we want to avoid as a society, which, [00:07:56] interestingly, comes up in machine [00:07:58] learning as well; in other contexts we [00:08:00] want our learning algorithms to avoid [00:08:02] those biases too. So there are different uses of [00:08:03] the term. And in contrast, for this curve [00:08:07] on the right, we say that this is [00:08:09] overfitting the data, and this algorithm [00:08:16] has high variance. And the term high [00:08:20] variance comes from this intuition that [00:08:22] you happened to get these six examples, [00:08:26] but if, you know, a friend of yours were to [00:08:29] collect [00:08:32] a slightly different [00:08:36] set of six examples, right, if a friend [00:08:38] of yours were to rerun this and collect a [00:08:40] slightly different set of housing prices, [00:08:44] you know, right, then this algorithm will
[00:08:49] fit some totally different, wildly varying function [00:08:52] on this, and so your predictions will [00:08:55] have very high variance, if you think of [00:08:57] this as varying over different random [00:08:58] draws of the data. So the variation is: [00:09:02] if a friend of yours does the same [00:09:04] experiment with a slightly different data [00:09:06] set, just due to random noise, then this [00:09:08] algorithm, fitting a fifth-order [00:09:09] polynomial, results in a totally [00:09:11] different result. So we say [00:09:14] that this algorithm has very high variance; [00:09:16] there's a lot of variability in the [00:09:18] predictions this algorithm will make. Um, [00:09:21] so one of the things we'll need to do is [00:09:24] identify this in your learning algorithm. So, [00:09:29] when we train a learning algorithm, it [00:09:30] almost never works the first time, right? [00:09:32] And so when I'm developing learning [00:09:34] algorithms, my standard workflow is often [00:09:37] to train an algorithm, often train [00:09:40] something quick and dirty, and then try [00:09:42] to understand
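The under-fit / just-right / overfit picture can be reproduced numerically; here is a sketch with six made-up points standing in for the housing data on the board:

```python
import numpy as np

# Six made-up (size, price) points that rise and then level off,
# standing in for the housing sketch in the lecture.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.0, 2.2, 2.8, 3.4, 3.5, 3.4])

train_err = {}
for degree in (1, 2, 5):
    coeffs = np.polyfit(x, y, degree)            # least-squares polynomial fit
    train_err[degree] = float(np.mean((y - np.polyval(coeffs, x)) ** 2))
    print(degree, train_err[degree])

# The degree-5 polynomial passes through all six points (training error ~0),
# yet is the worst model of the underlying trend: high variance. The straight
# line leaves visible structure unexplained: high bias. Degree 2 is "just right".
```

Note that training error alone rewards the overfit model; that is exactly why the dev/test-set ideas mentioned at the start of the lecture are needed.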
if the algorithm has a [00:09:44] problem of high bias or high variance, if [00:09:46] it's underfitting or overfitting the data, [00:09:48] and I use that insight to decide how to [00:09:51] improve the learning algorithm. Okay, and [00:09:53] I will say a lot more about how to [00:09:55] improve the learning algorithm; we'll have [00:09:57] a menu of tools that we'll talk about in [00:09:59] the next couple weeks for how to [00:10:01] reduce bias or reduce variance in [00:10:05] your learning algorithms. Um, I just [00:10:07] mentioned that the problems of bias and [00:10:11] variance also hold true for [00:10:15] classification problems. [00:10:27] So let's say this is a binary [00:10:29] classification problem. If you fit a [00:10:33] logistic regression model to this, you [00:10:37] know, a straight-line fit to the data, maybe [00:10:40] that's not great. [00:10:42] If you fit a logistic regression model [00:10:45] with a few nonlinear features, so you [00:10:49] have features x1 and x2, and instead of using x1 [00:10:53] and x2 as features you use additional [00:10:55] features x1 squared, x2 squared, x1 times [00:10:58] x2, x1 cubed, and so on; this is a phi of x,
right? [00:11:03] And you can have a small set of features [00:11:05] you choose by hand, usually quite a few [00:11:07] more features than this, or use an SVM [00:11:10] kernel and use an SVM for this problem. [00:11:12] Then, let's see, if you have too [00:11:17] many features, then you might actually [00:11:19] have a learning algorithm that fits a [00:11:21] decision boundary here that looks like [00:11:23] that, right? And this learning algorithm [00:11:29] actually gets perfect performance on the [00:11:31] training set, but this overfits. [00:11:35] Excuse me, I meant to make the colors [00:11:38] consistent; sorry, I meant to use red. [00:11:40] Thank you, you get what I mean. [00:11:42] Um, and it's only if you choose [00:11:45] somewhere in between, you know, that you [00:11:49] get something that seems to be a [00:11:51] much better fit to the data; the green [00:11:53] line seems to be a pretty good way of [00:11:55] separating positive and negative [00:11:56] examples, so that's sort of just right. [00:11:58] So, similar to, I guess I messed up the [00:12:01] colors here, well, kind of, but
similar to these colors here, the blue line [00:12:05] underfits because it's not capturing trends [00:12:06] that are pretty apparent in the data, the [00:12:09] orange line overfits, it's a much too [00:12:11] complicated hypothesis, whereas the green [00:12:13] line is just right. [00:12:17] So it turns out that in the era of, you [00:12:25] know, GPU computing and the ability to train [00:12:27] models with a lot of features, you can overfit by [00:12:31] building a big enough model. So take a [00:12:34] support vector machine: if you add enough [00:12:36] features to it, if you have a high enough, you [00:12:38] know, dimensional feature space, or if you [00:12:41] take a linear regression model or logistic [00:12:44] regression model and just add enough [00:12:45] features to it, you can often overfit the [00:12:48] data. And it turns out that one of the [00:12:53] most effective ways to prevent [00:12:55] overfitting is regularization. So let me [00:12:59] describe what that is and how you [00:13:08] can use it in today's lecture. So, [00:13:28] regularization is, it'll be one of [00:13:31] those techniques that won't take that [00:13:34] long to explain;
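To make the feature-expansion idea above concrete, here is a small sketch (the feature set and the parameter values are invented for illustration): a model that is linear in phi(x) can still have a curved decision boundary in the original (x1, x2) space.

```python
import numpy as np

def expand_features(x):
    """Hand-chosen nonlinear features in the spirit of the lecture:
    intercept, x1, x2, x1^2, x2^2, x1*x2 (the exact set is up to you)."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 ** 2, x2 ** 2, x1 * x2])

# Illustrative, made-up parameter vector: theta^T phi(x) > 0 exactly when
# x1^2 + x2^2 < 1, so the decision boundary is the unit circle, something
# no straight line in the raw (x1, x2) space can represent.
theta = np.array([1.0, 0.0, 0.0, -1.0, -1.0, 0.0])

def predict(x):
    return 1 if theta @ expand_features(x) > 0 else 0

print(predict((0.2, 0.3)))  # inside the circle: predicts 1
print(predict((2.0, 2.0)))  # outside the circle: predicts 0
```

Adding many more such terms makes the representable boundaries more and more complicated, which is exactly how this kind of model overfits.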
it'll sound deceptively [00:13:36] simple, but it's one of the techniques [00:13:38] that I use most often; I feel like I [00:13:41] use regularization in many, many models. [00:13:43] So just because it doesn't sound [00:13:45] complicated, or maybe won't [00:13:47] even take that long to explain today, [00:13:48] don't underestimate how widely used it is. [00:13:50] It's not used in every [00:13:53] single machine learning model, but it's [00:13:54] used very, very often. So here's the idea: [00:14:09] here is, let's say, linear regression, right? [00:14:27] So that's the optimization objective for [00:14:30] linear regression. If you want to add [00:14:33] regularization, you just add one extra [00:14:38] term here: lambda times the norm of theta [00:14:45] squared, right? Sometimes you write lambda [00:14:48] over two to make some of the derivations [00:14:50] come out easier. And what this does is it [00:14:54] takes your cost function for linear [00:14:56] regression, which you try to minimize, [00:14:58] minimizing the squared error fit [00:15:00] to the data, and you are adding an [00:15:03] incentive term
for the algorithm to make [00:15:06] the parameters theta smaller, okay? So [00:15:09] this is called the regularization term. [00:15:16] And it turns out that, um, let's take the [00:15:21] linear regression overfitting example. So, [00:15:28] you know, if you set lambda equals zero, [00:15:30] then it's just linear regression with [00:15:32] the fifth-order polynomial features. It [00:15:36] turns out that as you increase lambda, [00:15:39] to some intermediate value, [00:15:41] depending on the scale of the data, let's [00:15:43] say you set lambda equals one, then when you [00:15:46] solve this minimization problem, [00:15:48] this augmented problem, for the value of [00:15:50] theta, this term penalizes the parameters for [00:15:54] being too big, and it turns out that you [00:15:57] end up with a fit that looks a little [00:16:02] bit better, right? Maybe it looks like [00:16:03] that, [00:16:04] okay? And by preventing the parameters [00:16:08] theta from being too big, you're making it [00:16:10] harder for the learning algorithm to [00:16:12] overfit the data. It turns out fitting a [00:16:15] very
high-order polynomial like that may [00:16:18] result in values of theta that are very [00:16:18] large, right? And then, if you set [00:16:23] lambda to be too large, then you actually [00:16:27] end up in an underfitting regime, okay? [00:16:32] So there will usually be some optimal value [00:16:35] of lambda. When lambda equals zero, [00:16:37] you're not using any regularization, [00:16:39] so it may be overfitting. If lambda is [00:16:42] way too big, then you're forcing all the [00:16:45] parameters to be too close to zero; in [00:16:48] fact, think about it: if lambda [00:16:50] equals, you know, 10 to the 100 or some [00:16:52] ridiculously large number, then you are [00:16:55] really forcing all the thetas to be 0, [00:16:57] right? And if all the thetas are 0, then, you [00:17:00] know, you're kind of fitting this [00:17:02] straight line, right? So that's if lambda [00:17:04] equals 10 to the 100, and this is [00:17:07] a very simple function, which is the [00:17:10] function 0, right, this function h [00:17:12] theta of x equals 0, approximately [00:17:16] 0. This is a very simple function, which [00:17:19] you get if you set lambda
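The effect of dialing lambda can be sketched in closed form: minimizing the squared error plus lambda times the norm of theta squared gives theta = (X^T X + lambda I)^{-1} X^T y (ridge regression; the toy data and the feature scaling below are my own choices, not from the lecture):

```python
import numpy as np

# Toy data again: six points, expanded to fifth-order polynomial features.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.0, 2.2, 2.8, 3.4, 3.5, 3.4])
X = np.vander(x / 6.0, N=6, increasing=True)   # columns 1, x, x^2, ..., x^5 (scaled for conditioning)

def ridge_fit(X, y, lam):
    """Closed-form minimizer of ||X theta - y||^2 + lam * ||theta||^2."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

norms = {lam: float(np.linalg.norm(ridge_fit(X, y, lam))) for lam in (0.0, 1.0, 1e6)}
print(norms)

# lam = 0:   plain fifth-order fit; large parameters, interpolates the points (overfits).
# lam = 1:   intermediate; the parameters shrink and the fit smooths out.
# lam = 1e6: the thetas are driven toward 0, so h(x) is roughly the zero function (underfits).
```

The parameter norm shrinks monotonically as lambda grows, which is the dial between the much-too-complex interpolating fit and the much-too-simple h(x) = 0.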
to be very large, [00:17:21] and by dialing lambda between, you know, a [00:17:24] far-too-large value like 10 to the 100 [00:17:27] and a far-too-small value like [00:17:29] 0, you move smoothly [00:17:31] between this much-too-simple function, [00:17:34] h equals 0, and a much-too-complex [00:17:36] function, okay? So that's [00:17:46] pretty much it [00:17:51] for regularization in terms of what you [00:17:53] need to implement: if you think your [00:17:54] learning algorithm may be overfitting, add [00:17:57] this to your model and solve this [00:18:00] optimization problem, and it will help [00:18:03] relieve overfitting. More generally, if [00:18:08] you have, [00:18:13] let's see, more generally, if you have, [00:18:19] say, a logistic regression problem where [00:18:22] this is your cost function, then to add [00:18:34] regularization, I guess instead of min [00:18:37] this is a max, right? If you're applying [00:18:39] logistic regression, this was the [00:18:41] original cost function, and then you can [00:18:44] subtract a lambda term here, or lambda over 2, there's
just a different scaling of lambda, times [00:18:51] the norm of theta squared. And there's a [00:18:53] minus here because with logistic regression [00:18:54] we're maximizing rather than minimizing; [00:18:56] this could be, by the way, any of [00:18:58] the generalized linear model family as [00:18:59] well. But by subtracting lambda times the [00:19:03] norm of theta squared, this allows you to [00:19:04] also regularize a classification [00:19:06] algorithm such as logistic regression, okay? [00:19:10] Um, it turns out that, and I want to [00:19:16] make an analogy here where all the [00:19:19] math details are true, but we don't want [00:19:22] to talk through all the math details: it [00:19:23] turns out that one of the reasons the [00:19:26] support vector machine doesn't overfit [00:19:28] too badly, even though it, you know, [00:19:30] can be working in an infinite, like, you [00:19:33] know, infinite-dimensional feature space, [00:19:34] right? So why doesn't a support vector [00:19:37] machine just overfit like crazy? We [00:19:39] showed on Monday that by using kernels [00:19:42] it's sort
of using an infinite-dimensional [00:19:44] feature space, right? So why doesn't it [00:19:47] always fit these crazy complicated [00:19:49] functions and just overfit the data like [00:19:51] crazy? It turns out, and the theory is [00:19:53] complicated, it turns out that, you know, [00:19:59] the optimization objective of the [00:20:00] support vector machine was to minimize [00:20:02] the norm of w squared, and this turns out to [00:20:05] correspond to maximizing the margin, the [00:20:08] margin of the SVM. And it's [00:20:10] actually possible to prove that this has [00:20:13] a similar effect [00:20:14] as that, right? This is why the [00:20:17] support vector machine, despite working in an [00:20:18] infinite-dimensional feature space, [00:20:20] by forcing the parameters to [00:20:23] be small, makes it difficult for the support [00:20:25] vector machine to overfit the data too [00:20:28] much, okay? The theory to actually show [00:20:29] this is quite complicated; you [00:20:35] can actually show that the class of [00:20:37] classifiers where this, the norm of w, [00:20:39]
is small, cannot be too complicated, [00:20:43] cannot overfit, basically. But that's why, [00:20:46] as we said, you can work in an [00:20:48] infinite-dimensional feature space. Yeah? [00:20:58] Oh, sure; the question is, do we ever regularize per [00:21:01] element, per parameter? Um, not really, [00:21:04] and the problem with that is, you know, [00:21:06] let me give one more specific [00:21:08] example and then come back to that, right? So [00:21:13] it turns out that, um, so we talked about [00:21:15] naive Bayes as a text classification [00:21:20] algorithm. It turns out that, let's [00:21:21] see, for a classification [00:21:23] problem, say classifying spam and [00:21:26] non-spam, or classifying the sentiment, [00:21:27] positive or negative, of a [00:21:29] tweet or something, let's say you have a hundred examples [00:21:33] but you have ten-thousand-dimensional features, [00:21:35] right? So [00:21:37] let's say your features are, you [00:21:40] know, taken from the dictionary: a, aardvark, and [00:21:43] so on, and it's a one, zero, one, right? So [00:21:46] you construct your feature vectors. It [00:21:49] turns
out that if you fit logistic [00:21:51] regression to this type of data, where [00:21:53] you have 10,000 parameters and a hundred [00:21:54] examples, [00:21:56] this will probably badly [00:22:01] overfit the data. But it [00:22:03] turns out that if you use logistic [00:22:06] regression with regularization, this is [00:22:07] actually a pretty good algorithm for [00:22:10] text classification. And it will usually, [00:22:13] in terms of accuracy, you [00:22:14] know, because this is logistic regression [00:22:16] you need to implement gradient descent or so [00:22:18] to solve for good parameter values, but [00:22:20] logistic regression with regularization [00:22:23] for text classification will usually [00:22:26] outperform naive Bayes from a [00:22:28] classification-accuracy standpoint. Without regularization, logistic regression [00:22:31] will badly overfit this data. And to [00:22:35] explain a bit more, you know, imagine [00:22:38] that you have a three-dimensional [00:22:41] space where you have two examples; [00:22:44] then all you can do is fit a straight
line right for the hyperplane to [00:22:49] line right for the hyperplane to separate these two examples but so one [00:22:51] separate these two examples but so one rule of thumb for logistic regression is [00:22:55] rule of thumb for logistic regression is that if you do not use regularization [00:22:57] that if you do not use regularization it's nice if the number of examples is [00:23:00] it's nice if the number of examples is at least on the order of the number of [00:23:02] at least on the order of the number of parameters you want to fit right so this [00:23:04] parameters you want to fit right so this is if you're not using regularization [00:23:05] is if you're not using regularization it's nice if in fact I personally think [00:23:08] it's nice if in fact I personally think that I tend to use the jurors and the [00:23:10] that I tend to use the jurors and the only of the number of examples can be [00:23:12] only of the number of examples can be maybe 10x bigger than the number of [00:23:15] maybe 10x bigger than the number of examples because that's what you need to [00:23:17] examples because that's what you need to have enough information to fit good [00:23:19] have enough information to fit good choices all these parameters but that's [00:23:22] choices all these parameters but that's a good not using regularization but if [00:23:24] a good not using regularization but if you are using regularization then you [00:23:27] you are using regularization then you can fit you know even 10,000 parameters [00:23:30] can fit you know even 10,000 parameters right even with only 100 examples and [00:23:32] right even with only 100 examples and this will be a pretty decent text [00:23:35] this will be a pretty decent text classification out um the question you [00:23:39] classification out um the question you had just now why don't we regularize per [00:23:41] had just now why don't we regularize per parameter right so why don't we [00:23:44] parameter right so why 
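As a concrete illustration of the regime being described—this is an editorial sketch, not code from the lecture—here is L2-regularized logistic regression trained by plain gradient descent on synthetic data with 10,000 binary "word" features and only 100 examples. All names, the data, and the hyperparameter values are made up for illustration.

```python
import numpy as np

def train_logreg(X, y, lam=1.0, lr=0.5, iters=300):
    """Gradient descent on the L2-regularized logistic loss:
    (1/m) * sum of log-losses + (lam / (2m)) * ||theta||^2."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        # sigmoid h_theta(x); clip the logits for numerical stability
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ theta, -30.0, 30.0)))
        grad = X.T @ (p - y) / m + (lam / m) * theta  # penalty gradient
        theta -= lr * grad
    return theta

# Text-classification-like regime: far more parameters than examples.
rng = np.random.default_rng(0)
m, n = 100, 10_000
X = (rng.random((m, n)) < 0.05).astype(float)  # sparse 0/1 word indicators
w_true = rng.normal(size=n)
y = (X @ w_true > 0).astype(float)             # synthetic labels

theta = train_logreg(X, y)
train_acc = float(np.mean(((X @ theta) > 0) == y))
```

Without the `lam * theta` term this would be unpenalized maximum likelihood, which in this 10,000-parameters/100-examples setting is exactly the overfitting case discussed above.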
[00:23:39] Um, the question you had just now: why don't we regularize per parameter? Let's see—so I guess instead of lambda times the norm of theta squared, it would be a sum over j of lambda_j times theta_j squared, right? [00:23:55] The reason we don't do this is that if you have 10,000 parameters here, you end up with another 10,000 parameters—the lambda_j's—over here, and choosing all these 10,000 lambdas is as difficult as choosing all those parameters in the first place. So we don't have a good way to do this, whereas when we talk about cross-validation and model selection in a little bit, we'll talk about how to choose maybe one parameter lambda; but those techniques won't work for choosing, you know, 10,000 parameters. [00:24:38] Let's see—right, yes? Thank you. Um, yes, so in order to make sure that the different features are on a similar scale, a common pre-processing step we use in learning algorithms is to rescale the different features. So for text classification, if all
the features are zero-one, you can just leave the features alone. But for housing prediction, if feature x1 is the size of the house, which, I guess, ranges from—how big are the biggest houses? No, whatever—let's say houses go from 500 square feet to 10,000 square feet; a 10,000-square-foot house is really, really big. [00:25:11] But then feature x2 is the number of bedrooms, which probably ranges from, like—oh, I wonder; I guess some houses have a ton of bedrooms, but I think most houses have at most 5 bedrooms, right? Then these features are on very different scales, and normalizing them to all be on a similar scale—so subtract out the mean and divide by the standard deviation, to scale all of these features to be between, you know, 0 and 1, or between minus 1 and 1—would be a good pre-processing step before applying these methods.
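The normalization just described—per feature, subtract the mean and divide by the standard deviation—can be sketched in a few lines. This is an illustration, not lecture code; the two housing columns are made-up numbers.

```python
import numpy as np

def standardize(X):
    """Scale each column (feature) to zero mean and unit standard
    deviation. Also returns the statistics, so the identical scaling
    can be applied to new examples at prediction time."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)  # leave constant features alone
    return (X - mu) / sigma, mu, sigma

# Feature 0: size in square feet (500-10,000); feature 1: bedrooms (1-5).
X = np.array([[500.0, 1.0],
              [2400.0, 3.0],
              [10000.0, 5.0]])
X_scaled, mu, sigma = standardize(X)
```

Returning `mu` and `sigma` matters in practice: test examples must be scaled with the training set's statistics, not their own.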
[00:25:43] It turns out that this will also make gradient descent run faster, so it's a common pre-processing step to scale each individual feature to be on a similar range of values. [00:26:10] So let me repeat the question—it was actually two questions: why don't support vector machines overfit too badly? Is it because there's a small number of support vectors, or is it because of minimizing the penalty on w? Um, I would say the formal argument relies more on the latter. It turns out that if you look at the class of functions that separate the data with a large margin, that class has low complexity, formalized by low VC dimension, which you'll learn about in Friday's discussion section if you want to come to that. And so it turns out that the class of functions that separate the data with a large margin is a relatively simple class of functions—and by a simple class of functions, I mean one with low VC dimension, which we'll talk about this Friday. And so any
function within that class of functions is not too likely to overfit. So it is convenient that the support vector machine ends up with a relatively small number of support vectors, but you could imagine other algorithms with a very large number of support vectors; as long as the margin is large, the argument still goes through. [00:27:24] Oh, sure, yes—so, is it possible, though—yes. So, in general, models that have high bias tend to underfit, and models with high variance tend to overfit. We use the terms overfit and high variance, and underfit and high bias, almost interchangeably—they have very similar meanings, but they don't quite mean the same thing. One thing we'll see later, about two weeks from now, is that we'll talk about algorithms with high bias and high variance at the same time. [00:28:06] Actually, one way to think of high bias and high variance together: imagine your data set looks like this, and somehow your classifier has very high complexity—it's a very, very complicated function—but for some reason it's still not fitting your data well, right. That would be one way to have high bias and high variance, which does happen. [00:28:47] All right. So, to wrap up the discussion on regularization: mechanically, the way you implement regularization is by adding that penalty on the norm of the parameters—so that's what you actually implement. It turns out that there's another way to think about regularization. You remember when we talked about linear regression, we talked about minimizing squared error, and then later on we saw that linear regression was maximum likelihood estimation in a certain generalized linear model, using a Gaussian distribution as the output distribution—the Gaussian is a member of the
exponential family. [00:29:31] It turns out that a similar point of view can be taken on the regularization algorithm that we just saw. So let's say S is the training set. [00:29:53] Given a training set, you want to find the most likely value of theta, right? And so by Bayes' rule, P(theta | S) = P(S | theta) P(theta) / P(S). And if you want to pick the value of theta that's the most likely value given the data you saw, then, because the denominator is just a constant, this is argmax over theta of P(S | theta) P(theta). [00:30:41] And so if you're using logistic regression, then the first term is this likelihood, and the second term is P(theta), where the first is, you know, the logistic regression model's likelihood—or that of any generalized linear model. [00:31:15] And it turns out that if you assume P(theta) is Gaussian—so if we assume the prior probability on theta is Gaussian with mean zero and some covariance tau squared times the identity, tau^2 I; in other words, P(theta) = 1 / ((2 pi)^(n/2) |tau^2 I|^(1/2)) * exp(-(1/2) theta^T (tau^2 I)^(-1) theta), the usual Gaussian density—it turns out that if this is your prior distribution for theta, and you plug this in here, and you take logs, compute the max, and so on, then you end up with exactly the regularization technique that we found just now, okay? [00:32:21] And so, in everything we've been doing so far, we've been taking a frequentist interpretation. I guess the two main schools of statistics are the frequentist school of statistics and the Bayesian school of statistics, and there used to be, sort of, titanic academic debates about which is the right one, but I think the statisticians have gotten together and kind of made peace.
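For reference, the equivalence described a moment ago—a Gaussian prior plus taking logs recovers the L2 penalty—can be written out as follows. This is an editorial reconstruction of the board work, assuming the prior $\theta \sim \mathcal{N}(0, \tau^2 I)$; constants that do not depend on $\theta$ drop out of the argmax.

```latex
\theta_{\mathrm{MAP}}
  = \arg\max_{\theta}\; p(S \mid \theta)\, p(\theta)
  = \arg\max_{\theta}\; \Big[ \log p(S \mid \theta)
      - \tfrac{1}{2\tau^{2}} \,\lVert\theta\rVert^{2} \Big]
  = \arg\min_{\theta}\; \Big[ -\log p(S \mid \theta)
      + \lambda \,\lVert\theta\rVert^{2} \Big],
\qquad \lambda = \tfrac{1}{2\tau^{2}} .
```

So the regularization weight corresponds inversely to the prior variance: a tighter prior (small $\tau$) means heavier regularization.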
[00:32:53] And people go between these two schools more and more these days—well, maybe not all the time. In the frequentist school of statistics, we say that there is some data, and we want to find the value of theta that makes the data as likely as possible—and that's where we got maximum likelihood estimation, all right? And in the frequentist school of statistics, we view there as being some true value of theta out in the world that is unknown: there is some true value of theta that generated all these housing prices, and our goal is to estimate this true parameter. [00:33:31] In the Bayesian school of statistics, we say that theta is unknown, but before you see any data, you already have some prior beliefs about how housing prices are generated out in the world, and your prior beliefs are captured in a prior distribution, denoted by P(theta). So this is called a Gaussian prior, and— [00:33:56] if you look at this Gaussian prior—excuse me—it's quite reasonable. You're saying that before you've seen any data, on average, I think the parameters theta have mean zero, because I don't know if each theta is positive or negative, so giving them mean zero seems reasonable. And I've just assumed that my prior on theta is Gaussian—you know, we could debate whether this is the right assumption, but it's not totally unreasonable, right? [00:34:22] You could say, well, actually, for the next linear regression problem I'm going to work on next week—and I have no idea what I'm going to apply linear regression to next week—it's actually not too bad an assumption to say, you know, my prior is Gaussian. And in the Bayesian view of the world, our goal is to find the value of theta that is most likely after we've seen the data.
Okay. [00:34:55] And so this is called MAP estimation, where MAP stands for maximum a posteriori. So this is actually the MAP estimator—I take the argmax of this, right? That's the MAP, or maximum a posteriori, estimate of theta, which means: look at the data, compute the Bayesian posterior distribution on theta, and pick the value of theta that's most likely, okay? [00:35:27] And so one of the things you do in the problem set that was just released is actually show this equivalence, as well as plug in a different prior: other than the Gaussian prior, you experiment with what happens when P(theta) is the Laplace prior, and derive a different MAP estimator, okay. [00:36:04] Wait, sorry—could you say that again? Oh, I see—yes, the question is the difference between these two. Yes, so maximum likelihood here corresponds to, you know, estimating without regularization, and this MAP procedure here corresponds to having regularization. [00:36:39] It turns out that frequentist statisticians can also use regularization; it's just that they don't try to justify it through imposing a prior. So if you're a frequentist statistician, your job is to wake up and come up with an algorithm to estimate this, you know, true value of theta that's out in the world; you can come up with any procedure you want, and as part of your procedure you can add a regularization term. [00:37:02] I think a lot of these debates between frequentists and Bayesians are more philosophical. As a machine learning person, as an engineer—I don't really, you know—I think the philosophical debates are lovely, but I just like my stuff to work. So we can say frequentists can also end up with regularization; it's just that they say this is part of the algorithm they invented, rather than something derived from a
Bayesian prior. [00:37:23] All right, cool. [00:37:40] So let's talk about—continuing the discussion on regularization and choosing the degree of polynomial—let's see, let's say I plot a chart where, on the horizontal axis, I plot model complexity: how complicated is your model? So, for example, toward the right of this curve could be a very high degree polynomial. [00:38:24] And what you find is that, as you increase model complexity, your training error—if you do not regularize, right, so if you fit a linear function, then a quadratic function, then a cubic function, and so on—you find that the higher the degree of the polynomial, the better your training error, because, you know, a fifth-order polynomial will always fit the training set better than a fourth-order polynomial, if you do not regularize. But what we saw with the original picture was that the generalization error of the algorithm kind of goes down and then starts to go back up, right? [00:39:07] And so if you were to have a separate test set, and evaluate your classifier on a set of data that the algorithm hasn't seen so far—so, measure how well the algorithm generalizes to a different, novel set of data—then if you fit a linear function, this underfits; if you fit the fifth-order polynomial, this overfits; and somewhere in between, right, is just right, okay? [00:39:45] And this curve is true for regularization as well. So say you apply linear regression with 10,000 features to a very small training set. If lambda is much too big, then you will underfit; if lambda is 0, so you're not regularizing at all, then it will overfit; and there will be some intermediate value of lambda—not too big, not too small—that, you know, balances overfitting and underfitting, okay?
[00:40:26] So what I'd like to do next is describe a few different mechanistic procedures for trying to find this point in the middle, right. [00:41:11] Um, so, given a data set, what we'll often do is take your data set and split it into different subsets, and a good hygiene is to take the data and split it into train, dev, and test sets. So say you have 10,000 examples and you're trying to carry out this model selection problem: for example, let's say you're trying to decide what order polynomial you want to fit, right; or you're trying to choose the value of lambda; or you're trying to choose the value of tau, which was the bandwidth parameter in locally weighted regression that you saw on the problem set. [00:42:02] Or you're trying to choose the value of C in a support vector machine—so, remember, the SVM objective was actually this, right, subject to some other things; for the soft margin that we saw on Wednesday—or, no, on Monday—you're trying to minimize the norm of w, and then there was this additional parameter C that trades off how much you insist on classifying every training example perfectly. [00:42:31] So whichever of these decisions you're trying to make—how do you, you know, choose a polynomial degree, or choose lambda, or choose tau, or choose the parameter C, which also has this bias-variance trade-off: there will be some values of C that are too large and some values of C that are too small. [00:43:06] So here's one thing you can do. Let's see—split your training data S into a subset which I'm going to call the real training set, S_train, and some subset which we call S_dev, where dev stands for development; and then later we'll talk about a separate test set. And so what you can do is train each
model I mean um option for the degree of polynomial on s train Soviet evaluating [00:44:01] polynomial on s train Soviet evaluating a menu of models right so let's say this [00:44:03] a menu of models right so let's say this is model 1 model 2 and so on up to model [00:44:08] is model 1 model 2 and so on up to model 5 up to some number they can train each [00:44:10] 5 up to some number they can train each of these models on the first subset of [00:44:14] of these models on the first subset of the data and then get some hypothesis [00:44:20] the data and then get some hypothesis that's called H I and then measure the [00:44:29] that's called H I and then measure the error on s death which is the second [00:44:34] error on s death which is the second subset of data called the development [00:44:35] subset of data called the development set and pick the one [00:44:50] so rather than and and I want to [00:44:53] so rather than and and I want to contrast this with an alternative [00:44:56] contrast this with an alternative procedure right so the two cents of the [00:44:58] procedure right so the two cents of the day two substances they talk about tests [00:45:00] day two substances they talk about tests and data training set and development [00:45:02] and data training set and development sets and after training first of all the [00:45:06] sets and after training first of all the world second apollomon or third all [00:45:08] world second apollomon or third all polynomial on the training set evaluate [00:45:10] polynomial on the training set evaluate all of these different models on the [00:45:11] all of these different models on the separate held up development sets and [00:45:14] separate held up development sets and then pick the one with the lowest error [00:45:15] then pick the one with the lowest error on the development center okay but one [00:45:19] on the development center okay but one thing to not do would be to evaluate all [00:45:21] thing to not do would 
be to evaluate all these algorithms instead on the training [00:45:23] these algorithms instead on the training set and then pick the one with the [00:45:26] set and then pick the one with the lowest error on the training set right [00:45:28] lowest error on the training set right why not what what goes wrong when you do [00:45:29] why not what what goes wrong when you do that [00:45:38] Yeah right you just over fit I were you [00:45:40] Yeah right you just over fit I were you over it [00:45:49] yeah yep cool right so if you use this [00:45:52] yeah yep cool right so if you use this procedure you always end up picking the [00:45:54] procedure you always end up picking the fifth order polynomial right because the [00:45:56] fifth order polynomial right because the more complex our rhythm will always do [00:45:59] more complex our rhythm will always do better on the training set so if you do [00:46:00] better on the training set so if you do this this will always cause you to say [00:46:02] this this will always cause you to say let's use the fifth order polynomial or [00:46:04] let's use the fifth order polynomial or the highest possible order polynomial so [00:46:06] the highest possible order polynomial so this won't help you realize in the [00:46:08] this won't help you realize in the housing price prediction example the [00:46:10] housing price prediction example the second order polynomial is a benefit to [00:46:12] second order polynomial is a benefit to the data and that's why for this [00:46:16] the data and that's why for this procedure if you evaluate your models [00:46:21] procedure if you evaluate your models error or the separate development set [00:46:23] error or the separate development set that the album did not see during [00:46:25] that the album did not see during training this allows you to hopefully [00:46:28] training this allows you to hopefully pick a model that neither overfits no [00:46:30] pick a model that neither overfits no longer fits 
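The holdout procedure just described can be sketched in a few lines. This is a minimal illustration, not code from the course: the quadratic toy data, the 70/30 split, and numpy's `polyfit` standing in for the menu of polynomial models are all assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data, assumed for illustration: quadratic ground truth plus noise,
# loosely mirroring the housing-price example.
x = rng.uniform(0, 3, size=100)
y = 2 + 1.5 * x - 0.4 * x**2 + rng.normal(0, 0.3, size=100)

# Split S into the "real" training set S_train and a dev set S_dev.
x_train, x_dev = x[:70], x[70:]
y_train, y_dev = y[:70], y[70:]

train_errors, dev_errors = {}, {}
for degree in range(1, 6):                          # the menu: model 1 .. model 5
    coeffs = np.polyfit(x_train, y_train, degree)   # fit h_i on S_train only
    for errs, xs, ys in ((train_errors, x_train, y_train),
                         (dev_errors, x_dev, y_dev)):
        errs[degree] = float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

# Training error can only go down as the degree grows, so picking on it
# always selects the most complex model; the dev error is what to minimize.
best_degree = min(dev_errors, key=dev_errors.get)
```

Printing `train_errors` shows the monotone decrease the lecture warns about, while `dev_errors` typically turns back up once the polynomial starts to overfit.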
And in this example, hopefully you'd find that it's the second-order polynomial, the one that's just right in between, that actually does best on your development set. Okay.

[00:46:51] Now, if you are publishing an academic paper on machine learning, then this procedure has looked at the training set as well as the development set, right? It has tuned the parameters to the training set, and it has tuned the decision about the degree of polynomial to the dev set. So if you want to publish a paper and say "my algorithm achieves 90% accuracy on this data set," it's not valid to report the result on the dev set, because the algorithm has already been optimized to that data; in particular, the information about the best degree of polynomial was derived from the dev set, from the development set. So if you're publishing a paper, or you want to report an unbiased result, evaluate the algorithm on a separate test set, S_test, and report that error. If you're publishing a paper, it's good hygiene to report the error on a completely separate test set that you did not in any way, shape, or form look at during the development of your model, during the training procedure or during dev.

[00:48:24] [Student question: is dev versus test really any different?] It depends on the size of the data set. Actually, let me give an example. Let's say you're trying to fit a degree of polynomial and you want to choose it by dev error. So you fit polynomials of each degree, first, second, third, and so on, and after fitting all of these, let's say the squared errors, just using round numbers for illustrative purposes, come out to 10, 5.1, 5.0, 4.9, 7, 10. If you're using the dev error to pick the best hypothesis, to pick the best classifier, you would pick the one that gets you 4.9 squared error. But did you really earn that 4.9 squared error, or did you just get lucky? Because there is some noise, and so maybe all of these actually have error close to 5.0, but some are just a bit higher and some a bit lower, and you got a little bit lucky that on the dev set this one did better. Which is why your dev set error is a biased estimate, right? Whereas if there were a very large test set, maybe the true numbers, your actual expected squared errors, are 10, 5, 5, 5, 7, 10; it's just that because of a little bit of noise you got lucky and reported 4.9. And so this would be a bad thing to do in an academic paper, because what you earned was an error of 5.0; you didn't earn an error of 4.9. You're overfitting a little bit to the dev set: you chose the thing that looked best on the dev set, but your algorithm didn't actually achieve that error, it's just noise. Okay, so reporting the dev error isn't really a valid, unbiased procedure, and it's now in some circles considered good practice to report the test error instead.

[00:50:55] Question. [00:51:27] Yeah, so, as the questioner said, yes, you're right: one of the problems with some of the machine learning benchmarks that people have worked on for a long time is this unavoidable amount of overfitting to the test sets, because everyone has worked for years trying to publish the best numbers on the same test set.
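The "did you earn 4.9 or just get lucky" point can be checked numerically. A small simulation, with the model count and noise level made up for illustration: several models whose true expected error is exactly 5.0, each measured with dev-set noise, where we always report the best-looking number.

```python
import numpy as np

rng = np.random.default_rng(0)

true_error = 5.0      # assume every model's true expected squared error is 5.0
n_models, n_trials = 4, 10_000
noise_sd = 0.1        # dev-set measurement noise, an assumed value

# Each trial: one noisy dev-error estimate per model; report the minimum.
estimates = true_error + noise_sd * rng.standard_normal((n_trials, n_models))
reported = estimates.min(axis=1)

# Each individual estimate is unbiased, but the minimum is biased low:
# on average you "report" noticeably less than the 5.0 you actually earned.
print(round(float(estimates.mean()), 2), round(float(reported.mean()), 2))
```

The more models you compare on the same dev set, the larger this downward bias gets, which is exactly why the reported dev error is not an unbiased estimate.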
So the academic community in machine learning does have some amount of overfitting to the standard benchmarks that people have worked on for a long time, and this is an unfortunate result. When the test set is very, very large, the amount of overfitting is probably smaller, but when the test set is not big enough, this overfitting effect can sometimes cause even research papers to publish results that are probably overfit to the data set. And I think there's actually one standard academic benchmark, the data set called CIFAR, that's quite small, and there's actually a recent research paper analyzing results on CIFAR and arguing that some fraction of the progress that was made was perhaps researchers unintentionally overfitting to this data set. Okay.

[00:52:28] Oh, by the way, one thing I do when I'm building, you know, production machine learning systems, when I'm shipping a product, right, like building a speech recognition system: I just want to make it work, and if I'm not trying to publish a paper and not trying to make some claim, sometimes I don't bother with a test set. That means I sometimes don't know the true error of the system, but I'm very conscious of that: if I don't have a lot of data, sometimes I'll decide to just not have a test set, and it means I just don't try to report a test set number. I can report a dev set number, which I know is biased, and I just don't report a test set number. Don't do this if you're publishing an academic paper; it's not good if you're publishing a paper and making claims to the outside world. But if all you're doing is building a product and not writing a paper, this is actually okay. [00:53:18] Yeah? Yeah. Okay, good.
[00:53:32] Let me get to that. Good. So the next topic on the train/dev/test split is: how do you decide how much data should go into each of these three subsets? Let me tell you the historical perspective and then a modern perspective. Historically, the rule of thumb was: you take your training set S, and one split that you see a lot of people refer to is 70% train, 30% test. That's one common rule of thumb that you just hear a lot, maybe for when you don't have a dev set because you're not doing model selection, because you've already picked the algorithm. Or people use 60% train, 20% dev, 20% test. These are rules of thumb that people use, and they're decent rules of thumb when you don't have a massive data set: if you have a hundred examples, maybe a thousand examples, maybe several thousand examples, I think these rules of thumb are perfectly fine.

[00:54:48] What I'm seeing is that as you move to machine learning problems with really, really giant data sets, the percentage of data you send to dev and test is shrinking. Here's what I mean. Let's say you have 10 million examples: decent-sized, not giant, but a reasonable size. The splits above are actually pretty good rules of thumb for a small data set; if you have 5,000 examples they're perfectly fine to use. But if you have 10 million examples, then you'd have 6 million train, 2 million dev, 2 million test, and the question is: do you really need two million examples to estimate the performance of your final classifier? Sometimes you do. If you're working on online advertising, which I have done, and you're trying to increase your ad click-through rate by 0.1 percent, and it turns out increasing ad click-through rates by 0.1 percent, which I've done multiple times, is very lucrative, then you actually need a very large data set to measure these very, very small improvements. To increase an ad click-through rate by 0.1 percent you might have a lot of projects, say 10 projects, each of which increases the click-through rate by 0.01 percent. So to measure these very small differences, where algorithm A does 0.01 percent better than algorithm B, you need a lot of data to tease out that very small difference. If you're in the business of teasing out these very small differences, you actually need very large test sets. But if you are comparing different algorithms and one algorithm is, you know, 2 percent better, or even 1 percent better than the other, then a thousand examples may be enough for you to distinguish between these much larger differences.

[00:56:44] So my recommendation for choosing the dev and test sets is: choose them to be big enough that you have enough data to make meaningful comparisons between different algorithms. If you suspect your algorithms will differ in performance by 0.01 percent, you just need a lot of data to distinguish that. If you have 100 examples, then if one algorithm has 90 percent accuracy and another has 90.01 percent accuracy, unless you have at least a thousand examples, and maybe ten thousand or more, you just can't see this very small difference; with a hundred examples you just can't measure it.
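The back-of-the-envelope reasoning behind those numbers: the standard error of an accuracy estimate on n examples is roughly sqrt(p(1-p)/n), so you need n large enough that the gap you care about spans several standard errors. The helper below, including the z = 2 threshold, is my own rough sketch, not a formula from the lecture, and a real comparison of two classifiers on the same test set would use a paired test.

```python
import math

def rough_test_size(p, delta, z=2.0):
    """Rough test-set size so that an accuracy gap `delta` around base
    accuracy `p` is about z standard errors of a single estimate."""
    per_example_sd = math.sqrt(p * (1 - p))   # std dev of a single 0/1 outcome
    return math.ceil((z * per_example_sd / delta) ** 2)

# A 2% gap (90% vs 92%) is visible with on the order of a thousand examples...
print(rough_test_size(0.90, 0.02))
# ...but a 0.01% gap (90% vs 90.01%) needs tens of millions.
print(rough_test_size(0.90, 0.0001))
```

This is why ad click-through work, where wins come in 0.01 percent increments, demands enormous dev and test sets while a 1 to 2 percent comparison does not.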
So my advice is: choose your dev and test sets to be big enough that you can see the differences in performance of the algorithms that you roughly expect, and then you don't need to make your dev and test sets much larger than that; I would usually just put the data you don't need in dev and test back into the training set. So when you're working with a very large data set, say a million or ten million or a hundred million examples, what you see is that the percentage of data that goes into dev and test tends to be much smaller. You might see, for example, maybe 90 percent train, 5 percent dev, and 5 percent test, or even smaller, even 1 percent and 1 percent, depending on how much data you really need to measure, to the level of accuracy you need, the differences in performance of your algorithms. Okay. All right.
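To make the shrinking-percentage point concrete, here is a tiny helper of my own; the fractions are the rules of thumb from the lecture.

```python
def split_counts(n, train_frac, dev_frac):
    """Absolute subset sizes for a train/dev/test split; the remainder
    after train and dev goes to test."""
    n_train = round(n * train_frac)
    n_dev = round(n * dev_frac)
    return n_train, n_dev, n - n_train - n_dev

# The classic 60/20/20 split on 10 million examples puts 2 million in dev
# and 2 million in test, far more than most comparisons need...
print(split_counts(10_000_000, 0.60, 0.20))   # (6000000, 2000000, 2000000)
# ...while 98/1/1 still leaves a 100,000-example dev set and test set.
print(split_counts(10_000_000, 0.98, 0.01))   # (9800000, 100000, 100000)
```

At 100 examples the same 1 percent fraction would leave a single dev example, which is why the percentages only shrink once the absolute sizes stay large enough.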
[00:58:21] Um, just to give this whole procedure a name: what we just did here between the train and dev sets is called holdout cross-validation. Sometimes, to distinguish it from other cross-validation procedures we'll talk about in a minute, this is called simple holdout cross-validation; we'll talk about some other cross-validation procedures in a second. And the dev set is sometimes also called the cross-validation set. So sometimes you hear people say, you know, "we're going to use a cross-validation set," and that means roughly the same thing as a dev set. In the normal workflow of developing a learning algorithm, when you're given a data set, I would split it into a training set and a dev set. Oh, and I used to say "cross-validation set," but cross-validation is just a mouthful, so, I think motivated by reducing the number of syllables, because we use the phrase so often, more and more people just say "dev set"; it means roughly the same thing.

[00:59:42] So when I'm building a machine learning system, I'll often take the data, split it into train and dev, and, if you need a test set, also a test set, and then keep on fitting the parameters to the training set and evaluating the performance of the algorithm on the dev set. I'll use that to come up with new features, choose the model size, choose the regularization parameter lambda, really try out lots of different things, and spend, you know, several days or weeks optimizing the performance on the dev set. Then, when you want to know how well your algorithm is performing, you evaluate the model on the test set. The thing to be careful not to do is to make any decisions about your model using the test set, because then you're starting to fit the model to the test set and it's no longer an unbiased estimate.

[01:00:36] One thing that is actually okay to do: if you have a team that's working on a problem, and every week they measure the performance on the test set and report it out on a chart, that's actually okay. You can evaluate the model multiple times on the test set; you can give out a weekly report saying, this week, for our online advertising system, we have this result on the test set; one week later, this result on the test set; and so on. It's actually okay to evaluate your algorithm repeatedly on the test set. What's not okay is to use those evaluations to make any decisions about your learning algorithm. For example, if one day you notice that your model is doing worse this week than last week on the test set, and you use that to revert back to an older model, then you've just made a decision based on the test set, and your test set is no longer unbiased. But if all you do is report the results, and you don't make any decisions based on test set performance, such as whether to revert to an earlier model, then it is actually legitimate, it's actually okay, to keep on using the same test set to track your team's performance over time. Okay. All right. Good.

[01:01:50] So with very large data sets, this is the procedure, you know, for defining the train, dev, and test sets, and this procedure can be used to choose the degree of polynomial; it can also be used to choose the regularization parameter lambda, or the parameter C, or the parameter tau from locally weighted regression. Now, what if you have a very small data set?
[01:02:14] Now, what if you have a very small data set? It turns out that — and I'm going to leave out the test set for now; let's just set it aside and not worry about it — let's say you have 100 examples. If you split this into, you know, 70 in the training set S_train and 30 in S_dev, then you train your algorithm on 70 examples instead of a hundred examples.

[01:02:52] And so, I've actually worked on a few healthcare problems — most of my PhD students, including Anand, are doing a lot of work on machine learning applied to healthcare — and we're actually working on a few data sets in healthcare where, you know, every training example corresponds to some patient that sometimes had an unfortunate disease, or where every example corresponded to injecting a patient with a drug and seeing what happened to the patient. Sometimes there's literally a lot of blood and pain that goes into collecting every example, and if you have a hundred examples, then to hold out 30 of them for the purpose of model selection, using only 70 of your 100 examples — it seems like you're wasting a lot of data that was collected through a lot of, you know, literal pain. So is there a way to do model selection — say, to choose the degree of polynomial — without, quote, "wasting" so much of the data?

[01:03:52] There is a procedure that you should use only if you have a small data set, only if you're worried about its size. Oh, and the other disadvantage of the simple split is that you evaluate your model on only 30 examples, and that seems really small, right — can you find more data to evaluate your models as well? So there's a procedure that you should use only if you have a small data set, called k-fold cross-validation, or k-fold CV, and this is in contrast to simple cross-validation.
[01:04:27] Here's the idea. Let's say this is your training set S: you know, (x^(1), y^(1)) down to, say, (x^(100), y^(100)). What we're going to do is take the training set and divide it into K pieces. For the purpose of illustration I'm going to use K = 5 — just to make the writing on the board simpler — though K = 10 is typical. So you take your data set and divide it into five different subsets; in this example you would have 20 examples in each — 100 examples divided into 5 subsets is 20 examples per subset. And what you do is, for i = 1 to K: train — that is, fit parameters — on K − 1 of the pieces, then test on the remaining one piece, and then you average.

[01:05:54] So in other words, when K equals five, we're going to loop through five times. In the first iteration we hold out the last one fifth of the data, train on the rest, and test on that held-out fifth. Then in the second iteration through this for loop, we train on pieces one, two, three, and five, and test on piece number four, and we get a number. And then you hold out the third piece, train on the others, test on that, and so on. So you're doing this five times, where each time you leave out one fifth of the data, train on the remaining four fifths, and evaluate the model on that held-out one fifth. Okay?

[01:06:38] And so if you're trying to choose the degree of polynomial, what you would do is, for d = 1 to 5: you run this procedure for a first-order polynomial — you fit a linear regression model five times, each time on four fifths of the data, and test on the remaining one fifth — and you repeat this whole procedure for the quadratic function, repeat it for the cubic function, and so on. And — sorry — for each of these models you then average the five test errors you got. After doing this for every order of polynomial from one to five, you pick the degree of polynomial that did best according to this metric. Maybe you find that a second-order polynomial does best.

[01:07:47] And now you actually end up with five classifiers, right — five classifiers, each one fit on four fifths of the data. So there's a final, optional step, which is to refit the model on all 100% of the data. If you want, you could keep the five classifiers around and output their predictions, but then you're keeping five model files around. It may be a bit more common, now that you've chosen to use a second-order polynomial, to just refit the model once on all 100% of the data. Okay?
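The whole loop just described — split into K pieces, fit on K − 1, test on the held-out piece, average the K errors, pick the best degree, and optionally refit on everything — fits in a short sketch. This is my own minimal numpy illustration, not the lecture's code; the noisy quadratic toy data set is an assumption.

```python
import numpy as np

def kfold_cv_error(x, y, degree, k=5, seed=0):
    """Estimate the test MSE of a degree-`degree` polynomial by k-fold CV:
    shuffle, split into k folds, and for each fold fit on the other k-1
    folds and score on the held-out fold; return the average error."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))           # shuffle before splitting
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
        pred = np.polyval(coeffs, x[test_idx])
        errors.append(np.mean((pred - y[test_idx]) ** 2))
    return float(np.mean(errors))

# Toy data: quadratic ground truth plus noise (an assumption for the sketch).
rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 100)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(0.0, 1.0, size=100)

# Model selection: pick the degree with the lowest CV error, then
# (optional final step) refit that one model on all 100% of the data.
cv_errors = {d: kfold_cv_error(x, y, d) for d in range(1, 6)}
best_degree = min(cv_errors, key=cv_errors.get)
final_model = np.polyfit(x, y, best_degree)
```

Passing `k=10` gives the more typical choice; the structure is unchanged.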
[01:08:24] And so the advantage of k-fold cross-validation is that instead of leaving out 30% of your data for your dev set, on each iteration you're only leaving out one over K of your data. I used K = 5 for illustration, but in practice K = 10 is by far the most common choice; I've sometimes seen people use K = 20, but quite rarely. If you use K = 10, then on each iteration you're leaving out just one tenth of the data — 10% — rather than 30% of the data. And so, compared to simple cross-validation, this procedure makes more efficient use of the data, because you're holding out, you know, only 10% of the data on each iteration. The disadvantage is that it's computationally more expensive — you're now fitting each model ten times instead of just once. Okay? But when you have a small data set, this is actually a better procedure than simple cross-validation: if you don't mind the computational expense of fitting each model ten times, this actually lets you get away with holding out less data.

[01:09:44] And then there's one even more extreme version of this, which you should use if you have very, very small data sets. Sometimes you might have an even smaller data set — you know, if you're doing a class project with twenty examples, that's small even by today's machine learning standards. So there's an extreme version of k-fold cross-validation called leave-one-out cross-validation, which is: you set K equal to m. In other words, here's your training set — maybe twenty examples — and you're going to divide it into as many pieces as you have training examples.
[01:10:22] What you do is leave out one example, train on the other nineteen, and test on the one example you held out; then leave out a second example, train on the other nineteen, and test on the one example you held out; and do that twenty times, and then you average over the twenty outcomes to evaluate how good different orders of polynomial are. [01:10:42] The huge downside of this is that it's computationally very, very expensive, because now you need to train your algorithm m times — so you kind of never do this unless m is really small. I personally pretty much never use this procedure unless m is a hundred or less. You know, if your model isn't too complicated, you can afford to fit a linear regression model a hundred times — it's not too bad — so if m is less than 100, you could consider this procedure.
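Leave-one-out cross-validation is just the K = m special case, and the sketch is correspondingly simple. Again, this is an illustrative sketch rather than lecture code; the 20-example linear toy data set is an assumption.

```python
import numpy as np

def loocv_error(x, y, degree):
    """Leave-one-out CV: m tiny training runs, one held-out example each,
    averaging the m squared errors at the end."""
    m = len(x)
    errs = []
    for i in range(m):
        mask = np.arange(m) != i            # train on the other m-1 examples
        coeffs = np.polyfit(x[mask], y[mask], degree)
        pred = np.polyval(coeffs, x[i:i + 1])
        errs.append((pred[0] - y[i]) ** 2)  # test on the one held-out example
    return float(np.mean(errs))

# Twenty examples -- small even by today's standards, as in the lecture.
rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 20)
y = 3.0 * x + rng.normal(0.0, 0.3, size=20)

errors = {d: loocv_error(x, y, d) for d in (1, 2, 3)}
best = min(errors, key=errors.get)
```

Note the m separate fits in the loop — exactly the computational cost being warned about.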
[01:11:09] But if m is a thousand, fitting a model a thousand times seems like quite a lot, and I'd usually use k-fold cross-validation instead. But if you do have twenty examples, then, you know, I would probably use this procedure, and somewhere between twenty and fifty is maybe where I'd switch over from leave-one-out to k-fold cross-validation.

[01:11:46] [Student question] Oh yeah, right — so since you have K estimates, say ten estimates if you're using 10-fold cross-validation, can you measure the variance on those ten estimates? It turns out that those ten estimates are correlated, because for any pair of the ten classifiers, eight of the nine folds each one trained on are shared. There were some very interesting theory results — some research papers written by Michael Kearns, actually, from quite a long time ago — trying to understand how correlated these ten estimates are.
[01:12:24] From a theoretical point of view, as far as I know, the latest results show that this is no worse an estimate than the training error — but maybe what that's telling us is that, in practice, you could measure the variance, but we don't really trust that estimate of variance, because we think all ten estimates are at least somewhat correlated.

[01:12:53] [Student question] Would I find myself using k-fold cross-validation when debugging? Um, if you have a very small training set, then maybe yes, but with deep learning algorithms it depends on the details, right? Sometimes it takes so long to train that training your network twenty times, you know, seems like a pain — unless you have enough data, or unless your neural network is quite small. So it's rarely done with deep learning algorithms. But frankly, if you have so little data — if you have 20 training examples — there are other techniques that you probably need
to use to boost performance, such as transfer learning, or just more hand-engineering of input features, or something else.

[01:13:53] [Student question] Oh — sorry, thank you for asking that. This averaging step? No, I meant averaging the test errors. So here you will have trained ten classifiers, and when you evaluate each one on the left-out one tenth of the data, you get a number, right? You're looping ten times: hold out one part, train on the others, test on the part you left out. And so that would give you a number — say, when you test on this held-out part, the squared error was 5.0; then you do it again and the squared error was 5.7, then 2.8. So by "average" I meant average those numbers, and the average of those numbers is your estimate of the error of, you know, a third-order polynomial for this problem. So this loop gives you K real numbers, and this step is averaging those K numbers to estimate how well a particular degree of polynomial does. Okay? I should leave time for questions — go ahead.

[01:15:12] [Student question] Sure — if you're measuring something other than squared error, say an F1 score, would you do something other than averaging? Yes — averaging F1 scores is more complicated. I think we'll talk about that: this Friday we'll talk about learning theory, and next Friday we're talking about performance evaluation metrics, so I'll talk about F1 score then. All right.

[01:15:35] [Student question] Oh sure — how do you sample the data into these sets? So for the purposes of this class, assuming all your data comes from the same distribution, I would just use a random shuffle. Again, in the era of machine learning and big data,
[01:15:50] there's one other interesting trend, which just wasn't true ten years ago, which is that we're increasingly trying to train and test on different distributions. We're trying to, you know, train on data collected in one context and apply it to a totally different context — say, train on speech collected on your cell phone, because we have all that data, and apply it to a smart speaker, where the data was collected on a different microphone than your cell phone, or something. So if you are doing that, the way you set up your train, dev, and test sets is a bit more complicated. I wasn't going to talk about that in this class, but if you want to learn more — at the start of this course I mentioned I was working on this book, Machine Learning Yearning. That book is finished, and if you go to its website you can get a copy of it for free; it talks about that. I also talk about this more in CS230, which goes more into what to do when the training and test sets come from different distributions — you can also read all about it in Machine Learning Yearning. But yes, randomly shuffling would be a good default if you think your training and test distributions are not too different.

[01:17:01] All right, just one last thing I want to cover real quick, which is feature selection. So sometimes you have a lot of features. Let's take text classification: you might have ten thousand features, because there's one for each of ten thousand words. But you might suspect that many of the features are not important — you know, whether the word "the" appears in an email or not doesn't really tell you whether it's spam.
[01:17:48] Words like "the", "a", and "of" are called stop words — they don't tell you much about the content of the email. So if you have a lot of features, sometimes one way to reduce overfitting is to try to find a small subset of the features that are most useful for your task, right? And so, this takes judgment. There are some problems, like computer vision, where you have a lot of features, corresponding to there being a lot of pixels in every image, but probably every pixel is somewhat relevant, so you don't want to select a subset of pixels for most computer vision tasks. But there are some other problems where you may have a lot of features and you suspect that the way to prevent overfitting is to find a small subset of the most relevant features for your task. So feature selection is a special case of model selection that applies when you suspect that, even though you have ten thousand features,
maybe only 50 of them are highly relevant, right?

[01:18:44] And so, one example: if you are measuring a lot of things going on in a truck in order to figure out if the truck is about to break down — preventive maintenance — you might measure hundreds of variables, or many hundreds of variables, but you might secretly suspect that there are only a few things that, you know, predict when the truck is about to break down, for good preventive maintenance. If you suspect that's the case, then feature selection would be a reasonable approach to try. And so I'll just write out one algorithm, which is: start with script F equal to the empty set of features, and then repeat: (1) try adding each feature i to F, and see which single feature addition most improves the dev set performance; (2) go ahead and commit to adding that feature.

[01:20:18] So let me illustrate this with pictures. Let's say you have five features, x_1 through x_5 — in practice it's more like x_1 through x_500, or up through 10,000, but I'll just use five. You start off with an empty set of features and, you know, train a classifier with no features, so the model is h(x) = theta_0, right, with no features. This won't be a very good model — I think it would just predict the average of the y's, so it's not really a model — but see how well it does on your dev set. So this is step one. In the second iteration, you then take each of these features and add it to the empty set: you try the empty set plus x_1, the empty set plus x_2, and so on up to the empty set plus x_5, and for each of these you fit the corresponding model — for that last one you'd fit h(x) = theta_0 + theta_1 x_5. So you try adding one feature to your model and see which model most improves your performance on the dev set. And let's say you find that adding feature two is the best choice; so now you set the set of features to be {x_2}. For the next step, you then consider starting with x_2 and adding x_1, or x_3, or x_4, or x_5: if your model is already using the feature x_2, what additional feature most helps your algorithm? You fit four models, see which one does best — let's say it's x_4 — and now you commit to using the features x_2 and x_4. And you kind of keep on doing this: keep on adding features greedily, one at a time, seeing which single feature addition most improves your algorithm's performance, and you can keep iterating until adding more features hurts performance, and then pick whichever feature subset allowed you to get the best possible dev set performance.
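The greedy loop just illustrated — grow F from the empty set, commit to whichever single feature most improves dev-set error, stop once no addition helps — can be sketched directly. This is a minimal numpy sketch of my own, with a synthetic data set in which only features 1 and 3 (0-indexed) truly matter, echoing the x_2/x_4 story; all of that is an assumption for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Five candidate features; only features 1 and 3 carry signal.
m = 200
X = rng.normal(size=(m, 5))
y = 2.0 * X[:, 1] - 3.0 * X[:, 3] + rng.normal(0.0, 0.1, size=m)

X_tr, y_tr, X_dev, y_dev = X[:140], y[:140], X[140:], y[140:]

def dev_error(feats):
    """Fit least squares on the chosen feature subset (plus intercept)
    and return dev-set MSE; the empty subset fits just the intercept."""
    A_tr = np.column_stack([np.ones(len(X_tr))] + [X_tr[:, j] for j in feats])
    A_dev = np.column_stack([np.ones(len(X_dev))] + [X_dev[:, j] for j in feats])
    theta, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    return float(np.mean((A_dev @ theta - y_dev) ** 2))

# Forward search: start with F = {} and greedily add the single feature
# whose addition most improves dev-set performance; stop when none helps.
selected, best_err = [], dev_error([])
while True:
    candidates = [j for j in range(5) if j not in selected]
    if not candidates:
        break
    errs = {j: dev_error(selected + [j]) for j in candidates}
    j_best = min(errs, key=errs.get)
    if errs[j_best] >= best_err:        # adding any feature now hurts: stop
        break
    selected.append(j_best)
    best_err = errs[j_best]
```

Backward search, mentioned next, is the mirror image: start with all five features selected and greedily remove one at a time.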
Okay, so this is a special case of model selection called forward search: you start with an empty set of features and add features one at a time. There's also a procedure called backward search, which you can read about, where you start with all the features and remove features one at a time. This would be a reasonable feature selection algorithm; its disadvantage is that it's quite computationally expensive, but it can help you select a decent set of features. [01:22:58] Okay, we're running a little bit late, so let's break. Oh, I think I'm meant to be on the road next week, so we'll have Rafael teach decision trees next week, and he can also talk about neural networks. Okay, so let's break for today, and maybe we'll see some of you at the Friday discussion.

================================================================================
LECTURE 009
================================================================================

Lecture 9 - 
Approx/Estimation Error & ERM | Stanford CS229: Machine Learning (Autumn 2018)
Source: https://www.youtube.com/watch?v=iVOxMcumR4A
---
Transcript

[00:00:03] Okay, welcome everyone. Today we'll be going over learning theory. This used to be taught in the main lectures in previous offerings; this year we're covering it as a Friday section. However, some of the concepts we're covering today are important in the sense that they deepen your understanding of how machine learning works: what assumptions you're making, why things generalize, and so forth. [00:00:38] So here's the agenda for today. We'll quickly start off with framing the learning problem, then go deep into the bias-variance tradeoff and spend some time there, and then look at another way of decomposing the error, as an approximation error and an estimation error.
We'll see what empirical risk minimization is, and then we'll spend some time on uniform convergence and VC dimension. So let's jump right in. [00:01:15] The assumptions under which we're going to operate for this lecture, and in fact for most of the algorithms we cover in this course, are two main assumptions. One is that there exists a data distribution D from which the (x, y) pairs are sampled. This makes sense in the supervised learning setting, where you're expected to learn a mapping from x to y, but the assumption actually holds more generally, even in the unsupervised setting. The main assumption is that there is a data-generating distribution, and that the examples we have in our training set and the ones we encounter when we test are all coming from the same distribution.
That's the core assumption; without it, coming up with any theory would be much harder. So the assumption is that there is some data-generating process, and we have a few samples from that process, which become our training set. That's a finite number, though in principle you could draw infinitely many samples from the data-generating process, and the examples we encounter at test time are also samples from the same process. [00:03:01] The second assumption is that all these samples are sampled independently. With these two assumptions, we can picture learning like this: we have a set of (x, y) pairs, which we call S; these are just (x1, y1), ..., (xm, ym), so we have m samples drawn from the data-generating process, and we feed this into a learning algorithm.
[00:04:02] The output of the learning algorithm is what we call a hypothesis. A hypothesis is a function which accepts a new input x and makes a prediction about y for that x. The hypothesis is sometimes also written in the form theta hat: if we restrict ourselves to a class of hypotheses, for example all possible logistic regression models of dimension n, then obtaining the parameters is equivalent to obtaining the hypothesis function itself. [00:04:45] A key thing to note here is that S is a random variable, while the learning algorithm is a deterministic function. And what happens when you feed a random variable through a deterministic function? You get a random variable. Exactly: so the hypothesis that we get is also a random variable. [00:05:27] Now, all random variables have a distribution associated with them.
The distribution associated with the data is the data distribution, capital D; the learning algorithm is just a fixed deterministic function; and the parameters we obtain have a certain distribution as well. In a more statistical setting we would call this an estimator: if you take some advanced statistics courses, what you'll come across as an estimator is what we here call a learning algorithm. [00:06:11] The distribution of theta hat is also called the sampling distribution. And what's implied in this process is that there exists some theta star, or H star if you prefer to view it that way, which is in some sense the true parameter, the parameter we wish were the output of the learning algorithm. But of course we never know what theta star is, and what we get out of the learning algorithm is just going to be a sample from a random variable.
[00:07:09] Now, a thing to note is that theta star (or H star) is not random; it's just an unknown constant. When we say it's not random, we mean there is no probability distribution associated with it: it's just a constant which we don't know. That's the assumption under which we operate. [00:07:36] Now let's look at some properties of this theta hat. All the entities we estimate are generally decorated with a hat on top, which indicates something we estimated, and anything with a star is the true or right answer, which we don't have access to in general. Any questions so far? Yeah. [00:08:12] So the question is what theta looks like, say in the case of linear or logistic regression: in linear regression it generally happens to be a vector.
It could also be a scalar, or a matrix; it's just an entity that we estimate. And sometimes theta star can be so generic that it need not even be parameterized: it's just some function that you estimate. So it could be a vector, a scalar, or a matrix; it could be anything. [00:08:52] So, in the lecture we saw this diagram when we were talking about bias and variance in the case of regression: this fit is underfit, this one is overfit, and this one is just right. The concepts of underfitting and overfitting are closely related to bias and variance. This is how you would view it from the data: this axis is x, this is y, this is your data, and from the data point of view these are the kinds of fits you might get from different algorithms.
However, to get a more formal view of what bias and variance are, it's more useful to look at the parameter view. [00:10:35] So let's imagine we have four different learning algorithms, and here is the parameter space: theta_1 and theta_2, imagining you have just two parameters, since that's easy to visualize. These correspond to algorithm A, algorithm B, C, and D, and there is a true theta star, which is unknown. [00:11:30] Now let's imagine we run through this process of sampling m examples, running them through the algorithm, and obtaining a theta hat; then we start over with a new sample from D, run it through the algorithm, and get a different theta hat. And the theta hat is going to be different for different learning algorithms.
[00:11:56] So first we sample some data, that's our training set; we run it through algorithm A, and let's say this is the parameter we got; then through algorithm B, and this is the parameter we got; and through C here, and through D over here. And we're going to repeat this: the second sample may land here, and so on, and we repeat this process over and over. The key is that the number of samples per run, m, is fixed; but each time we repeat the process we get a different point over here. [00:12:52] So each dot corresponds to a sample of size m, and the number of points is the number of times we repeated the experiment. And what we see is that these dots are samples from the sampling distribution. Now the concept of bias and variance is visible right here.
[00:13:23] If we were to classify these four plots, we would call this bias and variance: these two are algorithms that have low bias, these two have high bias; these two have low variance, these two have high variance. So what does this mean? Bias is asking whether the sampling distribution is centered around the true, unknown parameter, and variance is measuring how dispersed the sampling distribution is. Formally speaking, this is bias and this is variance, and it becomes pretty clear when we see it in the parameter view instead of the data view. Essentially, bias and variance are just properties of the first and second moments of your sampling distribution: you're asking whether the first moment, the mean, is centered around the true parameter.
And the second moment, the variance, is literally the variance of the bias-variance tradeoff. [00:14:51] Yeah, so this is a diagram where I'm using only two thetas just so it fits on a whiteboard. You would imagine something that has high variance, for example this one, to be of a much higher dimension, not just two; but it would still be spread out, it would still have high variance; the points would live in a higher-dimensional space but be more spread out. [00:15:26] So the question was: over here we actually had more thetas, while here, with the higher-variance plots, we have the same number of thetas. So yes, you could imagine this to be higher-dimensional; and also, different algorithms can have different bias and variance even though they have the same number of parameters.
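The "parameter view" can be reproduced numerically by literally repeating the experiment. A small sketch with assumed ingredients (none of this is from the lecture's board): the data are Gaussian around a true parameter theta* = 2, and the two "learning algorithms" are two estimators of the mean, one unbiased and one deliberately shrunk toward zero.

```python
import random
import statistics

THETA_STAR = 2.0

def sampling_distribution(estimator, m=50, trials=2000, seed=0):
    """Draw a fresh size-m training sample, run the deterministic estimator,
    and repeat: the collected theta-hats are draws from the sampling
    distribution of that estimator."""
    rng = random.Random(seed)
    return [estimator([rng.gauss(THETA_STAR, 1.0) for _ in range(m)])
            for _ in range(trials)]

def sample_mean(xs):      # unbiased estimator of the mean
    return statistics.fmean(xs)

def shrunk_mean(xs):      # biased toward 0, but lower variance
    return 0.5 * statistics.fmean(xs)

for name, est in [("sample mean", sample_mean), ("shrunk mean", shrunk_mean)]:
    draws = sampling_distribution(est)
    bias = statistics.fmean(draws) - THETA_STAR  # first moment vs. theta*
    var = statistics.variance(draws)             # spread of the sampling dist.
    print(f"{name}: bias {bias:+.3f}, variance {var:.4f}")
```

The first estimator's cloud of dots is centered on theta* with spread about 1/m; the shrunk one is tighter but centered at 1.0, i.e. lower variance bought at the price of bias.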
For example, if you add regularization, the variance comes down; we'll go over that. [00:16:03] A few observations we want to make: as we increase the size of the data we feed in each time, that is, if we take a bigger sample every time we learn, the variance of theta hat becomes smaller. If we repeat the same process with a larger number of examples, all of these clusters become more tightly concentrated; so the spread is a function of how many examples we have in each iteration. [00:16:47] As m tends to infinity, the variance tends to zero: if you were to collect an infinite number of samples and run them through the algorithm, you would get some particular theta hat, and if you were to repeat that, again with an infinite number of examples, you would always keep getting the same theta hat.
[00:17:16] Now, the rate at which the variance goes to zero as you increase m is what's called the statistical efficiency: it's a measure of how efficient your algorithm is at squeezing information out of a given amount of data. And if theta hat tends to theta star as m tends to infinity, you call such algorithms consistent. [00:18:04] And if the expected value of theta hat is equal to theta star for all m, so that no matter how big your sample size is you always end up with a sampling distribution centered around the true parameter, then your estimator is called an unbiased estimator.
[00:18:30] Yes, so efficiency is the rate at which the variance drops to zero as m tends to infinity. For example, you may have one algorithm where the variance goes as 1/m^2 and another where it goes as e^(-m); the variance can drop at different rates relative to m, and that's what efficiency captures. [00:19:08] Yeah, so the question is what it means for theta hat to approach theta star. Here's one thing to be clear about: theta star is a number, a constant, but theta hat is a random variable. What we are saying is that as m tends to infinity, the distribution of theta hat converges towards being a constant, and that constant is theta star. That means at smaller values of m your algorithm might be centered elsewhere, but as you get more and more data, your sampling distribution's variance reduces and it eventually gets centered around the true theta star.
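The claim that the sampling distribution collapses onto a constant as m grows is easy to check by simulation. A sketch under assumed conditions: Gaussian data around a true parameter theta* = 2, with the sample mean as the learning algorithm; its variance should fall off like 1/m, which is exactly the rate the lecture calls its efficiency.

```python
import random
import statistics

THETA_STAR = 2.0

def theta_hat(m, rng):
    """One run of the 'learning algorithm': draw m samples, return the estimate."""
    return statistics.fmean(rng.gauss(THETA_STAR, 1.0) for _ in range(m))

rng = random.Random(1)
for m in (10, 100, 1000):
    draws = [theta_hat(m, rng) for _ in range(2000)]
    print(f"m={m}: mean {statistics.fmean(draws):.3f}, "
          f"variance {statistics.variance(draws):.5f}")  # roughly 1/m
```

The mean of the draws stays near theta* at every m (the estimator is unbiased), while the variance shrinks toward zero, which is the consistency picture described above.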
[00:20:11] Informally speaking, if your algorithm has high bias, it means that no matter how much data or evidence you provide, it keeps away from theta star: you cannot change its mind. No matter how much data you feed it, it's never going to center itself around theta star; it's biased away from the true parameter. And variance you can think of as your algorithm being highly distracted by the noise in the data, easily getting swayed far away depending on that noise; such algorithms you would call high-variance, because they can easily be swayed by noise in the data. And as we're seeing here, bias and variance are kind of independent of each other: an algorithm can have any combination of bias and variance; there is no correlation between bias and variance.
[00:21:17] So how do we fight variance? First let's look at how we can address variance. Yes, the question: bias and variance are properties of the algorithm at a given sample size m. These plots were for a fixed size m, and for that fixed data size this algorithm has high bias and low variance, this one has high variance and high bias, and so on. Yeah, you can think of it as assuming a fixed data size. [00:22:08] So, fighting variance: one way to address a high-variance situation is simply to increase the amount of data you have, which naturally reduces the variance of your algorithm. Yes, that is true: you don't know upfront whether you're in a high-bias or high-variance scenario. One way to test that is by looking at your training performance versus your test performance.
kind of test that is by looking at your training [00:22:57] is by looking at your training performance versus test performance we [00:23:00] performance versus test performance we go over that arm in fact we're going to [00:23:03] go over that arm in fact we're going to go into you know much more detail in the [00:23:06] go into you know much more detail in the main lectures of how do you identify [00:23:07] main lectures of how do you identify bias and variance here we're just going [00:23:09] bias and variance here we're just going over the concepts of what our bias and [00:23:11] over the concepts of what our bias and what our variance so one way to address [00:23:16] what our variance so one way to address variances you just get more data right [00:23:18] variances you just get more data right as you get more data the your sampling [00:23:21] as you get more data the your sampling distributions kind of tend to get more [00:23:23] distributions kind of tend to get more concentrated the other way is what's [00:23:27] concentrated the other way is what's called as regularization [00:23:32] so when you when you had regularization [00:23:35] so when you when you had regularization like l2 regularization or l1 [00:23:37] like l2 regularization or l1 regularization what we are effectively [00:23:41] regularization what we are effectively doing is let's say we have an algorithm [00:23:45] doing is let's say we have an algorithm with high variance maybe low bias no [00:23:53] with high variance maybe low bias no bias high variance and you add [00:23:58] bias high variance and you add regularization right what you end up [00:24:01] regularization right what you end up with is an algorithm that has maybe a [00:24:10] with is an algorithm that has maybe a small bias you increase the bias by [00:24:13] small bias you increase the bias by adding regularization but low variance [00:24:19] so if what you care about is your [00:24:22] so if what you care about is your 
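As a rough illustration of the trade just described (my own sketch, not from the lecture; the model and data are made up), the snippet below fits an unregularized least-squares slope and an L2-regularized ("ridge") slope on many fresh noisy samples, then compares how far each estimator's average lands from the true parameter (bias) and how much the estimates spread (variance).

```python
import numpy as np

rng = np.random.default_rng(0)
TRUE_THETA = 2.0     # the "theta star" of the lecture
m, lam, trials = 20, 5.0, 2000

ols, ridge = [], []
for _ in range(trials):
    x = rng.uniform(0.0, 1.0, m)
    y = TRUE_THETA * x + rng.normal(0.0, 1.0, m)   # fresh noisy sample each trial
    ols.append(x @ y / (x @ x))                    # least-squares slope, no regularization
    ridge.append(x @ y / (x @ x + lam))            # L2-regularized slope (closed form)

ols, ridge = np.array(ols), np.array(ridge)
print(f"OLS:   mean={ols.mean():.3f}  std={ols.std():.3f}")
print(f"ridge: mean={ridge.mean():.3f}  std={ridge.std():.3f}")
# The ridge estimates cluster more tightly (lower variance) but their
# center is pulled below TRUE_THETA (higher bias), as described above.
```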
[00:24:22] So if what you care about is your predictive accuracy, you're probably better off trading your high variance for some bias and reducing your variance to a large extent. [00:24:38] Yeah, we are going to look into that next. [00:24:54] So in order to get a better understanding of this, think of this as the space of hypotheses. Let's assume there exists a hypothesis, call it g, which is the best possible hypothesis you can think of. By best possible hypothesis I mean: if you were to take this hypothesis and take the expected value of the loss with respect to the data-generating distribution, across an infinite amount of data, you would have the lowest error with it. So this is the best possible hypothesis. [00:25:51] And then there is this class of hypotheses; let's call the class H. This, for example, can be the set of all logistic regression hypotheses, or the set of all SVMs. So this is a class of hypotheses, and what we end up with when we take a finite amount of data is some member over here; let me call it h hat. [00:26:28] There is also some hypothesis in this class, let me call it h star, which is the best-in-class hypothesis: within the set of all logistic regression functions there exists some model which would give you the lowest error if you were to test it on the full data distribution. The best possible hypothesis may not be inside your hypothesis class; it's conceptually something that can lie outside the class. [00:27:07] So g is the best possible hypothesis, h star is the best in class H, and h hat is the one you learned from finite data. [00:27:49] We also introduce some new notation.
[00:27:53] Epsilon of h: we will call this the risk, or generalization error, and it is defined to be the expectation over (x, y) sampled from D of the error indicator. You sample examples from the data-generating process, run them through the hypothesis, and check whether the output matches the label: if it doesn't match you get a 1, and if it matches you get a 0. So on average this is, roughly speaking, the fraction of all examples on which you make a mistake. [00:28:54] Here we are thinking about this from a classification point of view, checking whether the class you output matches the true class or not. You can also extend this to the regression setting, though that's a little harder to analyze; the generalization to the regression setting holds, but we'll stick to classification for now. [00:29:21] And we have epsilon hat sub S of h, and this is called the empirical risk, or empirical error. The difference here is that the first is an infinite process: you're sampling from D forever and calculating the long-term average. Whereas here you have a finite sample that's given to you, and you measure the fraction of those examples on which you make an error. [00:30:15] All right, before we go further, there was a question about how adding regularization reduces your variance; actually, let me get back to that in a bit. [00:30:39] So epsilon of g is called the Bayes error. This essentially means: if you take the best possible hypothesis, what is the rate at which you make errors? And that can be nonzero.
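To make the two quantities concrete, here is a small sketch (my own illustration, not from the lecture; the distribution and hypothesis are made up) that computes the empirical risk on a finite sample and approximates the risk by a very large sample standing in for the infinite expectation.

```python
import numpy as np

rng = np.random.default_rng(1)

def h(x):
    """A fixed hypothesis: predict class 1 when x > 0."""
    return (x > 0).astype(int)

def sample(n):
    """Toy data-generating distribution D: the label follows x, with some noise."""
    x = rng.normal(0.0, 1.0, n)
    y = ((x + rng.normal(0.0, 0.5, n)) > 0).astype(int)
    return x, y

# Empirical risk: fraction of mistakes on a finite sample S of size m = 50.
x_s, y_s = sample(50)
emp_risk = np.mean(h(x_s) != y_s)

# Risk (generalization error): an expectation over D, approximated here
# by a million draws standing in for the infinite sampling process.
x_big, y_big = sample(1_000_000)
risk = np.mean(h(x_big) != y_big)

print(f"empirical risk (m=50): {emp_risk:.3f}")
print(f"risk (approx.):        {risk:.3f}")
```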
[00:30:56] Even the best possible hypothesis can still make some mistakes, and this is also called the irreducible error. For example, if your data-generating process spits out examples where the same x has different y's in two different examples, then no learning algorithm can do well in such cases; that's just one kind of irreducible error, and there can be other kinds of irreducible error as well. [00:31:42] Epsilon of h star minus epsilon of g is called the approximation error. This essentially means: what is the price we are paying for limiting ourselves to some class? It's the difference between the best possible error that you can get overall and the best possible error you can get from within H. So this is an attribute of the class: what is the cost you pay for restricting yourself to a class? [00:32:21] And then you have epsilon of h hat minus epsilon of h star, and this we call the estimation error: given the data that we got, the m examples, and the h hat that our estimator produced from them, what is the error due to estimation? [00:33:06] So the error of g is the Bayes error, the gap between that and the best in class is the approximation error, and the gap between the best in class and the hypothesis you end up with is the estimation error. [00:33:24] And it's easy to see that epsilon of h hat is actually equal to the estimation error plus the approximation error plus the Bayes error: if you just add them up, the intermediate terms cancel out and you're left with epsilon of h hat. So it's useful to think about your generalization error as made up of different components.
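Written out in symbols, the decomposition just described is a telescoping sum (standard notation, with epsilon as defined above):

```latex
\varepsilon(\hat{h})
 \;=\; \underbrace{\varepsilon(\hat{h}) - \varepsilon(h^{*})}_{\text{estimation error}}
 \;+\; \underbrace{\varepsilon(h^{*}) - \varepsilon(g)}_{\text{approximation error}}
 \;+\; \underbrace{\varepsilon(g)}_{\text{Bayes error}}
```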
[00:34:13] There is some error which you just cannot reduce, no matter what hypothesis you pick and no matter how much training data you have; there's no way to get rid of the irreducible error. Then you make some decisions, say that you're going to limit yourself to neural networks or logistic regression or whatever, and thereby you're defining a class of all possible models, and that has a cost itself: that's your approximation error. And then you are working with limited data (this part is generally due to data), and with the limited data that you have, and possibly due to some nuances of your algorithm, you also have an estimation error. [00:34:47] We can further see that the estimation error can be broken down into estimation variance and estimation bias, and you can therefore write the generalization error as estimation variance, plus estimation bias, plus approximation error, plus Bayes error. What we commonly call the variance is the estimation variance; what we call the bias is the estimation bias together with the approximation error; and the Bayes error is just irreducible. [00:35:34] So sometimes you see the bias-variance decomposition and sometimes you see the estimation-approximation error decomposition; they are somewhat related, but not exactly the same. [00:35:47] The bias is basically trying to capture why h hat is far from g: why did our hypothesis stay away from the true hypothesis? That could be because your class is too small, or it could be due to other reasons, such as, as we'll see, maybe regularization that keeps you away from certain hypotheses. [00:36:21] And the variance is generally due to, almost always due to, having small data, though it could be due to other reasons as well. But these are two different ways of decomposing your error.
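The two notions can be seen numerically with a small Monte Carlo sketch (my own illustration, not from the lecture; the target function and degrees are made up): fit a small hypothesis class (lines) and a bigger one (degree-5 polynomials) on many fresh noisy samples, then measure, at one test point, the systematic offset of the average prediction from the truth (bias) and the spread of the predictions (variance).

```python
import numpy as np

rng = np.random.default_rng(2)

def true_f(x):
    return np.sin(2 * np.pi * x)      # stands in for the "true hypothesis" g

def fit_and_predict(degree, x0, trials=500, m=15):
    """Fit a degree-`degree` polynomial to fresh noisy samples; predict at x0."""
    preds = []
    for _ in range(trials):
        x = rng.uniform(0, 1, m)
        y = true_f(x) + rng.normal(0, 0.3, m)
        coef = np.polyfit(x, y, degree)
        preds.append(np.polyval(coef, x0))
    preds = np.array(preds)
    bias = preds.mean() - true_f(x0)  # systematic offset from the truth
    var = preds.var()                 # spread caused by the random sample
    return bias, var

b1, v1 = fit_and_predict(degree=1, x0=0.25)  # small class: high bias, low variance
b5, v5 = fit_and_predict(degree=5, x0=0.25)  # bigger class: low bias, higher variance
print(f"degree 1: bias={b1:+.3f}  variance={v1:.3f}")
print(f"degree 5: bias={b5:+.3f}  variance={v5:.3f}")
```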
[00:36:37] So now, if you have high bias, how do you fight high bias? [00:36:50] Any guesses? Mm-hmm, yeah, exactly. So one way is to just make your H bigger, and you can also try different algorithms after making your H bigger. [00:37:18] What this generally means: what we saw there was that regularization reduces your variance by paying a small cost in bias, and over here, let's say your algorithm has high bias and some variance, and you make H bigger, you make your class bigger. This generally results in something which reduces your bias but also increases your variance. [00:38:25] So with this picture you can also see how variance comes into the picture: just by having a bigger class, there is a higher probability that the hypothesis you estimate can vary a lot. If you reduce the space of hypotheses, you may be increasing your bias, because you may be moving away from g, but you're also effectively reducing your variance. [00:38:55] So that's one of the trade-offs you observe: any step you take toward reducing bias, for example by making H bigger, also makes it possible for your h hat to land in a wider space, and that increases your variance. And if you take a step toward reducing your variance, maybe by making your class smaller, you may end up making it smaller in a way that moves away from g, and thereby increase your bias. [00:39:26] So, to come back to the question somebody asked before, how does adding regularization decrease the variance? By adding regularization you're effectively shrinking the class of hypotheses that you have: you start penalizing those hypotheses whose theta is very large, and in a way you're shrinking the class of hypotheses. If you shrink the class of hypotheses, your variance is reduced, because there's much less wiggle room for your estimator to place your h hat; and if you shrink it by going away from g, you also introduce bias. That's the bias-variance tradeoff. Any questions on this so far? [00:40:31] Yeah, you probably want to think of that diagram as a generalized version of this one: here we have fixed data and, say, theta 1 and theta 2, and because you could parameterize the hypotheses with a few parameters, you can plot them in parameter space.
But that diagram is more general, like a bag of hypotheses. In any case, in both of those diagrams a point is one hypothesis; here it's parameterized, there it's not parameterized. [00:41:16] Yes: so the question is, what if we shrink it towards h star? The thing is, we don't know where h star is. If we knew it, we wouldn't even need to learn anything; we could just go straight there. [00:41:44] So the question is: when you add regularization, are we sure that the bias goes up? No, we don't know, but this is the common scenario of what happens: when you add regularization you reduce the variance for sure, and you're very likely going to introduce some bias in that process. [00:42:08] If you add regularization you're shrinking your hypothesis space in some way, so you're kind of moving away from the true g, so you're adding a little bit of bias; you're very likely to add some bias in that process. [00:42:26] So I would encourage you, after this lecture, to think about this a little more slowly. It takes a while to internalize the concepts of bias and variance, and it's not very intuitive, but thinking about it more definitely helps. [00:42:45] All right, any other questions before we move on? So, an example of a hypothesis class: an example would be the set of all logistic regression models. When you do gradient descent on your logistic regression class, you are implicitly restricting yourself to the set of all possible logistic regression models; that's implicit. [00:43:22] So the h is the output of the learning algorithm: you feed an input to your algorithm. This box is not the model; this is the learning algorithm, like gradient descent, for example, and the output of it is the parameters that you learned, that you converge to. So you probably don't want to think of this as the model that you learned, but as the training process, and the output of the training process is the model that you learn, and that is a point in your class of hypotheses. [00:44:07] Yes: so you fix the class of models, you say I'm only going to learn logistic regression models; for different samples of data that you feed it as its training set, it's going to learn a different theta hat, but those all have to be within the class of hypotheses. [00:44:28] All right, so let's move on. [00:44:56] So next we come across this concept called empirical risk minimization.
[00:45:38] The empirical risk minimizer is a learning algorithm; it is one of those boxes that we drew, the box we drew earlier labeled "learning algorithm". [00:46:04] The diagram that we drew earlier, based on which we have reasoned about everything so far, didn't actually tell you what happens inside the box. It could be doing gradient descent, it could do something else, it could be some smart programmer who's written a whole bunch of if-elses and just returns a theta; it could be anything. And no matter what kind of algorithm is used, the bias-variance theory still holds. [00:46:32] Now we are going to look at a very specific type of learning algorithm called the empirical risk minimizer: you feed data into your algorithm and you get out, not h star, but h hat, equal to the minimizer of the empirical risk over the class. [00:47:10] So what does ERM, empirical risk minimization, do? It's what we've been doing so far in the course: we try to find the hypothesis in a class of hypotheses that minimizes the average training error. [00:47:38] So, for example, this is trying to minimize the training error; from a classification perspective, this is minimizing the training error, or increasing the training accuracy, which is different from what logistic regression actually did, where we were doing maximum likelihood, minimizing the negative log-likelihood. It can be shown that losses like the logistic loss can be well approximated by ERM, and this theory should hold nonetheless. [00:48:09] So we are limiting ourselves to the class of algorithms which work by minimizing the training loss, as opposed to something that, say, returns a constant all the time or does something else. If we limit ourselves to empirical risk minimizers, then we can come up with more theoretical results, for example uniform convergence.
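ERM is easiest to see on a finite hypothesis class. Here is a small sketch (my own illustration, not from the lecture; the class of threshold classifiers and the data are made up): evaluate the empirical risk of every hypothesis in the class on the training set and return the one with the lowest training error, the "h hat" above.

```python
import numpy as np

rng = np.random.default_rng(3)

# A small finite hypothesis class H: threshold classifiers h_t(x) = 1{x > t}.
thresholds = np.linspace(-2, 2, 41)

def empirical_risk(t, x, y):
    """Fraction of training mistakes made by hypothesis h_t."""
    return np.mean((x > t).astype(int) != y)

# Training set drawn from a toy distribution whose true threshold is 0.
m = 100
x = rng.normal(0, 1, m)
y = ((x + rng.normal(0, 0.3, m)) > 0).astype(int)

# Empirical risk minimization: pick the hypothesis in H with the
# lowest training error.
risks = [empirical_risk(t, x, y) for t in thresholds]
h_hat = thresholds[int(np.argmin(risks))]
print(f"ERM picks threshold {h_hat:+.2f} with training error {min(risks):.3f}")
```

Note that ERM says nothing about *how* the minimizer is found; here it is brute-force search, but gradient descent on a surrogate loss plays the same role for logistic regression.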
[00:49:02] So we are limiting ourselves to empirical risk minimizers, and we're starting off with uniform convergence. There are two central questions that we are interested in. [00:49:36] One question is: if we do empirical risk minimization, that is, if we just reduce the training loss, what does that say about the generalization error? That is basically ε̂(h) versus ε(h): consider some hypothesis; it gives you some amount of training error; what does that say about its generalization error? That's one central question we want to consider. [00:50:14] And the second one is: how does the generalization error of our learned hypothesis compare to the best possible generalization error in that class? Note we're only talking about h* and not g here; h* is the
best in class so these are [00:50:38] is is the best in class so these are these are two central questions that we [00:50:39] these are two central questions that we want to we want to explore and for this [00:50:43] want to we want to explore and for this we're going to use two tools right so [00:50:49] we're going to use two tools right so one is called the Union bound right [00:50:55] one is called the Union bound right what's the Union bound if we have [00:51:00] what's the Union bound if we have see different events - okay then [00:51:07] see different events - okay then this need not be independent then the [00:51:16] this need not be independent then the probability of if this looks trivial it [00:51:40] probability of if this looks trivial it is trivial it's it's it's probably one [00:51:41] is trivial it's it's it's probably one of the axioms in in in your undergrad [00:51:45] of the axioms in in in your undergrad probability class the the probability of [00:51:48] probability class the the probability of any one of these events happening is [00:51:50] any one of these events happening is less than or equal to the sum of the [00:51:53] less than or equal to the sum of the probabilities of of each of them [00:51:55] probabilities of of each of them happening right and then we have a [00:51:59] happening right and then we have a second tool is called the halflings [00:52:08] second tool is called the halflings inequality we're only going to state the [00:52:18] inequality we're only going to state the inequality here there is a supplemental [00:52:21] inequality here there is a supplemental notes on the website that actually [00:52:23] notes on the website that actually proves the tufting inequality you can go [00:52:25] proves the tufting inequality you can go through that but here we are only going [00:52:29] through that but here we are only going to state the result in fact throughout [00:52:31] to state the result in fact throughout this session you 
[00:52:35] So let Z₁, Z₂, …, Z_m be sampled from some Bernoulli distribution with parameter φ, and let φ̂ be the average of the Zᵢ, φ̂ = (1/m) Σᵢ Zᵢ, and let there be a γ > 0, which we call the margin. [00:53:17] The Hoeffding inequality basically says that the probability that the absolute difference between the estimated parameter φ̂ and the true parameter φ is greater than some margin can be bounded by two times the exponential of minus two gamma squared m: P(|φ̂ − φ| > γ) ≤ 2 exp(−2γ²m). Not very obvious, but you can show this. [00:53:55] What it is basically saying is: there is some parameter between 0 and 1 of a Bernoulli distribution; the fact that it is between 0 and 1 means it's bounded, and that's a key requirement for the Hoeffding inequality. And now we take samples from this Bernoulli distribution, and the estimator for this is basically
the average of the samples: each of the Zᵢ is either a 0 or a 1, sampled with probability φ, and the estimator is basically just the average of your samples, right? [00:54:39] And the probability that the absolute difference between the estimated value and the true value becomes greater than some margin γ is bounded by this expression. So there are a lot of things happening here; let's slowly think through this. γ is the margin, and |φ̂ − φ| is basically the deviation, or the error: it's the absolute value of how far away your estimated value is from the true one, and you would like it to be close. [00:55:21] So you probably want your φ̂ and φ to differ by not more than, say, 0.001, in which case, if the absolute difference between the estimated and the true parameter being greater
than 0.001 is the margin that you're interested in, then the Hoeffding inequality says that if you were to repeat this process over and over and over, the fraction of times φ̂ is going to be farther than 0.001 from the true parameter is going to be less than this expression, which is a function of m. [00:56:02] And you can kind of believe it, because as m increases this bound becomes smaller, which means the probability of your estimate deviating by more than a certain margin only decreases as you increase m. So this is Hoeffding's inequality, and we're going to use this. Questions? [00:56:35] [Student question] So the question is: is h* the limit of ĥ as m goes to infinity? It is h* in the limit as m goes to infinity if ĥ is a consistent estimator; we went over the concept of consistency: given infinite data, will you eventually get to the right answer?
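[Editor's aside: the Hoeffding bound just stated can be checked by repeating the estimation experiment many times. The values of φ, γ, and m below are arbitrary illustrative choices.]

```python
import numpy as np

rng = np.random.default_rng(2)

phi, gamma, m = 0.5, 0.1, 200   # true parameter, margin, sample size
reps = 20_000                   # how many times we repeat the experiment

# Each row: m Bernoulli(phi) draws; phi_hat is their average.
z = rng.binomial(1, phi, size=(reps, m))
phi_hat = z.mean(axis=1)

# Observed frequency of the "bad" event |phi_hat - phi| > gamma ...
freq = float(np.mean(np.abs(phi_hat - phi) > gamma))
# ... versus the Hoeffding bound 2 * exp(-2 * gamma^2 * m).
bound = 2 * np.exp(-2 * gamma**2 * m)
print(freq, "<=", bound)
```

[As the lecture says, increasing m shrinks the bound, and the observed deviation frequency shrinks along with it.]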
[00:56:57] And if your estimator is not consistent, then it need not be; so in general ĥ need not converge to h* as you get an infinite amount of data. So now we want to use these tools, tool 1 and tool 2, to answer our central questions. Any other questions? [00:57:24] [Student question] Yes, this is a more limited version of Hoeffding's inequality, and yes: if we limit ourselves to a Bernoulli variable which has some parameter φ, and you take samples from it and you construct an estimator which is the average of the samples, of the zeros and ones, then this inequality holds; this inequality is called the Hoeffding inequality. [00:58:11] [Student question] So in general, there is this class of algorithms called maximum likelihood estimators, and a pure maximum likelihood estimator is generally consistent. If you include regularization, then it need not
be consistent, though I'm not very sure about that. [00:59:14] So basically, repeating for the mic, what he responded was: if you have an algorithm like a neural net, which is non-convex, you may actually not end up with the same result even if you increase the number of samples. Though I would probably think of the non-convexity as part of an estimation bias, because you could in theory always find the global minimum of a neural network; it's just that there's some bias in our estimator, in that we are using gradient descent and cannot solve it exactly. [00:59:54] Okay, so now let's use these two tools, and for that we're going to start with this diagram. So over here on this axis we have hypotheses, and here we have error. [01:00:34] There's actually one curve which I'm trying to make thick, and it probably looks like
multiple curves, but it's just one curve, and this we will call the generalization risk, or the generalization error, of every possible hypothesis in our class. So pick one hypothesis; that's going to be somewhere on this axis; calculate the generalization error, not the empirical error but the generalization error, and that's the height of that curve. [01:01:15] And we also have something like this dotted line, which corresponds to ε̂(h). Now let's sample a set of m examples, calculate the empirical error of all the hypotheses in our class, and plot that as a curve. Any questions on what these two curves are? [01:01:56] [Student question] Yeah, it need not be, and in fact this is very likely not even a nice line like this; you're just thinking of all possible hypotheses, and it need not be convex. This is just to get some intuition on some of these ideas. Yes, so
the black line, the thick black line, is the generalization error of all your hypotheses, right? And let's say you sample some data; let's call it S. On that sample you have a training error for all possible hypotheses; we haven't learnt anything yet. So this is the generalization error, and this is the empirical error for the given S. [01:02:47] Now, in order to apply Hoeffding's inequality here, let's consider some hᵢ. This is some hypothesis; we start with some fixed hypothesis, so think of this as starting with some parameter. [01:03:20] And the height of this line up to the thick black curve is basically the generalization error of hᵢ, so let me call this ε(hᵢ). And the height to
the dotted curve, up to here, is ε̂(hᵢ); I'm going to ignore the S subscript for now, and this corresponds to the sample that we obtained. [01:04:21] Now, one thing you can check is that the expected value of ε̂(hᵢ) equals ε(hᵢ), where the expectation is with respect to the data, the sample. What this means is: for one particular sample, this is the empirical error you got; take another set of m samples, and that curve might look some other way, and the height of the dotted line would be different. [01:05:03] In general, on average, if you average across all possible training samples that you can get, the expected value of the height to the dotted line is going to be the height to the thick line; that's just a fact. Now here, if you apply Hoeffding's inequality, you basically get: the probability of the absolute difference between the empirical error
versus the generalization error being greater than γ is less than or equal to 2 exp(−2γ²m): P(|ε̂(hᵢ) − ε(hᵢ)| > γ) ≤ 2 exp(−2γ²m). This is basically the Hoeffding inequality we have right here, except in place of φ and φ̂ we have the true generalization error and the empirical error. Any questions on this so far? [01:06:02] So what we are saying is, essentially, that the gap between the generalization error and the empirical error being greater than some margin γ is going to be bounded by this expression. So loosely speaking, what this means is: as we increase the size m, if we plot the set of all dotted lines for a larger m, they are going to be more concentrated around the black line. Does that make sense? [01:06:51] So take a moment and think about it. This dotted line corresponds to an S of some particular size m. We could take another sample of a fixed set of examples, and that might
look something like this, and take another sample of size m, and that might look something like this. [01:07:17] Now consider the set of all deviations from the black line to every possible dotted line, along the vertical line at hᵢ. This gap is greater than some margin γ with probability less than this term over here. So it essentially means that if you start plotting dotted lines with a bigger m, where the set of all those dotted lines corresponds to a bigger m, they are going to be much more tightly concentrated around the true generalization error of that h. [01:08:00] Does that make sense? You're basically applying Hoeffding's inequality to this gap over here; that's basically what you're doing. [Student comment] No, that's good, but there's a problem here: the problem is that we started with some fixed hypothesis and then averaged across all possible data that you could sample.
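[Editor's aside: the "dotted lines hug the black line as m grows" picture can be simulated directly. A minimal sketch, assuming one fixed hypothesis whose true 0-1 error is eps = 0.25 (a made-up number); each training error is then an average of m Bernoulli(eps) loss indicators.]

```python
import numpy as np

rng = np.random.default_rng(3)
eps = 0.25  # assumed true generalization error of one fixed hypothesis h_i

def worst_gap(m, datasets=2_000):
    """Largest |eps_hat - eps| seen over many resampled datasets of size m."""
    eps_hat = rng.binomial(1, eps, size=(datasets, m)).mean(axis=1)
    return float(np.abs(eps_hat - eps).max())

# Dotted lines drawn for the larger m sit much closer to the thick black line.
print("m=50:  ", worst_gap(50))
print("m=5000:", worst_gap(5000))
```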
[01:08:25] But in practice this is useless, because in practice we start with some data and run the empirical risk minimizer to find the lowest-training-error h for that particular data. Which means that h and the data that you have are not really independent: you chose the h to minimize the empirical risk for the particular data that you were given in the first place. [01:08:56] So to fix this, what we want to do is basically extend this result that we got to account for all h. Now, if we want to get a probabilistic bound on the gap between the generalization error and the empirical error for all h, what's that bound going to look like? [01:09:34] This is basically called uniform convergence. This result is called uniform convergence because we
because we are trying to we are trying to see how [01:09:41] are trying to we are trying to see how the risk curve converges uniformly to [01:09:45] the risk curve converges uniformly to the generalization risk how the [01:09:47] the generalization risk how the empirical risk curve uniformly converges [01:09:50] empirical risk curve uniformly converges to the generalization risk curve and and [01:09:52] to the generalization risk curve and and it's that that's called uniform [01:09:55] it's that that's called uniform convergence which you can apply to [01:09:56] convergence which you can apply to functions in general but here we are [01:09:57] functions in general but here we are applying to the risk curves across our [01:10:00] applying to the risk curves across our hypotheses and we can show I'm gonna [01:10:04] hypotheses and we can show I'm gonna just skip the math so this we showed [01:10:09] just skip the math so this we showed using halflings inequality and you can [01:10:12] using halflings inequality and you can apply the Union bound for unioning [01:10:15] apply the Union bound for unioning across all age except we can first we're [01:10:20] across all age except we can first we're going to limit ourselves to correct so [01:10:25] going to limit ourselves to correct so let me start over so we got this bound [01:10:28] let me start over so we got this bound for a fixed edge right but we are [01:10:31] for a fixed edge right but we are interested in getting the bound for any [01:10:34] interested in getting the bound for any possible edge right so that's our next [01:10:36] possible edge right so that's our next step right and the way we're going to [01:10:39] step right and the way we're going to going to extend this point wise result [01:10:41] going to extend this point wise result to across all of them is going to look [01:10:44] to across all of them is going to look different for two possible cases one is [01:10:46] different for two possible cases 
the case of a finite hypothesis class, and the other case is going to be the case of an infinite hypothesis class. So what does it look like? [01:11:06] So let's first consider finite hypothesis classes. First, we are going to assume that the class H has a finite number of hypotheses. The result by itself is not very useful, but it's going to be a building block for the other case. So let's assume that the number of hypotheses in this class is some number k. [01:11:45] We can show (I'm not going to go over the derivation, but I'm just going to write out the result; it's pretty intuitive) that, basically, if we apply the union bound over all k hypotheses, we end up just multiplying by a factor of k. So what we get is: the probability that there exists some hypothesis h in H such that the empirical error minus the generalization error, in absolute value, is greater
than γ is less than or equal to k times the probability for any single one, which is equal to k · 2 exp(−2γ²m): P(∃h ∈ H : |ε̂(h) − ε(h)| > γ) ≤ 2k exp(−2γ²m). [01:12:49] And then we flip it over, we negate it, and we get the probability that for all hypotheses in our class, |ε̂(h) − ε(h)| < γ, and this is going to be greater than or equal to 1 − 2k exp(−2γ²m). So with probability at least 1 minus this expression, which we can call δ, for all hypotheses the deviation is going to be less than some γ. [01:13:54] This is just Hoeffding's inequality plus the union bound, and negating the two sides gives you this; you can go through this slowly later from the notes, which go over it in more detail. Now basically, let δ = 2k exp(−2γ²m); we now have a relation between δ, γ, and m.
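[Editor's aside: the finite-class bound can also be checked numerically. This sketch uses my own toy setup, not the lecture's: k threshold classifiers on uniform inputs with true label 1{x > 0.5}, so each hypothesis's generalization error is known exactly, ε(h_t) = |t − 0.5|.]

```python
import numpy as np

rng = np.random.default_rng(4)

k, m, gamma, reps = 20, 300, 0.12, 2_000
thresholds = np.linspace(0.05, 0.95, k)   # the k hypotheses h_t(x) = 1{x > t}
gen_err = np.abs(thresholds - 0.5)        # exact eps(h_t) for uniform x

failures = 0
for _ in range(reps):
    x = rng.uniform(0.0, 1.0, m)
    y = (x > 0.5).astype(int)
    preds = (x[None, :] > thresholds[:, None]).astype(int)  # k x m predictions
    emp_err = (preds != y).mean(axis=1)                     # eps_hat(h_t)
    # "Bad" dataset: SOME hypothesis deviates by more than gamma.
    failures += int(np.any(np.abs(emp_err - gen_err) > gamma))

print(failures / reps, "<=", 2 * k * np.exp(-2 * gamma**2 * m))
```

[The observed rate of bad datasets should sit below 2k exp(−2γ²m), typically far below, since the union bound ignores how correlated the k deviations are.]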
[01:14:38] Here δ is like the probability of error, and by error I mean that the empirical risk and the generalization risk are farther apart than some margin; γ is called the margin of error, and m is your sample size. [01:15:00] So what this basically tells you is: if your algorithm is the empirical risk minimizer (it could have been any kind of algorithm, but if it is the kind that minimizes the training error), then you can get a relation between the margin of error, the probability of error, and the sample size. [01:15:27] So what we can do with this relation is basically fix any two and solve for the third, and that gives us some actionable results. For example, you can choose any two and solve for the third; I'm only going to go
over one one one of those so [01:16:04] going to go over one one one of those so let's fix fix gamma and Delta to be [01:16:14] let's fix fix gamma and Delta to be greater than 0 and we solve for M and we [01:16:19] greater than 0 and we solve for M and we get em to be a too many good one over to [01:16:25] get em to be a too many good one over to gamma square Delta so what this means is [01:16:34] gamma square Delta so what this means is with probability at least 1 minus Delta [01:16:37] with probability at least 1 minus Delta which means probably at least 99% 99.9% [01:16:41] which means probably at least 99% 99.9% for example the probability at least 1 [01:16:45] for example the probability at least 1 minus Delta the margin of error between [01:16:49] minus Delta the margin of error between the empirical risk and the true [01:16:52] the empirical risk and the true generalization risk is going to be less [01:16:54] generalization risk is going to be less than gamma as long as your training size [01:16:59] than gamma as long as your training size is bigger than this expression [01:17:01] is bigger than this expression all right that's something actionable [01:17:03] all right that's something actionable for us right now theory can be useful so [01:17:06] for us right now theory can be useful so this is also called the sample [01:17:08] this is also called the sample complexity dessert [01:17:13] right [01:17:14] right and basically what this means is as you [01:17:17] and basically what this means is as you increase em and you sample different [01:17:20] increase em and you sample different sets of data sets your dotted lines are [01:17:25] sets of data sets your dotted lines are going to get closer and closer to to the [01:17:29] going to get closer and closer to to the thick line which means minimizing you're [01:17:32] thick line which means minimizing you're minimizing on the dotted line will also [01:17:36] minimizing on the dotted line will also get you 
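As a quick sanity check on this relation, here is a small sketch (not from the lecture) that solves δ = 2k·exp(−2γ²m) for m; the specific values of k, γ, and δ are hypothetical, chosen only for illustration:

```python
import math

def sample_complexity(gamma: float, delta: float, k: int) -> float:
    """Smallest m (up to rounding) satisfying delta = 2k * exp(-2 * gamma**2 * m):
    with probability >= 1 - delta, all k hypotheses have
    |empirical risk - generalization risk| <= gamma."""
    return math.log(2 * k / delta) / (2 * gamma ** 2)

# Hypothetical numbers: k = 10,000 hypotheses, margin 0.05, confidence 99.9%
m = sample_complexity(gamma=0.05, delta=0.001, k=10_000)
print(math.ceil(m))  # number of training examples needed
```

Note that m grows only logarithmically in the number of hypotheses k, but quadratically as the margin γ shrinks.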
[01:17:38] So this is basically telling [01:17:40] you how minimizing the empirical risk gets you closer to [01:17:47] generalization. Okay, so we [01:17:53] started off with two questions relating [01:17:55] the empirical risk to the generalization [01:17:57] risk; now let's explore the second [01:18:00] question: how does the generalization [01:18:03] error of our minimizer compare with the best [01:18:13] possible in class? So let's look at this [01:18:16] diagram again. Let's say we started with [01:18:19] this dotted curve, and the [01:18:21] minimizer of that — sorry, the diagram is a [01:18:28] little off — this is h-hat, and it has a particular [01:18:45] generalization error. That is the [01:18:48] point: let's assume we got this data [01:18:51] set, we ran the empirical risk minimizer, [01:18:54] and we obtained this hypothesis; when [01:18:57] we deploy this in the real [01:18:58] world, its error is [01:19:01] going to be so much. Now how does this compare to the [01:19:05] performance of the minimizer of the best possible — [01:19:16] this is h-star, the best-in-class? Now [01:19:23] we want to get a relation between this [01:19:25] error level and that error level: [01:19:28] we got one bound that relates this to [01:19:32] this, and now we want something that [01:19:33] relates this to this. How do we do [01:19:37] that? It's pretty straightforward. [01:19:46] The generalization error of h-hat — that's [01:19:49] this dot over here — is [01:19:52] less than or equal to the empirical risk of h-hat [01:19:58] plus gamma: we got a result, using [01:20:04] Hoeffding and the union bound, that the gap [01:20:07] between the dotted line and the [01:20:09] thick black line is always less than [01:20:11] gamma — and it's the absolute value, [01:20:14] so we can write it this way as [01:20:17] well. So basically we [01:20:22] started from the thick black line and [01:20:25] dropped down to the dotted line. And this [01:20:30] is going to be less than or equal to the empirical [01:20:34] error of h-star plus gamma. Why is that? [01:20:45] Because the empirical [01:20:49] error of h-hat, by definition, is less [01:20:51] than or equal to the empirical error of [01:20:53] any other hypothesis, including the [01:20:54] best-in-class — [01:20:56] because this is the training error, not [01:20:58] the generalization error. [01:21:06] So we dropped from the [01:21:09] generalization error to the training error, and we said [01:21:12] this training error is [01:21:15] always going to be less than or equal to the [01:21:20] empirical error of the best-in-class — you [01:21:23] can see that the best-in-class was higher [01:21:24] on this particular training set. [01:21:26] And this gap, again, is also [01:21:31] bounded, because we proved uniform [01:21:32] convergence — the gap between the [01:21:34] dotted line and thick line is bounded by [01:21:36] gamma for any h — and this is [01:21:40] therefore at most the generalization error of h-star plus 2 gamma, because we [01:21:50] added the extra margin.
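Written out, the chain of inequalities on the board is (using ε for generalization risk and ε̂ for empirical risk):

```latex
\begin{aligned}
\varepsilon(\hat h) &\le \hat\varepsilon(\hat h) + \gamma
  &&\text{(uniform convergence, applied to } \hat h\text{)} \\
&\le \hat\varepsilon(h^\ast) + \gamma
  &&\text{(}\hat h\text{ minimizes the empirical risk)} \\
&\le \varepsilon(h^\ast) + 2\gamma
  &&\text{(uniform convergence, applied to } h^\ast\text{)}
\end{aligned}
```

with all three steps holding simultaneously with probability at least 1 − δ.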
[01:21:59] So we wanted a relation between our hypothesis's generalization error and the [01:22:00] generalization error of the [01:22:03] best-in-class hypothesis: we dropped [01:22:07] from the generalization error to the empirical error of our hypothesis, [01:22:09] related that to the empirical error of [01:22:11] the best-in-class, and then bounded [01:22:14] the gap between those two. So we got a [01:22:16] bound relating the generalization error of our hypothesis to [01:22:20] the best-in-class generalization error. Any [01:22:23] questions on this? So the result [01:22:33] basically says: with probability 1 minus [01:22:37] delta, and for training set size m, the [01:22:47] generalization error of the hypothesis [01:22:50] from the empirical risk minimizer is [01:22:52] going to be within the best-in-class [01:22:58] generalization error plus 2·sqrt((1/(2m)) · log(2k/δ)). [01:23:10] You can [01:23:15] get this from the earlier expression: [01:23:23] if you set it equal to [01:23:25] delta and solve for gamma, you will get [01:23:27] this. Any questions? [01:23:33] I think we are already over time.
[01:23:41] The case for infinite classes is an [01:23:44] extension to this — maybe I'll just write [01:23:46] the result. There is a concept called [01:23:48] VC dimension, [01:23:49] which is a pretty simple concept, but [01:23:53] we won't be going over it today. You can think of the VC [01:24:00] dimension as [01:24:02] trying to assign a size to an [01:24:07] infinite-size hypothesis [01:24:09] class: for a finite hypothesis class [01:24:11] we had the cardinality — the size [01:24:12] of the hypothesis class — and the VC dimension of some [01:24:16] hypothesis class is going to be some [01:24:19] number which is [01:24:22] like the size of that [01:24:25] class; it's basically telling you [01:24:26] how expressive it is. There are [01:24:34] very nice geometric interpretations of VC [01:24:37] dimension, and using it you can get a [01:24:39] similar bound — but now it's [01:24:44] not for finite classes anymore — in some big-O [01:24:53] form. [01:25:20] So in place of this margin we ended up [01:25:24] with a different margin that is a [01:25:26] function of the VC dimension, and the [01:25:32] key takeaway from this is that the [01:25:37] number of data examples — the sample [01:25:40] complexity — that you want is generally [01:25:43] on the order of the VC dimension to [01:25:46] get good results. That's basically the [01:25:48] main result there. With [01:25:52] that, I guess we will break for [01:25:55] the day and we'll take more questions.

================================================================================
LECTURE 010
================================================================================
Lecture 10 - Decision Trees and Ensemble Methods | Stanford CS229: Machine Learning (Autumn 2018)
Source: https://www.youtube.com/watch?v=wr9gUr-eWdA
---
Transcript

[00:00:03] Hello everyone, my name is Rafael [00:00:08] Townsend, I'm one of the head TAs for this [00:00:10] class. This week Andrew is travelling and my [00:00:13] advisor is still dealing with medical [00:00:15] issues, so I'm going to be giving today's [00:00:17] lecture. You heard from my wonderful [00:00:19] co-head TA a couple of weeks ago, and [00:00:22] today we're going to be going over decision [00:00:24] trees and various ensemble methods.
[00:00:28] So these might seem a bit like disparate [00:00:29] topics at first, but really decision [00:00:31] trees are sort of a classical example of a [00:00:33] model class to use with various ensemble [00:00:36] methods — we're going to get into [00:00:37] why in a bit. But just to give you [00:00:40] guys an overview of the outline: we're going [00:00:42] to be refreshing on decision trees, [00:00:43] then we're going to go over [00:00:45] ensemble methods in general, and then go [00:00:46] specifically into bagging, random forests, [00:00:48] and boosting. Okay, so let's get started: [00:00:53] first let's cover decision trees. [00:01:05] So last week [00:01:07] Andrew was covering SVMs, which are [00:01:09] one of the classical linear models, [00:01:11] and that brought to a close a lot of the [00:01:13] discussion of linear models; [00:01:15] today we're going to be getting to decision [00:01:17] trees, which is really one of our first [00:01:18] examples of a non-linear model. To [00:01:21] motivate these, let me give you [00:01:23] an example. Okay, so I'm Canadian, I really [00:01:27] like to ski, so I'm going to motivate it [00:01:29] using that. Pretend you have a [00:01:31] classifier that, given a time and a [00:01:34] location, tells you whether or not you [00:01:36] can ski — it's a binary classifier [00:01:38] saying yes or no. So you can [00:01:40] imagine a graph like this, and on the [00:01:44] x-axis we're going to have time in months, [00:01:47] counting from the start — so starting [00:01:53] at 1 for January through 12 for December — and [00:01:57] on the y-axis we're going to use [00:01:59] latitude in degrees. For [00:02:03] those of you who might have forgotten [00:02:04] what latitude is: at [00:02:07] positive 90 degrees you're at the North [00:02:09] Pole, at negative 90 degrees you're at [00:02:11] the South Pole, [00:02:17] and zero is the equator — it's your [00:02:20] location along the north-south axis. [00:02:23] So given this, if you recall, [00:02:27] winter in the northern hemisphere [00:02:29] generally happens in the early months of [00:02:31] the year, so you might see that you can [00:02:33] ski in these early months over here and [00:02:35] have some positive data points, and then [00:02:37] again in the later months; and in [00:02:42] the middle you can't really ski. Whereas [00:02:46] in the southern hemisphere it's [00:02:47] basically flipped: you cannot ski [00:02:50] in the early months, you can ski during [00:02:53] the July/August [00:02:55] time period, and then you can't ski in [00:02:58] the later months. And the equator in [00:03:01] general is just not great for skiing — [00:03:02] there's a reason I don't live there — so [00:03:04] you just have a bunch of negatives [00:03:05] there. Okay, so when you look at a data [00:03:10] set like this, you've got these [00:03:12] separate regions that you're looking at, [00:03:14] and you want to isolate [00:03:15] out those regions of positive examples. [00:03:16] If you had a linear classifier, you'd [00:03:19] be hard-pressed to come up with [00:03:21] any decision boundary that would [00:03:22] separate this reasonably.
[00:03:24] Now you could think, okay, maybe you have an SVM or [00:03:26] something, and you come up with a kernel that [00:03:28] could perhaps project this into [00:03:30] a higher-dimensional feature space that would make [00:03:31] it linearly separable; but it turns out [00:03:33] that with decision trees you have a very [00:03:35] natural way to do this. So to [00:03:39] make clear exactly what we want to do [00:03:41] with decision trees: we want to [00:03:43] partition the space into individual [00:03:45] regions — we want to isolate [00:03:47] out things like the positive examples. [00:03:49] In general this problem of coming up with [00:03:53] the optimal regions is fairly intractable, but the way we do it [00:03:56] with decision trees is in a [00:03:58] greedy, top-down, recursive [00:04:09] manner — this is recursive [00:04:14] partitioning. [00:04:22] It's top-down [00:04:24] because we're starting with the overall [00:04:25] region and we want to slowly partition [00:04:27] it up, and it's greedy because [00:04:29] at each step we want to pick the best [00:04:31] partition possible. [00:04:33] Okay, so let's actually try and work out [00:04:36] intuitively what a decision tree would [00:04:38] do. What we do is we start with [00:04:40] the overall space, and the tree is [00:04:42] basically going to play twenty questions [00:04:44] with this space. So for example, [00:04:47] one question it might ask — if we have [00:04:50] the data coming in like this — is: is the [00:04:53] latitude greater than thirty degrees? [00:04:56] That would involve [00:05:00] cutting the space like this, for example, [00:05:02] and then we'd have a yes or a no. [00:05:08] So starting from the most [00:05:11] general space, we have now partitioned [00:05:13] the overall space into two separate [00:05:15] spaces using this question. And [00:05:18] this is where the recursive part comes [00:05:20] in: now that you've [00:05:22] split the space into two, you can then [00:05:25] treat each individual space as a [00:05:27] new problem, and ask a new question about it.
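The "twenty questions" the tree asks can be pictured as nested threshold tests. Here is a tiny hand-built sketch for the ski example; the thresholds and month ranges are illustrative assumptions for this sketch, not values from the lecture:

```python
def can_ski(month: int, latitude: float) -> bool:
    """A tiny hand-built decision tree for the ski example: each level
    asks one threshold question about one feature. The thresholds and
    month ranges below are illustrative guesses, not fitted values."""
    if latitude > 30:                    # well into the northern hemisphere
        return month < 3 or month > 11   # northern winter: Jan, Feb, Dec
    if latitude < -30:                   # well into the southern hemisphere
        return 6 <= month <= 9           # southern winter: roughly Jun-Sep
    return False                         # near the equator: no skiing
```

Each `if` corresponds to one split node of the tree, and each `return` is a leaf predicting the majority class of its region.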
[00:05:29] So for example, now that you've asked [00:05:31] this latitude-greater-than-30 question, [00:05:33] you could then ask something like: is the month [00:05:37] less than, say, March? That would give you a yes or no, [00:05:44] and what that works out to, effectively, [00:05:46] is that now you've taken this upper [00:05:48] space here and divided it up into these [00:05:53] two separate regions. And so [00:05:56] you can imagine how, through asking [00:05:58] these recursive questions over and over [00:06:00] again, you could start splitting up the [00:06:01] entire space into your individual [00:06:03] regions. Okay, so to make [00:06:13] this a little bit more formal, what we're [00:06:16] looking for is a split function. So you have a region — [00:06:22] let's call that region R_p, [00:06:26] in this case p for parent — and we're [00:06:29] looking for [00:06:33] a split s_p, which you can [00:06:46] write as a function of (j, t), [00:06:53] where j is the feature [00:06:55] number and t is the threshold you're [00:06:57] using. You can write this [00:06:59] out formally as [00:07:00] outputting a tuple of two sets: on the one hand [00:07:03] you have the set of x where x_j — [00:07:08] the j-th feature of x — is less than the [00:07:11] threshold t, with x an element of R_p [00:07:15] (since we're only partitioning that [00:07:16] parent region); and the second set is [00:07:20] the same thing, except it's [00:07:24] those that are greater than or equal to t. [00:07:33] And so we can refer to these [00:07:34] as R_1 and R_2. Any questions so far? [00:07:48] Okay, so we've now defined how we [00:07:51] would do this: we're trying to [00:07:52] greedily pick splits that [00:07:54] partition our input space, and the [00:07:57] splits are defined by which [00:07:58] feature you're looking at and the [00:08:00] threshold you're applying to that [00:08:02] feature. A natural question to [00:08:05] ask now is: how do you choose these [00:08:08] splits?
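A minimal sketch of the split function s_p(j, t) described above, assuming a region is represented as a list of (feature vector, label) pairs; the data set and variable names are made up for illustration:

```python
def split(region, j, t):
    """The split s_p(j, t): partition the parent region R_p into
    R_1 = {x in R_p : x_j < t} and R_2 = {x in R_p : x_j >= t}.
    A region is a list of (feature vector, label) pairs."""
    r1 = [(x, y) for x, y in region if x[j] < t]
    r2 = [(x, y) for x, y in region if x[j] >= t]
    return r1, r2

# Hypothetical ski data: features are (month, latitude), label 1 = can ski
data = [([1, 45], 1), ([7, 45], 0), ([7, -45], 1), ([1, 0], 0)]
r1, r2 = split(data, j=1, t=30)   # threshold the latitude feature at 30
```

Note that the split only looks at one feature and one threshold — that restriction is what makes the greedy search over candidate splits tractable.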
[00:08:25] All right, so I gave this [00:08:27] intuitive explanation that really what [00:08:28] you're trying to do is isolate out the regions of positives and [00:08:30] negatives in this case, and so what's [00:08:33] useful is to define a loss on a region: [00:08:37] define your loss L on R. [00:08:54] For now let's define our loss as [00:08:56] something fairly obvious — your misclassification [00:08:58] loss, which is how many [00:09:00] examples in your region you get wrong. [00:09:02] So assuming that you have C [00:09:10] classes total, you can define p-hat_c to [00:09:26] be the proportion of examples in R that belong to class c. [00:09:52] Okay, and so now that we've got this [00:09:55] definition, where p-hat_c [00:09:57] tells us the proportion of examples [00:09:59] in each class, you can [00:10:01] define the loss of any region — let's call it the misclassification [00:10:09] loss — as just L_misclass(R) = 1 − max over c of p-hat_c. [00:10:19] And the reasoning behind this is [00:10:22] basically that for any [00:10:24] region you've subdivided, generally what [00:10:26] you'll want to do is predict the most [00:10:28] common class there, which is just the class with [00:10:30] maximum p-hat_c, and then all [00:10:33] the remaining probability just gets [00:10:35] counted as misclassification error. [00:10:37] And so then once we have a [00:10:44] loss defined, we want to pick the split [00:10:48] that decreases the loss as much as [00:10:51] possible. So recall I defined this [00:10:53] parent region R_p and the two [00:10:55] children regions R_1 and R_2; you [00:10:58] basically want to reduce the loss as [00:11:01] much as possible, so you want to [00:11:04] maximize the decrease L(R_p) − [00:11:14] (L(R_1) + L(R_2)) — this [00:11:23] here is your parent loss and this is your children's [00:11:33] loss. And what you're optimizing over [00:11:42] in this case is the (j, t) that we [00:11:44] defined over there, since the split is [00:11:47] really what defines our two [00:11:49] children regions.
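Putting the last two pieces together, here is a sketch of the misclassification loss and the greedy search over (j, t). This is illustrative, not the lecture's code, and it uses unweighted child losses as written on the board; a common refinement weights each child's loss by its share of the examples:

```python
from collections import Counter

def misclassification_loss(labels):
    """L_misclass(R) = 1 - max_c p_hat_c: predict the region's majority
    class; the rest of the probability mass is misclassified."""
    if not labels:
        return 0.0
    return 1.0 - max(Counter(labels).values()) / len(labels)

def best_split(region, num_features):
    """Greedy split search over (j, t). The parent loss L(R_p) is fixed,
    so maximizing the decrease is the same as minimizing the children's
    loss L(R_1) + L(R_2)."""
    best = None
    for j in range(num_features):
        for t in sorted({x[j] for x, _ in region}):
            labels1 = [y for x, y in region if x[j] < t]
            labels2 = [y for x, y in region if x[j] >= t]
            score = misclassification_loss(labels1) + misclassification_loss(labels2)
            if best is None or score < best[0]:
                best = (score, j, t)
    return best  # (children's loss, feature index j, threshold t)
```

Trying every observed feature value as a threshold is enough, since any threshold between two consecutive values induces the same partition.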
[00:11:52] What you'll notice is that the loss of the [00:11:53] parent doesn't really matter in this [00:11:55] case, because that's already fixed, so [00:11:57] really all you're trying to do is [00:11:58] minimize the sum of the losses of [00:12:01] your children. [00:12:03] Okay, so let's move to this next board. [00:12:27] So I started to define this misclassification [00:12:28] loss; let's get a little [00:12:29] bit into why misclassification [00:12:31] loss isn't actually the [00:12:32] right loss to use for this problem. [00:12:51] And so for a simple example — [00:12:54] I've drawn out a tree [00:12:56] like this — let's pretend that we [00:12:58] have a setup here where we're [00:13:03] coming into a decision node, and at this [00:13:05] point we have 900 positives and 100 [00:13:08] negatives. So this is a [00:13:11] misclassification loss of 100 in this case, [00:13:14] because you'd predict the most common [00:13:15] class and end up with 100 misclassified [00:13:17] examples. And so this would be your [00:13:21] region R_p right now. [00:13:24] Then you can split it into two [00:13:26] other regions, say R_1 and R_2, and [00:13:35] say that what you've achieved now is [00:13:37] 700 positives and 100 negatives on [00:13:40] this side versus 200 positives and zero [00:13:46] negatives on this side. Now this [00:13:51] seems like a pretty good split, since [00:13:52] you're sorting out some more examples. [00:13:54] But what you can see is that if you just [00:13:56] drew the same thing again — R_p with [00:14:00] 900 positives and 100 negatives, split — [00:14:08] and say in this case you instead got 400 [00:14:12] positives and 100 negatives over here, and [00:14:18] 500 positives and zero negatives over there — most people would argue [00:14:23] that this right decision boundary is [00:14:25] better than the left one, because you're [00:14:27] basically isolating out even more [00:14:28] positives in this case. However, if you're [00:14:31] just looking at your misclassification [00:14:33] loss — calling the left pair R_1 and R_2 and the [00:14:37] right pair R_1′ and R_2′ — [00:14:39] your loss of R_1 plus R_2 in [00:14:48] the left case is
just one hundred plus [00:14:50] this left case is just one hundred plus zero okay so just one hundred and then [00:14:54] zero okay so just one hundred and then on the right side here it's actually [00:14:56] on the right side here it's actually still just the same alright and in fact [00:15:05] still just the same alright and in fact if you look at the original loss of your [00:15:07] if you look at the original loss of your parent it's also just a hundred right so [00:15:10] parent it's also just a hundred right so you haven't really according to this [00:15:12] you haven't really according to this lost metric changed anything at all and [00:15:14] lost metric changed anything at all and so that sort of brings up one problem [00:15:16] so that sort of brings up one problem with the Mis classification loss is that [00:15:17] with the Mis classification loss is that it's not really sensitive enough okay [00:15:21] it's not really sensitive enough okay so like instead what we can do is we can [00:15:24] so like instead what we can do is we can define this cross-entropy loss okay [00:15:44] which will define as L cross [00:15:51] let me just write this out here [00:16:00] and so really what you're doing is [00:16:02] and so really what you're doing is you're just summing over the classes and [00:16:04] you're just summing over the classes and it's the probability the proportion of [00:16:06] it's the probability the proportion of elements in that class times the log of [00:16:08] elements in that class times the log of the proportion in that class and how you [00:16:10] the proportion in that class and how you can think of this is it's sort of a this [00:16:12] can think of this is it's sort of a this concept that we borrow from information [00:16:14] concept that we borrow from information theory which is sort of like the number [00:16:16] theory which is sort of like the number of bits you need to communicate to tell [00:16:19] of bits you need to communicate to 
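To make the insensitivity concrete, here is a small sketch (my own illustration, not code from the lecture) that evaluates both candidate splits from the example above under the two losses; the counts are the ones on the board:

```python
import math

def misclassification_loss(pos, neg):
    # Predict the majority class; the minority count is the number of mistakes.
    return min(pos, neg)

def cross_entropy_loss(pos, neg):
    # L_cross = -sum_c p_c * log2(p_c), treating 0 * log(0) as 0.
    total = pos + neg
    return sum(-c / total * math.log2(c / total) for c in (pos, neg) if c > 0)

parent = (900, 100)
split_left = [(700, 100), (200, 0)]    # R1, R2
split_right = [(400, 100), (500, 0)]   # R1', R2'

for name, split in [("left", split_left), ("right", split_right)]:
    mis = sum(misclassification_loss(p, n) for p, n in split)
    # Weight each child's cross-entropy by its share of the 1000 examples.
    ce = sum((p + n) / 1000 * cross_entropy_loss(p, n) for p, n in split)
    print(name, mis, round(ce, 4))
```

Both splits come out at 100 misclassified examples, exactly like the parent, while the weighted cross-entropy is lower for the right split, matching the intuition that it is the better one.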
[00:16:24] And so that sounds like a mouthful, but really you can think of it intuitively: if someone already knows the probabilities, say it's a hundred percent chance that it's one class, then you don't need to communicate anything to tell them exactly which class it is, because it's obvious that it's that one class. Versus if you have a fairly even split, then you need to communicate a lot more information to tell someone exactly what class you're in. Any questions so far? Yep?

[00:17:05] [Student asks about R1 and R2 versus the parent region.] Yeah, so for that case there, I'll try and reach up there, but say R_P was your start region, the overall region. Then R1 would be all the points above this latitude-30 line, and R2 would be all the points below the latitude-30 line. Yep?

[00:17:39] [Student question.] Yeah, so the question is: when you're trying to minimize this loss here, is it the same as maximizing the children's loss? And it turns out it doesn't really matter which way you put it. Basically you're either trying to minimize the loss of the children or maximize the gain in information. [00:18:18] Yeah, you're right, that should actually be a max. Let me fix that really quick: you start with your parent loss and then you're subtracting out your children's loss, so the higher this quantity is, the better. Yeah, so you really want to maximize this guy. Makes sense, everyone? Thanks for that.
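The bits intuition can be checked with a couple of lines (a minimal sketch of my own, not from the lecture): a certain class costs zero bits, an even split costs a full bit, and a 90/10 split sits in between:

```python
import math

def entropy_bits(p):
    # Average number of bits to communicate the class when P(positive) = p.
    return sum(-q * math.log2(q) for q in (p, 1 - p) if q > 0)

print(entropy_bits(1.0))   # one class is certain: nothing to communicate
print(entropy_bits(0.5))   # even split: one full bit per example
print(entropy_bits(0.9))   # mostly one class: somewhere in between
```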
[00:19:02] Okay, so I've sort of given this hand-wavy... oh sure, what's up? [Student question.] So the question is, for the cross-entropy loss, is it log base 2 or log base c? It's log base 2, okay? Yep? Or sorry, I didn't quite hear that. [00:19:42] Okay, so the question is: what is the proportion that's correct versus incorrect for these two examples we've worked through here? So basically, what we're starting with is 900 positives and 100 negatives. So you can imagine, if you just stopped at this point, you would just classify everything as positive, and so you'd get the 100 negatives incorrect, because this is 900 positives and 100 negatives. So if you just stopped here and tried to classify given this whole region R_P, you would end up getting 10% of your examples wrong. In this case we're not talking about percentages, though, we're talking about the absolute number of examples that we've gotten wrong; you can also definitely talk in terms of percentages instead.

[00:20:29] And then down here, once you've split it, now you've got these two sub-regions, and on this left one here you still have more positives than negatives, so you're still going to classify positive in this leaf, and you're still going to classify positive in this leaf too, because the positives are still the majority class there. And in this case, where you have zero negatives, you're not going to make any errors in your classification, versus in this case you're still going to make 100 errors. And so what I'm saying is that at this level, if we just look above this line, you're making 100 mistakes, and then below this line you're still making 100 mistakes. So the loss in this case is not very informative.

[00:21:13] So this p̂, okay, I'm being a little bit loose with the notation here, but the p̂ in this case is a proportion. Basically it's a question of whether you're normalizing the whole thing or not.

[00:21:35] Okay, so I've given a bit of a hand-wavy explanation as to why misclassification loss versus cross-entropy loss might be better or worse. We can actually get a fairly good intuition for why this is the case by looking at it from a sort of geometric perspective. So pretend now that you have this plot. What you're plotting here is: pretend you have a binary classification problem, so it's just, is it the positive class or the negative class? And so you can represent p̂ as the proportion of positives in your set, and what you've got plotted up here is
your loss. [00:22:19] For cross-entropy loss, your curve is going to end up looking like this strictly concave curve. And what you can do is look at where your children versus your parent would fall on this curve. So say that you have two children: you have one up here, let's call this L(R1), and you have one down here, L(R2). And say that you have an equal number of examples in both R1 and R2, so they're equally weighted. When you're looking at the overall loss between the two, that's really just the average of the two, so you can draw a line between these two points, and the midpoint turns out to be the average of your two losses. So this is (L(R1) + L(R2)) / 2; that's what this guy is. And what you can notice is that in fact the loss of the parent node is actually just this point projected upwards onto the curve, so this would be your L(R_parent), and this difference right here is sort of your change in loss. Does this make sense? Any questions?

[00:24:04] Okay, so just to recap: say we have two children regions, and they have different probabilities of positive examples occurring. One would fall on this point on the curve, and say the other one falls on this point on the curve. Then the average of the two losses falls on the midpoint between these two original losses, and if you look at the parent, it's really just halfway between on the x-axis, and you can project upwards for that as well, and you end up with the loss of the parent. What's up?

[00:24:44] [Student question.] Okay, so what we're looking at here is the cross-entropy loss. You've got this function here, this L_cross, and that's in terms of the p̂_c's.
[00:24:55] And in this case here we're just assuming that we have two classes. And so what we're doing is we're just modifying the p̂_c, changing it along the x-axis, and then we're looking at what the response of the overall loss function is on the y-axis. So what I just did here is, this curve just represents, for any p̂_c, what the cross-entropy loss would look like. And so we can come back to this example. If we look at this parent here, this guy has a 10%... right, it's sort of like p̂ for this guy is 0.1, it's 10% basically. Or, I guess, no, in this case it would be 0.9, sorry. And then versus here, in these two cases, your p̂ in this case is 1, since you've got them all right, and then in this case it's 0.8. And so you can sort of see, since these are equal, there's the same number of examples in both of these, the p̂ of the parent is just the average of the p̂'s of the children. And so that's how we can sort of take this L(R_parent): this L(R_parent) is just halfway, if we projected this down. Let me just erase this a little bit here. [00:26:13] If we projected this down like this, we'd see that this point here is the midpoint. But then when you're actually averaging the two losses after you've done the split, you're just taking the average loss: you're just summing L(R1) plus L(R2), and if you're taking the average then you're dividing by 2, and what you can do is just draw the line and take the
midpoint of this line instead. [00:26:52] Yeah, yeah, exactly. So really, it's a good point: the question was, if you have an uneven split, what would that look like on this curve? At this point I've been making the math easy by saying there's an even split, but really, if there was a slightly uneven split, the average would just be some other point along this line that you've drawn. And as you can see, the whole thing is strictly concave, so any point along that line is going to lie below the original loss curve for the parent. So basically, as long as you're not picking the exact same points on the probability curve, and thereby not making any gain at all in your split, you're going to gain some amount of information through this split, okay?
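For the uneven case, the same check works with size weights (again my own sketch, not lecture code; the counts are from the left-hand example, which splits 800/200):

```python
import math

def ce_loss(p):
    # Binary cross-entropy curve: L(p) = -p*log2(p) - (1-p)*log2(1-p).
    return sum(-q * math.log2(q) for q in (p, 1 - p) if q > 0)

def information_gain(child1, child2):
    # Parent loss minus the size-weighted children losses: the quantity to maximize.
    (p1, n1), (p2, n2) = child1, child2
    t1, t2 = p1 + n1, p2 + n2
    total = t1 + t2
    parent = ce_loss((p1 + p2) / total)
    children = (t1 / total) * ce_loss(p1 / t1) + (t2 / total) * ce_loss(p2 / t2)
    return parent - children

print(information_gain((700, 100), (200, 0)))   # uneven sizes, still a positive gain
print(information_gain((450, 50), (450, 50)))   # same proportions: essentially zero gain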
[00:27:54] Now, this was the cross-entropy loss. If instead we look at the misclassification loss over here, let's draw this one instead. [00:28:32] Well, we can see in this case, if you draw it, that it's in fact really this pyramid kind of shape: it's just linear, and then it flips over once you start classifying the other side. And if you did the same argument here, where you had L(R1) and L(R2) and then you drew a line between them, that line is basically just still the loss curve, and so in this case your midpoint would be the same point as your parent. So your loss of R_parent in this case would equal your loss of R1 plus loss of R2, divided by 2. And so in this case, even though according to the cross-entropy formulation you do have a gain in information, and intuitively we do see a gain in information, over here for the misclassification loss, since it's not very sensitive, if you end up with points on the same side of the curve then you actually don't see any sort of information gain under this kind of representation.
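The same chord construction shows the failure numerically (my own sketch, not lecture code): the misclassification curve is min(p, 1 - p), which is linear on each side of p = 0.5, so two children on the same side give a chord that lies on the curve itself:

```python
def mis_loss(p):
    # Misclassification rate when predicting the majority class.
    return min(p, 1 - p)

p1, p2 = 1.0, 0.8                      # both children on the p > 0.5 side
parent_loss = mis_loss((p1 + p2) / 2)  # curve value at the parent proportion
chord_midpoint = (mis_loss(p1) + mis_loss(p2)) / 2

# The chord coincides with the locally linear curve: no measurable gain.
print(parent_loss, chord_midpoint)
```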
[00:29:37] And so there's actually a couple of options here. I presented the cross-entropy loss; there's also the Gini loss, which is another one, which people just write out as the sum over your classes of p̂_c times (1 - p̂_c):

L_gini = Σ_c p̂_c (1 - p̂_c)

And it turns out that this curve also looks very similar to this original cross-entropy curve, and what you'll see is that actually most curves that are successfully used for decision splits look basically like this strictly concave function, okay? [00:30:13] So that covers a lot of the criteria we use for splits. Let's look at some extensions for decision trees. [00:30:38] I'm gonna keep this guy. [00:30:58] Okay, so far I've been talking about decision trees for classification. You could also imagine having decision trees for regression, and people generally call these regression trees. [00:31:11] So taking the ski example again, let's pretend that instead of now predicting whether or not you
can ski, you're predicting the amount of snowfall you would expect in that area around that time. And so let's just say it's inches of snowfall per day or something. And maybe you have some values up here, some high values because it's winter over there; it's mostly zeros over here because you're in summer; then you have some more high values over here; and then you have zeros along the equator, and again zeros in the southern hemisphere during our winter, like this. And you can sort of see how you'd do just the exact same thing: you still want to isolate out regions and sort of increase the purity of those regions, so you could still create your trees like this, split out like this, for example. And what you do when you get to one of your leaves is, instead of just predicting a majority class, what you
can do is predict the mean of the values left. So for a region R_m you're predicting ŷ_m, which is the sum over all the indices in R_m of (y_i - ŷ_m), and you want the squared loss, and then, I guess in this case, you want to normalize by the overall cardinality of R_m, or how many points you have. And so in this case basically all you've done is you've switched your loss function... or, no, sorry, that's wrong; I got a little bit ahead of myself. This is actually just the mean value, which would just be this in this case: you're just summing all the values within your region, so in this case 7, 9, 8, 10, and then just taking the average of that:

ŷ_m = (1 / |R_m|) Σ_{i ∈ R_m} y_i

But then what I was starting to write out there was actually really the loss that you would use in this case, which is your squared loss. [00:33:42] So we'll just call that L_squared, which in this case would be equal to:

L_squared = Σ_{i ∈ R_m} (y_i - ŷ_m)² / |R_m|

And that's what I started to write over there. But in this case you have your mean prediction, and then your loss is how far off your mean prediction is from the actual values in this region. Yep?

[00:34:33] So that's a really good question. The question was: how do you actually search for your splits? How do you actually solve the optimization problem of finding these splits? And it turns out that you can actually basically brute-force it very efficiently. I'm going to get into sort of the details of how you do that shortly, but it turns out that you can just go through everything fairly quickly. I'll get into that; I think that's in a couple of sections from now. Any other questions? Okay, so this is for regression trees.
It turns out that another useful extension, one that you don't really get for other learning algorithms, is that you can also deal with categorical variables fairly easily. [00:35:32] Basically, for this case you could imagine that instead of having your latitude in degrees, you just have three categories: this is the northern hemisphere, this is the equator, and this is the southern hemisphere. [00:35:53] And then, instead of the sort of initial question we had before, "was latitude greater than thirty," your question could instead be: is the location in {northern hemisphere}? [00:36:15] And you could have basically any sort of subset; you can ask a question about any subset of the categories you're looking at, in this case {northern}. This question would still split out this
top part from these bottom pieces here. [00:36:28] One thing to be careful about, though, is that if you have q categories, then you're basically considering every single possible subset of those categories, and that's 2 to the q possible splits. [00:36:53] So in general you don't want to deal with too many categories, because it quickly becomes intractable to look through that many possible splits. It turns out that in certain very specific cases you can still deal with a lot of categories. One such case is binary classification, where (the math is a little bit complicated for this one) you can sort your categories by the fraction of positive examples in each category, take them in that sorted order, and search through it linearly, and it turns out that that yields optimal splits.
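A minimal sketch of the subset enumeration; the hemisphere categories mirror the example above. Since a subset and its complement describe the same split, q categories actually give 2^(q-1) - 1 distinct questions, which is still exponential in q:

```python
from itertools import combinations

categories = ("northern", "equator", "southern")  # the q = 3 example above

def candidate_splits(cats):
    """All distinct 'is the location in S?' questions.

    A subset and its complement induce the same split, so we keep only
    one of each pair: 2**(q-1) - 1 questions for q categories."""
    seen, splits = set(), []
    for r in range(1, len(cats)):
        for subset in combinations(cats, r):
            key = frozenset(subset)
            if key not in seen and (frozenset(cats) - key) not in seen:
                seen.add(key)
                splits.append(subset)
    return splits

print(candidate_splits(categories))
# [('northern',), ('equator',), ('southern',)]
```

With q = 3 this is cheap, but at q = 30 there would be over half a billion candidate questions, which is why the sorted-order trick for binary classification matters.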
[00:37:36] So decision trees: we can use them for regression, and we can also use them with categorical variables. [00:37:40] One thing that I've not gotten into is that you can imagine that, in the limit, if you grew your tree without ever stopping, you could end up just having a separate region for every single data point that you have, and you could consider that pretty clearly overfitting if you ran it all the way to that completion. [00:37:58] So you can sort of see that decision trees are fairly high-variance models, and one thing that we're interested in doing is regularizing these high-variance models. [00:38:24] And generally, people have solved this problem through a number of heuristics. [00:38:29] So one such heuristic is that if you hit a certain minimum leaf size, you stop splitting that leaf. [00:38:41] For example, in this case, if you only have four examples left in this leaf, then you just stop.
[00:38:47] Another one is that you can enforce a maximum depth, and sort of a related one in this case is a maximum number of nodes. [00:39:09] And then a fourth, a very tempting one to use, I've got to say, is a minimum decrease in loss. [00:39:24] And I say this one's tempting because it's generally not actually a good idea to use this minimum decrease in loss. You can see why by thinking about it this way: if you have any sort of higher-order interactions between your variables, you might have to ask one question that is not very optimal, that doesn't give you that much of a decrease in loss, and then your follow-up question, combined with that first question, might give you a much bigger decrease. [00:39:48] And you can sort of see that in this case, where the initial latitude question doesn't really give us that much of a gain (we still split some positives and negatives), but the combination of the latitude
question plus the time question really nails down what we want. [00:40:01] And if we were looking at it purely from the minimum-decrease-in-loss perspective, we might stop too early and miss that entirely. [00:40:09] So a better way to do this kind of loss-based check is: you grow out your full tree, and then you prune it backwards instead. You grow out the whole thing, and then you check which nodes to prune out. [00:40:21] Pruning: how you generally do this is you have a validation set, and you evaluate what your misclassification error on that validation set would be for each leaf that you might remove. So you would use misclassification error, in this case, with the validation set. [00:41:00] Any questions? Yep? [00:41:09] The minimum decrease in loss, yes, of course. So you'll recall that before, I was talking about this R_p, this loss of the parent, versus
the loss of R_1 plus the loss of R_2, right? Or, I had written out a maximization, basically. [00:41:25] Oh, to be clear, the question is: can you explain a little bit more clearly what this minimum decrease in loss means? So you have your loss of R_1 and R_2 versus your loss of the parent, that is, before the split. [00:41:37] So before the split you have your loss, L(R_p), the loss of the parent, and after the split you have L(R_1) plus L(R_2). [00:42:00] And if the decrease from your parent's loss to your children's loss is not great enough, you might be tempted to say: okay, that question didn't really gain us anything, and therefore we will not actually use that question. But what I'm saying is that sometimes you have to ask multiple questions, sort of suboptimal questions first, to get to the really good questions.
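The parent-versus-children comparison can be sketched with a hypothetical XOR-style interaction, using misclassification count as the loss for simplicity. It shows a first question whose gain is zero on its own, even though it sets up a perfect second split:

```python
def misclass_loss(labels):
    """Points a majority-vote leaf gets wrong: count of the minority class."""
    pos = sum(labels)
    return min(pos, len(labels) - pos)

def split_gain(parent, left, right):
    """Decrease in loss: L(R_p) - (L(R_1) + L(R_2))."""
    return misclass_loss(parent) - (misclass_loss(left) + misclass_loss(right))

# Hypothetical data with an interaction: label = (x1 > 0) XOR (x2 > 0).
pts = [(-1, -1, 0), (-1, 1, 1), (1, -1, 1), (1, 1, 0)]
labels = [lab for _, _, lab in pts]

left = [lab for x1, _, lab in pts if x1 <= 0]   # first question: x1 <= 0?
right = [lab for x1, _, lab in pts if x1 > 0]
print(split_gain(labels, left, right))  # 0: the question looks useless alone

# Following up with "x2 <= 0?" inside each child makes every leaf pure, so a
# minimum-decrease-in-loss stopping rule would have quit too early here.
```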
Especially if you have interaction between your variables, if there's some amount of correlation between your variables. [00:42:37] Okay, so we talked about regularization. I said that we would get to runtime; let's actually just go back up here and cover that really quickly. [00:43:27] Okay, so it'll be useful to define a couple of numbers at this point. Say you have n examples, you have f features, and finally, say the depth of your tree is d. So you have n examples that you trained on, each has f features, and your resulting tree is of depth d. [00:44:06] So at test time, your runtime is basically just your depth: it's just O(d). [00:44:18] And typically, though not in all cases, d is at most around the log of your number of examples. You can sort of think about this as: if you have a fairly balanced tree, you'll end up sort of evenly splitting out all
the examples, recursively doing these binary splits, and so the depth ends up being around the log of that n. [00:44:44] Okay, so at test time you've generally got it pretty quick. At train time, [00:44:55] consider each point: if you return back to this example, you'll see that each point, once you've done a split, only belongs to the left or the right of that split afterwards. Sort of like this point right here: once you've split here, it will only ever be part of this region; it will never be considered on the other side, the right-hand side, of that split. [00:45:18] All right, so if your tree is of depth d, each point is part of O(d) nodes. [00:45:39] And then at each node, you can actually work out that the cost of evaluating that point at train time is just proportional to the number of features, f. [00:46:04] And I won't get too much into the details of why this is, but you can consider that if you're doing binary
features, for example, where each feature is just a yes-or-no of some sort, then if you have f features total, you only have to consider f possible splits, and so that's why the cost in that case would be f. [00:46:23] And if it was instead a quantitative feature, I mentioned briefly that you could sort the feature values and then scan through them linearly, and that also ends up being asymptotically O(f) to do. [00:46:35] Okay: so each point is in at most O(d) nodes, the cost of a point at each node is O(f), and you have n points total, so the total cost is really just O(nfd). [00:47:00] And it turns out that this is actually surprisingly fast, especially if you consider that n times f is just the size of your original design matrix, or your data matrix; your data matrix is of size n times f.
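The sort-then-scan idea for one quantitative feature can be sketched as follows, again using misclassification count as the loss and made-up data; after the initial sort, each candidate threshold is evaluated in O(1) by updating class counts incrementally:

```python
def best_threshold(xs, ys):
    """Best single split 'x > t?' for one quantitative feature.

    Sort once, then sweep candidate thresholds between consecutive values,
    maintaining left/right class counts incrementally.
    Returns (threshold, misclassification_count_after_split)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    total_pos = sum(ys)
    left_pos = left_n = 0
    best = (None, float("inf"))
    for k, i in enumerate(order[:-1]):
        left_pos += ys[i]
        left_n += 1
        right_pos, right_n = total_pos - left_pos, len(xs) - left_n
        loss = min(left_pos, left_n - left_pos) + min(right_pos, right_n - right_pos)
        if loss < best[1]:
            best = ((xs[i] + xs[order[k + 1]]) / 2, loss)
    return best

# Made-up latitudes: negatives below 30 degrees, positives above.
print(best_threshold([10, 20, 25, 35, 40, 50], [0, 0, 0, 1, 1, 1]))  # (30.0, 0)
```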
[00:47:28] And then your runtime is just going through the data matrix at most depth times, and since depth is generally bounded by the log of n, you have a generally fairly fast training time as well. [00:47:41] Any questions about runtime? [00:47:50] Okay. So I've been talking a lot about the good sides of decision trees; they seem pretty nice so far. However, there are a number of downsides too, and one big one is that they don't have additive structure. So let me explain a little bit what that means. [00:48:29] Okay, so let's say now we have an example where you have just two features again, x1 and x2, and say you define a line running through the middle, defined by x1 = x2, and all the points above this line are positive and all the points below it are negative. [00:48:54] Now, a simple linear model like logistic regression would have no issue with this kind of setup, but
for a decision tree, basically, you'd have to ask a lot of questions to even somewhat approximate this line. What you could try is, you're going to say, okay, let's split this way, then something like this, and so on. [00:49:18] And even there, you've asked a lot of questions and you've only gotten a very rough approximation of the actual line that you've drawn in this case. [00:49:27] And so decision trees do have a lot of issues with these kinds of structures, where the features are interacting additively with one another.
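This can be made concrete with a small made-up grid labeled by the diagonal rule: the linear rule itself is perfect, while the best single axis-aligned question (one decision-tree split) falls well short, and a tree only closes the gap by stacking many splits into a staircase:

```python
# Made-up integer grid, labeled positive when x1 > x2 (the diagonal rule).
pts = [(x1, x2) for x1 in range(6) for x2 in range(6) if x1 != x2]
labels = [1 if x1 > x2 else 0 for x1, x2 in pts]

def stump_accuracy(axis, thr):
    """Accuracy of one axis-aligned question 'x[axis] > thr?', allowing
    either labeling of the two leaves."""
    preds = [1 if p[axis] > thr else 0 for p in pts]
    correct = sum(p == y for p, y in zip(preds, labels))
    return max(correct, len(pts) - correct) / len(pts)

best_stump = max(stump_accuracy(a, t) for a in (0, 1) for t in range(6))
linear = sum((x1 > x2) == y for (x1, x2), y in zip(pts, labels)) / len(pts)
print(best_stump, linear)  # the single split trails the linear rule's 1.0
```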
[00:49:40] Okay, so to recap, since we've covered a number of different things about decision trees: there are a number of pluses and minuses to decision trees. [00:49:51] So on the plus side, and I actually think this is an important point, they're pretty easy to explain. If you're explaining what a decision tree is to a non-technical person, it's fairly obvious: okay, you have this tree, and you're just playing twenty questions with your data, letting it come up with one question at a time. [00:50:08] They're also interpretable: you can just draw out the tree, especially for shorter trees, to see exactly what it's doing. [00:50:21] They can deal with categorical variables, [00:50:29] and they're generally pretty fast. [00:50:35] However, on the negative side: one that I alluded to was that they're fairly high-variance models, and so are oftentimes prone to overfitting your data. [00:50:51] They're bad at additive structure. [00:51:00] And then finally, in large part because of those first two, they generally have fairly low predictive accuracy. [00:51:16] I know what you guys are thinking: I just spent all this time talking about decision trees, and now I tell you they actually sort of suck. So why did I actually cover decision trees? And the answer is that, in fact, you can make decision trees a
lot better through ensembling, and a lot of methods, for example the leading methods on Kaggle these days, are actually built on ensembles of decision trees. And they really provide an ideal sort of model framework through which we can examine a lot of these different ensemble methods. [00:51:44] Any questions about decision trees before I move on? [00:51:55] I don't think that's strictly... okay, so the question is: for the cross-entropy loss, does the log need to be base 2? And the answer is, I'm pretty sure that's not really relevant in this case. I'm not a hundred percent sure about that, but I'm pretty sure any base works. [00:52:11] Cross-entropy loss actually initially came out of information theory, where you have computer bits and you're transmitting bits, and so it's useful to think in terms of the bits of information that you can transmit, which is why it originally came up as log base 2 in the initial formulation.
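The hunch here is right: changing the log base only rescales the loss by a constant, since log_b(x) = ln(x) / ln(b), so it can never change which split minimizes the cross-entropy. A quick sketch with made-up leaf distributions:

```python
import math

def cross_entropy(ps, base):
    """H(p) = -sum over classes of p_c * log_base(p_c), for p_c > 0."""
    return -sum(p * math.log(p, base) for p in ps if p > 0)

candidates = [[0.5, 0.5], [0.9, 0.1], [0.25, 0.75]]  # made-up leaf distributions
h_bits = [cross_entropy(p, 2) for p in candidates]
h_nats = [cross_entropy(p, math.e) for p in candidates]

# Base 2 is just the natural-log version divided by ln(2) ...
assert all(abs(b - n / math.log(2)) < 1e-12 for b, n in zip(h_bits, h_nats))
# ... so the ranking of candidate splits, and hence the argmin, is unchanged.
assert sorted(range(3), key=h_bits.__getitem__) == sorted(range(3), key=h_nats.__getitem__)
print("any base gives the same best split")
```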
[00:52:53] Okay, so now let's talk about ensembling. [00:53:06] So why does ensembling help? At some level, you can sort of think back to your basic statistics. Say you have X_i's, which are random variables [00:53:52] that are independent, identically distributed. Probably a lot of you are familiar with this already; we call this i.i.d. [00:54:11] Okay, now say that the variance of one of these variables is sigma squared. [00:54:21] Then what you can show is that the variance of the mean of n of these random variables, written Var((1/n) * sum over i of X_i), is equal to sigma squared over n. [00:54:44] And so each independent variable you factor in is decreasing the variance, and so the thought is that if you can factor in a number of independent sources, you can steadily decrease your variance.
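A quick Monte Carlo check of Var(mean) = sigma^2 / n; all the constants here are made up for illustration:

```python
import random

random.seed(0)
sigma2, n = 4.0, 25          # per-variable variance and number of variables
trials = 20000               # number of simulated means

means = [sum(random.gauss(0, sigma2 ** 0.5) for _ in range(n)) / n
         for _ in range(trials)]
mu = sum(means) / trials
var_of_mean = sum((m - mu) ** 2 for m in means) / trials

print(var_of_mean)  # close to sigma2 / n = 0.16
```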
[00:55:02] Okay. I'll say, though, that this is a little bit simplistic as a way of looking at this, because really all these different things you're factoring together have some amount of correlation with each other, and so this independence assumption is oftentimes not correct. [00:55:23] So suppose instead you drop the independence assumption; [00:55:40] now your variables are just identically distributed, i.d. [00:55:53] And say we can characterize what the correlation between any two X_i's is, and we write that down as rho. [00:56:13] Then you can actually write out the variance of your mean as rho times sigma squared, plus (1 minus rho) over n, times sigma squared. [00:56:38] And so you can sort of see that if they're fully correlated, rho equals one, then the second term will drop to zero and you'll just have sigma squared again, because averaging a bunch of fully correlated variables is just going to give you the original variable's variance.
[00:56:51] Versus if they're completely decorrelated, rho equals zero, then the first term drops to zero and you just end up with sigma squared over n, which gives you the initial independent, identically distributed equation. [00:56:59] And so in this case, really, the name of the game is: you want to factor in as many different models as possible, to increase this n, which drives the second term down; and then on the other hand, you also want to make sure those models are as decorrelated as possible, so that rho goes down and the first term goes down as well. Okay. [00:57:35] And so this gives rise to a number of different ways to ensemble. [00:57:49] And one way you could think about doing this is you just use different algorithms. [00:58:03] This is actually what a lot of people on Kaggle, for example, will do: they'll just take, say, a random forest and an SVM, average them all together, and, you know, that actually works pretty well.
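The formula Var(mean) = rho * sigma^2 + ((1 - rho) / n) * sigma^2 can be sanity-checked by simulating equicorrelated variables; the construction below (a shared component plus an independent one) and all the constants are just illustrative:

```python
import random

random.seed(1)
rho, sigma2, n = 0.3, 1.0, 10   # correlation, per-variable variance, ensemble size

def correlated_mean():
    """Mean of n unit-variance variables with pairwise correlation rho,
    built as sqrt(rho)*shared + sqrt(1-rho)*independent."""
    shared = random.gauss(0, 1)
    xs = [rho ** 0.5 * shared + (1 - rho) ** 0.5 * random.gauss(0, 1)
          for _ in range(n)]
    return sum(xs) / n

trials = 20000
ms = [correlated_mean() for _ in range(trials)]
mu = sum(ms) / trials
var = sum((m - mu) ** 2 for m in ms) / trials

print(var)  # close to rho*sigma2 + (1 - rho)/n * sigma2 = 0.37
```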
[00:58:15] But then you sort of have to spend your time implementing all these separate algorithms, which is oftentimes not the most efficient use of your time. [00:58:22] Another one that people would like to do is just use different training sets. [00:58:39] And again, in this case, you probably spent a lot of effort collecting your initial training set; you don't want your machine learning person to just come and recommend that you go collect a whole second training set, or something like that, to improve your performance. That's generally not the most helpful recommendation. [00:58:56] And so what we're going to cover now are these two other methods that we use to ensemble. One of them is called bagging, which is sort of trying to approximate having different training sets; I'll get into that quickly. And then you also have boosting. [00:59:19] And just so that you have a little bit of context,
we're gonna be [00:59:22] little bit of context we're gonna be using decision trees to talk of a lot [00:59:23] using decision trees to talk of a lot about these models and so bagging you [00:59:26] about these models and so bagging you might have heard of random force that's [00:59:29] might have heard of random force that's a variant of bagging for decision trees [00:59:32] a variant of bagging for decision trees and then for boosting you might have [00:59:36] and then for boosting you might have heard of things like add a boost or XG [00:59:42] heard of things like add a boost or XG boost which are variants of boosting for [00:59:46] boost which are variants of boosting for decision trees okay so that sort of [00:59:53] decision trees okay so that sort of covers that a high level would want to [00:59:55] covers that a high level would want to do these first two are very nice because [00:59:57] do these first two are very nice because they're sort of would give us a much [00:59:58] they're sort of would give us a much more like independently correlated or [01:00:01] more like independently correlated or less correlated variables but generally [01:00:03] less correlated variables but generally we're we end up doing these latter two [01:00:06] we're we end up doing these latter two because we don't want to collect new [01:00:07] because we don't want to collect new training sets or train entirely new [01:00:08] training sets or train entirely new algorithms okay so let's cover bagging [01:00:12] algorithms okay so let's cover bagging first [01:00:21] okay so bagging really stands for this [01:00:24] okay so bagging really stands for this thing it's called bootstrap aggregation [01:00:26] thing it's called bootstrap aggregation okay and so first let's just break down [01:00:42] okay and so first let's just break down this term so bootstrap what that is is [01:00:44] this term so bootstrap what that is is this typically this method use and [01:00:45] this typically 
this method use and statistics to measure the uncertainty of [01:00:48] statistics to measure the uncertainty of your estimate okay and so what what is [01:00:52] your estimate okay and so what what is useful to define in this case for when [01:00:54] useful to define in this case for when you're talking about bagging is you can [01:00:56] you're talking about bagging is you can say that you have a true population P [01:01:06] say that you have a true population P okay and your training set training set [01:01:15] s is sampled from P you just are drawing [01:01:19] s is sampled from P you just are drawing a bunch of examples from P and that's [01:01:21] a bunch of examples from P and that's what forms your training set and so [01:01:24] what forms your training set and so ideally like for example this different [01:01:26] ideally like for example this different training sets approach what you do with [01:01:28] training sets approach what you do with you just draw s1 s2 s3 s4 and then train [01:01:31] you just draw s1 s2 s3 s4 and then train your model and each one [01:01:31] your model and each one separately unfortunately you generally [01:01:34] separately unfortunately you generally don't have the time to do that and so [01:01:36] don't have the time to do that and so what that what bootstrapping does is you [01:01:39] what that what bootstrapping does is you assume basically that your population is [01:01:44] assume basically that your population is your training sample okay so you assume [01:01:48] your training sample okay so you assume that your population is your training [01:01:50] that your population is your training sample and so now that you have this s [01:01:53] sample and so now that you have this s is approximating your P then you can [01:01:55] is approximating your P then you can draw new samples from your population by [01:01:58] draw new samples from your population by just drawing samples from s instead okay [01:02:01] just drawing samples 
from s instead okay so you have bootstrap samples is what [01:02:05] so you have bootstrap samples is what they're called z samples from s and so [01:02:15] they're called z samples from s and so how that works is you basically just [01:02:16] how that works is you basically just take your train your your training [01:02:19] take your train your your training sample okay say it's of like cardinality [01:02:21] sample okay say it's of like cardinality n or something and you're just sample n [01:02:23] n or something and you're just sample n times from s and this is important you [01:02:26] times from s and this is important you do it with replacement because they're [01:02:28] do it with replacement because they're pretending that this is a population and [01:02:30] pretending that this is a population and so doing it with replacement sort of [01:02:32] so doing it with replacement sort of makes it of something hold that you're [01:02:34] makes it of something hold that you're sampling from it as a population okay so [01:02:40] sampling from it as a population okay so that's bootstrapping so you generate all [01:02:42] that's bootstrapping so you generate all these different bootstrap samples Z on [01:02:44] these different bootstrap samples Z on your from your training set and what you [01:02:47] your from your training set and what you can do is you can take your model and [01:02:49] can do is you can take your model and train it on all these separate bootstrap [01:02:51] train it on all these separate bootstrap samples and then you can sort of look at [01:02:53] samples and then you can sort of look at the variability in the predictions that [01:02:55] the variability in the predictions that your model ends up making based on these [01:02:57] your model ends up making based on these different bootstrap samples and that [01:02:59] different bootstrap samples and that gives you sort of a measure of [01:03:00] gives you sort of a measure of uncertainty I'm not going 
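The bootstrap procedure just described can be sketched in a few lines; the function names below are my own, not the lecture's. It draws samples of size n from S with replacement, computes a statistic on each, and uses the spread of those estimates as the uncertainty measure. It also checks, numerically, the fact quoted a bit later in the lecture that each bootstrap sample contains roughly two-thirds of the distinct points of S.

```python
import random

def bootstrap_samples(s, m, seed=0):
    """Draw m bootstrap samples: each is len(s) draws from s WITH
    replacement, treating the training set s as if it were the population P."""
    rng = random.Random(seed)
    n = len(s)
    return [[s[rng.randrange(n)] for _ in range(n)] for _ in range(m)]

def bootstrap_uncertainty(s, statistic, m=2000):
    """Classic bootstrap use: the spread of a statistic across bootstrap
    samples estimates the uncertainty of that statistic computed on s."""
    ests = [statistic(z) for z in bootstrap_samples(s, m)]
    mean = sum(ests) / m
    sd = (sum((e - mean) ** 2 for e in ests) / m) ** 0.5
    return mean, sd

s = [2.1, 3.4, 1.8, 5.0, 2.7, 4.2, 3.9, 2.5]
est, se = bootstrap_uncertainty(s, lambda z: sum(z) / len(z))
print(f"mean of s ~ {est:.2f}, bootstrap standard error ~ {se:.2f}")

# each bootstrap sample keeps about 1 - (1 - 1/n)^n of the distinct points
zs = bootstrap_samples(list(range(1000)), 50)
frac = sum(len(set(z)) for z in zs) / (50 * 1000)
print(f"fraction of distinct points per sample ~ {frac:.2f}")  # ~ 0.63
```

The "about two-thirds" figure that comes up when discussing how far rho can be driven down is exactly this 1 - (1 - 1/n)^n quantity, whose large-n limit is 1 - 1/e, roughly 0.632.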
I won't go into too much detail on that, because it's not actually what we're going to use bootstrapping for. What we want to use bootstrapping for is to aggregate: at a very high level, we're going to take a bunch of bootstrap samples, train a separate model on each, and then average their outputs. Okay, so let's make that a little bit more formal.

[01:03:46] You have bootstrap samples Z_1 through Z_M, say, where capital M is just how many bootstrap samples you're going to take. You train a model G_m on each Z_m, and then all you're doing is defining a new meta-model (I'm not putting a subscript on this one, to show that it's the meta-model), G(x) = (1/M) * sum_{m=1}^{M} G_m(x): the sum of your individual models' predictions divided by the total number of models you have. This just writes out what I was describing up there: for bagging, you take bootstrap samples, train separate models, and aggregate them all together.

[01:05:08] And if we do a little bit of analysis on this from the bias-variance perspective, we can see why this kind of thing might work. [01:05:27] Recall the equation up there: the variance of the mean is rho * sigma^2 + ((1 - rho)/n) * sigma^2. Let me write that out here; in this case our n is really just the number of bootstrap samples, so we'll use capital M: Var = rho * sigma^2 + ((1 - rho)/M) * sigma^2. And what you're doing by taking these bootstrap samples is de-correlating the models you're training: the bootstrapping is driving rho down, and by driving it down you're making this first term
smaller and smaller. Then your question might be: okay, what about this second term here? It turns out that you can basically take as many bootstrap samples as you want: increasing M drives this second term down. And one nice thing about bootstrapping is that increasing the number of bootstrapped models you train doesn't actually cause you to overfit any more than you were overfitting beforehand, because all you're doing is driving down this term; more M just means less variance. So taking more and more bootstrap samples generally only improves performance, and what people will generally do is train more and more models until they see that their error stops going down, because that means they've basically eliminated this term over here.
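To make the aggregation concrete, here is a minimal, self-contained sketch of bagging; the helper names and the depth-one regression stump used as the base learner are my own illustration, not the lecture's notation. It trains one model per bootstrap sample and averages the predictions, i.e. the meta-model G(x) = (1/M) * sum_m G_m(x) defined above:

```python
import random

def bootstrap(data, rng):
    """One bootstrap sample: len(data) draws with replacement."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]

def train_stump(data):
    """Depth-one regression tree on pairs (x, y): pick the threshold
    minimizing squared error, predict the mean of y on each side."""
    best = None
    for t in sorted({x for x, _ in data}):
        left = [y for x, y in data if x <= t]
        right = [y for x, y in data if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - (lm if x <= t else rm)) ** 2 for x, y in data)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def train_bagged(data, m, seed=0):
    """Bagging: train one model G_m per bootstrap sample Z_m and return
    the meta-model G(x) = (1/M) * sum_m G_m(x)."""
    rng = random.Random(seed)
    models = [train_stump(bootstrap(data, rng)) for _ in range(m)]
    return lambda x: sum(g(x) for g in models) / m

data = [(x, 1.0 if x > 5 else 0.0) for x in range(11)]
G = train_bagged(data, m=25)
print(G(8.0), G(2.0))  # close to 1.0 and close to 0.0
```

Raising m here only averages in more bootstrapped models, which, as noted above, lowers variance without overfitting any further.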
So this seems kind of nice, right? You're decreasing the variance; where's the trade-off coming in? Oh, there's a question.

[01:07:32] Yeah, there's definitely a bound, though I'm not going to define one formally right now. The question is: can you define a bound on how much you decrease rho by? There's definitely a lower bound on how far you can decrease rho. It basically comes down to the fact that your bootstrap samples are still fairly highly correlated with one another, because they're all drawn from the same sample set S; each Z is going to contain about two-thirds of S, so your Z's are still going to be fairly highly correlated with each other. And though I don't have a formal equation to write down for exactly how much that bounds rho by, you can sort of see intuitively that there is a bound there, and that you can't just magically decrease rho all the way down to zero and achieve zero variance.

[01:08:30] So, I was saying that you decrease variance; this seems very nice. One issue that comes up with bootstrapping is that you're actually slightly increasing the bias of your models when you do this, and the reason is the subsampling I was just talking about: each of your Z's is now about two-thirds of the original S, so you're training on less data, and your models become slightly less complex, which increases your bias in this case. Yes?

[01:09:14] Yeah, for sure. So the question is: can you explain the difference between a random variable and an algorithm in this case? At a very high level, you can think of an algorithm as a classifier, as a function
[01:09:28] as a classifier that as a function that's taking in some data and making a [01:09:30] that's taking in some data and making a prediction right and if you sort of see [01:09:34] prediction right and if you sort of see those that whole set up as sort of like [01:09:36] those that whole set up as sort of like probably the algorithm is giving some [01:09:38] probably the algorithm is giving some sort of output in the problem holistic [01:09:39] sort of output in the problem holistic perspective you can sort of see the [01:09:41] perspective you can sort of see the algorithm as like a random variable in a [01:09:44] algorithm as like a random variable in a case in this case sort of like you're [01:09:46] case in this case sort of like you're basically considering sort of the space [01:09:49] basically considering sort of the space of possible predictions that your [01:09:51] of possible predictions that your algorithm can make and that you can sort [01:09:53] algorithm can make and that you can sort of see as a distribution of possible [01:09:55] of see as a distribution of possible predictions and that you can approximate [01:09:58] predictions and that you can approximate that as a random variable I mean it is a [01:09:59] that as a random variable I mean it is a random variable at some level because [01:10:01] random variable at some level because it's sort of like based on what training [01:10:04] it's sort of like based on what training sample you end up with your predictions [01:10:06] sample you end up with your predictions of your output model are going to change [01:10:08] of your output model are going to change and so since you're sampling sort of [01:10:10] and so since you're sampling sort of these random samples from your [01:10:12] these random samples from your population set you can consider your [01:10:15] population set you can consider your algorithm as sort of based on that [01:10:16] algorithm as sort of based on that random sample and 
therefore random [01:10:18] random sample and therefore random variable itself okay so yeah your [01:10:24] variable itself okay so yeah your bicycle increased because of random [01:10:31] bicycle increased because of random subsampling [01:10:39] but generally the decrease in variance [01:10:42] but generally the decrease in variance that you get from doing this it's much [01:10:44] that you get from doing this it's much larger than the slight increase in bias [01:10:46] larger than the slight increase in bias you get from from doing this random life [01:10:49] you get from from doing this random life subsampling so in a lot of cases bagging [01:10:51] subsampling so in a lot of cases bagging is quite nice [01:11:08] okay so I've talked a bit about buying [01:11:11] okay so I've talked a bit about buying about bagging let's talk about decision [01:11:13] about bagging let's talk about decision trees plus bagging now okay so you [01:11:25] trees plus bagging now okay so you recall that decision trees are high [01:11:29] recall that decision trees are high variance low bias and this right here [01:11:40] variance low bias and this right here sort of explains why they're pretty good [01:11:42] sort of explains why they're pretty good fit for bagging okay because bagging [01:11:44] fit for bagging okay because bagging what you're doing is you're decreasing [01:11:45] what you're doing is you're decreasing the variance of your models for a slight [01:11:48] the variance of your models for a slight increase in bias and since most of your [01:11:50] increase in bias and since most of your error from your decision trees is coming [01:11:52] error from your decision trees is coming from the high variance side of things by [01:11:55] from the high variance side of things by sort of driving down that variance you [01:11:57] sort of driving down that variance you get a lot more benefit than for a model [01:11:59] get a lot more benefit than for a model that would be on the 
reverse: high bias and low variance. All right, so this makes decision trees an ideal fit for bagging.

[01:12:24] Okay, so that was decision trees plus bagging. I said that random forests are sort of a version of decision trees plus bagging, and what I've described here is actually almost a random forest at this point. The one key piece we're still missing is that random forests introduce even more randomization into each individual decision tree. The idea, as in that question from before, is that you can only drive this rho down so far through pure bootstrapping; but if you can further de-correlate your different random variables, you can drive that variance down even further. And the way random forests do that is, at each split, you consider only a fraction of your total features. [01:13:47] So, for that ski example, maybe for the first split I only let it look at latitude, and then for the second split I only let it look at the time of year. This might seem a little bit unintuitive at first, but you can get the intuition in two ways. One is that you're decreasing rho. The other: say you have a classification example with one very strong predictor that gets you very good performance on its own; regardless of which bootstrap sample you select, your model is probably going to use that predictor as its first split, and that's going to cause all your models to be very highly correlated right at that first split, for example. By instead forcing the trees to sample from different features, you decrease the correlation between your models. So it's all about de-correlating your models in this case.
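The split-level feature restriction can be sketched as below. Everything here (the helper names, the toy data, and the choice of roughly sqrt(d) out of d features, which is a common default rather than something stated in the lecture) is my own illustration:

```python
import math
import random

def best_split(data, feature_ids):
    """Among the allowed features only, pick the (feature, threshold)
    pair minimizing the misclassification count for labels in {0, 1}.
    data: list of (x_vector, y) pairs."""
    best = None
    for j in feature_ids:
        for t in sorted({x[j] for x, _ in data}):
            left = [y for x, y in data if x[j] <= t]
            right = [y for x, y in data if x[j] > t]
            if not left or not right:
                continue
            # each side predicts its majority class
            err = min(sum(left), len(left) - sum(left)) + \
                  min(sum(right), len(right) - sum(right))
            if best is None or err < best[0]:
                best = (err, j, t)
    return best

def random_forest_split(data, n_features, rng):
    """The random-forest twist: at each split, consider only a random
    subset of ~sqrt(d) features, further de-correlating the trees."""
    k = max(1, int(math.sqrt(n_features)))
    return best_split(data, rng.sample(range(n_features), k))

# toy data: the label is exactly feature 0, the "one strong predictor"
data = [((i % 2, i % 3, i % 5, i % 7), i % 2) for i in range(30)]
print(best_split(data, [0]))                           # (0, 0, 0): perfect
print(random_forest_split(data, 4, random.Random(0)))  # may not see feature 0
```

With all features visible, every tree would split on feature 0 first; restricting the candidate set per split is what forces the trees apart.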
[01:14:51] Okay, and that pretty much closes out our discussion of bagging. Are there any questions regarding bagging? Okay.

[01:15:05] Now that I've covered bagging, let's get a little bit into boosting, and I'll make this quick. Basically, whereas with bagging we saw in the intuition that we were decreasing variance, boosting is actually more of the opposite: you're decreasing the bias of your models. It's also more additive in how it does things: you'll recall that for bagging you were taking the average of a number of variables, whereas in boosting you train one model and add its prediction into your ensemble, then train a new model and add that one in as well. That's a little bit hand-wavy right now, so let me make it clear through an example. [01:16:15] Say you have a data set again, with features x1 and x2, and you have some data points; let's just call them pluses and minuses. Say you have some pluses here, and then maybe a couple of minuses and pluses over here. And say you're training size-one decision trees, "decision stumps" as we call them: you only get to ask one question at a time. The reason behind this, really quickly, is that by restricting your trees to be only depth one you're increasing their amount of bias and decreasing their amount of variance, which makes them a better fit for boosting-style methods. Say you come up with a decision boundary, say this one here: on this side you predict positive, and on this side you predict negative. It's a reasonable line that you could draw here, but it's not perfect,
right? You've made some mistakes. In fact, you can identify those mistakes; if we draw them in red, you've gotten these guys wrong. What boosting does is basically increase the weights of the mistakes you've made, so that the next decision stump you train is trained on this modified, re-weighted set, which I'll draw over here. So now I'll draw the positives you got wrong much bigger: you've got big positives here, some small negatives and small positives, and some big negatives here. And now your model, to try to get those right, might pick a decision boundary like this. This is also basically recursive, in that at each step you're going to re-weight each of the examples based on how many of your previous models got it wrong or right in the past.

[01:18:20] And what you also do is weight each one of these classifiers: for classifier G_m you can determine a weight alpha_m that's proportional to how well it did, so a better classifier gets more weight and a bad classifier gets less. I think the exact equation used in AdaBoost, for example, is the log odds, alpha_m = log((1 - err_m) / err_m), where err_m is the error of your m-th model. [01:19:15] Then your total classifier, let's call it G(x) again, is just the sum over m of alpha_m * G_m(x), where each G_m is trained on a re-weighted version of the data set. I've glossed over a lot of the details here in the interest of time, but the specifics of the algorithm will be in the lecture notes. This algorithm is actually known as AdaBoost.
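Here is a compact, self-contained AdaBoost sketch on 1-D data with decision stumps. The function names are my own, labels are taken in {-1, +1}, and I use the common textbook weight alpha_m = (1/2) * log((1 - err_m)/err_m), which differs from the plain log odds quoted above only by a constant factor:

```python
import math

def stump_train(data, w):
    """Weighted decision stump on 1-D data with labels in {-1, +1}:
    try every threshold and orientation, keep the lowest weighted error."""
    best = None
    for t in sorted({x for x, _ in data}):
        for sign in (1, -1):
            err = sum(wi for (x, y), wi in zip(data, w)
                      if (sign if x > t else -sign) != y)
            if best is None or err < best[0]:
                best = (err, t, sign)
    err, t, sign = best
    return err, (lambda x: sign if x > t else -sign)

def adaboost(data, rounds):
    """Each round: fit a stump to the weighted data, give it weight
    alpha = 0.5*log((1-err)/err), then up-weight the examples it got
    wrong; the final classifier is the sign of the weighted sum."""
    n = len(data)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, g = stump_train(data, w)
        err = min(max(err, 1e-10), 1 - 1e-10)   # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, g))
        # boost the weights of the examples this stump got wrong
        w = [wi * math.exp(-alpha * y * g(x)) for (x, y), wi in zip(data, w)]
        z = sum(w)
        w = [wi / z for wi in w]
    return lambda x: 1 if sum(a * g(x) for a, g in ensemble) > 0 else -1

# a pattern no single stump can get right: -1 only in the middle band
data = [(x, -1 if 3 <= x <= 5 else 1) for x in range(10)]
G = adaboost(data, rounds=5)
print(sum(G(x) == y for x, y in data), "of", len(data))  # 10 of 10
```

No single stump classifies this toy set correctly (the best one gets 7 of 10), but a few boosted stumps together drive the training error to zero, which is the bias-reduction behavior described above.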
you can [01:20:14] through similar techniques you can derive algorithms such as XG boost or [01:20:16] derive algorithms such as XG boost or gradient boosting machines that also [01:20:19] gradient boosting machines that also allow you to basically re-weight the [01:20:21] allow you to basically re-weight the examples you're getting right or wrong [01:20:22] examples you're getting right or wrong in this sort of dynamic fashion and [01:20:24] in this sort of dynamic fashion and slowly adding them in is additive [01:20:26] slowly adding them in is additive fashion to your composite model and that [01:20:29] fashion to your composite model and that about finishes it for today thanks for [01:20:32] about finishes it for today thanks for coming great rest of your week ================================================================================ LECTURE 011 ================================================================================ Lecture 11 - Introduction to Neural Networks | Stanford CS229: Machine Learning (Autumn 2018) Source: https://www.youtube.com/watch?v=MfIjxPh6Pys --- Transcript [00:00:03] hello everyone welcome to CS 2 to 9 [00:00:08] hello everyone welcome to CS 2 to 9 today we're going to talk about deep [00:00:11] today we're going to talk about deep learning and neural networks we're going [00:00:15] learning and neural networks we're going to have two lectures on that one today [00:00:17] to have two lectures on that one today and a little bit more of it on Monday [00:00:21] and a little bit more of it on Monday don't hesitate to ask questions during [00:00:24] don't hesitate to ask questions during the lecture so stop me if you don't [00:00:26] the lecture so stop me if you don't understand something and we'll try to [00:00:27] understand something and we'll try to build the intuition around your own [00:00:29] build the intuition around your own Network together we will actually start [00:00:31] Network together we will actually start with an 
We will actually start with an algorithm that you guys have seen previously, called logistic regression. Everybody remembers logistic regression? Okay, remember it's a classification algorithm. We're going to explain how logistic regression can be interpreted as a specific case of a neural network, and then we will move on to neural networks. Sounds good? [00:00:53] So, a quick intro to deep learning. Deep learning is a set of techniques that is, let's say, a subset of machine learning, and it's one of the growing sets of techniques being used in industry, specifically for problems in computer vision, natural language processing, and speech recognition. So you guys have a lot of different tools and plugins on your smartphones that use this type of algorithm. [00:01:29] The reason it came to work very well is primarily the new computational methods.
So one thing we're going to see today is that deep learning is really, really computationally expensive, and people had to find techniques to parallelize the code and use GPUs, graphics processing units, specifically, in order to be able to carry out these computations. [00:01:56] The second part is the data: the data available has been growing, after the internet bubble, with the digitalization of the world. So now people have access to large amounts of data, and this type of algorithm has the specificity of being able to learn when there's a lot of data. These models are very flexible, and the more data you give them, the more they will be able to understand the salient features of the data. [00:02:24] And finally, algorithms: people have come up with new techniques to use the data, use the computational power, and build models. So we're going to touch a little bit on all of that.
But let's go with logistic regression first. Can you guys see in the back? Yeah? Okay. So, you remember what logistic regression is. We're going to fix a goal for us that is a classification goal: let's try to find cats in images. [00:03:12] Find cats in images, meaning binary classification: if there is a cat in the image, we want to output a number that is close to one, presence of the cat, and if there is no cat in the image, we output zero. Let's say for now we're constrained to the fact that there is at most one cat, no more. [00:03:44] If you had to draw the logistic regression model, this is what you would do. You would take a cat, so this is an image of the cat (I'm very bad at drawing, sorry). In computer science you know that images can be represented as 3D matrices. So if I tell you that this is a color image of size 64 by 64, how many numbers do I need to represent those pixels?
[00:04:20] Yeah, I heard it: 64 by 64 by 3. The 3 is for the RGB channels, red, green, blue: every pixel in an image can be represented by three numbers, one for the red filter, one for the green filter, and one for the blue filter. So actually this image is of size 64 times 64 times 3. Does that make sense? [00:04:46] So the first thing we will do, in order to use logistic regression to find whether there is a cat in this image, is flatten it into a vector. I'm going to take all the numbers in this matrix and flatten them into a vector; it's just an image-to-vector operation, nothing more. And now I can use my logistic regression, because I have a vector input. [00:05:09] So I'm going to take all of these and push them into an operation, let me call it the logistic operation, which has one part that is wx + b, where x is going to be the image, and a second part that is going to be the sigmoid.
Everybody's familiar with the sigmoid function: the function that takes a number between minus infinity and plus infinity and maps it between 0 and 1. It's very convenient for classification problems. And this we're going to call ŷ (y hat), which is the sigmoid of wx + b. What you've seen in class previously, I think, is theta transpose x, but here we will just separate the notation into w and b. [00:06:04] So can someone tell me, what's the shape of w? [00:06:24] Yeah, 64 by 64 by 3, yeah. So you know that this guy here is a column vector of size 64 by 64 by 3: the shape of x is going to be (64 × 64 × 3) by 1, and that, I think, is twelve thousand two hundred eighty-eight (12,288). And indeed, because we want ŷ to be 1 by 1, this w has to be 1 by 12,288. Does that make sense? So we have a row vector as our parameter; we're just changing the notation of the logistic regression that you guys have seen.
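The shapes being described here can be checked in a few lines (a minimal NumPy sketch with a made-up random image, not code from the lecture):

```python
import numpy as np

def sigmoid(z):
    """Map any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Made-up 64x64 color image: a 3-D array of 64 * 64 * 3 = 12,288 numbers.
rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))

# Image-to-vector operation: flatten into a (12288, 1) column vector.
x = image.reshape(-1, 1)

# Parameters: w is a (1, 12288) row vector, b a single bias.
w = rng.standard_normal((1, x.shape[0])) * 0.01
b = 0.0

# Logistic regression: y_hat = sigmoid(wx + b), a (1, 1) number in (0, 1).
y_hat = sigmoid(w @ x + b)
```

The `(1, 12288) @ (12288, 1)` product is what makes ŷ come out 1 by 1.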
So once we have this model, we need to train it. As you know, the process of training is that first we will initialize our parameters. These are what we call parameters, and we will use the specific vocabulary of weights and biases; I believe you guys have heard this vocabulary before, weights and biases. So we're going to find the right w and the right b in order to be able to use this model properly. [00:07:54] Once we've initialized them, we will optimize them, that is, find the optimal w and b, and after we've found the optimal w and b, we will use them to predict. [00:08:20] Does this training process make sense? And I think the important part is to understand what "find the optimal w and b" means. It means defining your loss function, which is the objective. And in machine learning you often have this specific problem where you have a function that you know you want to find, the network function,
but you don't know the values of its parameters. In order to find them, you're going to use a proxy, which is your loss function: if you manage to minimize the loss function, you will find the right parameters. [00:08:55] So you define a loss function, which is the logistic loss: L(ŷ, y) = −[y log ŷ + (1 − y) log(1 − ŷ)]. You guys have seen this one. You remember where it comes from? It comes from maximum likelihood estimation, starting from a probabilistic model. [00:09:23] And so the idea is: how can I minimize this function? Minimize, because I've put a minus sign here. I want to find the w and b that minimize this function, and I'm going to use a gradient descent algorithm, which means I'm going to iteratively compute the derivative of the loss with respect to my parameters, and at every step I will update them to make this loss function go down a little at each iteration.
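For reference, the derivative the class is asked to recall works out as follows (a standard derivation consistent with the logistic loss above; σ denotes the sigmoid):

```latex
\mathcal{L}(\hat{y}, y) = -\bigl[\,y \log \hat{y} + (1-y)\log(1-\hat{y})\,\bigr],
\qquad \hat{y} = \sigma(z),\quad z = wx + b,\quad \sigma'(z) = \sigma(z)\bigl(1-\sigma(z)\bigr).

\frac{\partial \mathcal{L}}{\partial z}
  = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z}
  = \frac{\hat{y} - y}{\hat{y}\,(1-\hat{y})} \cdot \hat{y}\,(1-\hat{y})
  = \hat{y} - y,

\frac{\partial \mathcal{L}}{\partial w} = (\hat{y} - y)\, x^{\top},
\qquad
\frac{\partial \mathcal{L}}{\partial b} = \hat{y} - y.
```

So each gradient-descent step is w ← w − α(ŷ − y)xᵀ and b ← b − α(ŷ − y), for some learning rate α.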
So in terms of implementation, this is a for loop: you will loop over a certain number of iterations, and at each one you will compute the derivative of your loss with respect to your parameters. [00:10:07] Everybody remembers how to compute this? You take the derivative, use the fact that the sigmoid function has a derivative that is sigmoid times (1 minus sigmoid), and you compute the result. We're going to do some derivatives later today; this is just to set up the problem. [00:10:29] So one of the things I want to touch on here first: how many parameters does this model, this logistic regression, have, if you had to count them? [00:10:47] Twelve thousand two hundred eighty-nine, yeah, correct: 12,288 weights and one bias. Does that make sense? So actually it's funny, because you can quickly count the parameters by just counting the number of edges on the drawing, plus one.
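The training loop described here can be sketched in NumPy (a toy illustration on made-up synthetic data, not the course's code; the dataset, learning rate, and iteration count are arbitrary choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up synthetic data standing in for flattened images:
# m examples of dimension n, one example per column.
rng = np.random.default_rng(1)
n, m = 20, 200
X = rng.standard_normal((n, m))
true_w = rng.standard_normal((1, n))   # hypothetical "true" weights
y = (true_w @ X > 0).astype(float)     # (1, m) labels in {0, 1}

w = np.zeros((1, n))                   # initialize the parameters
b = 0.0
alpha = 0.5                            # learning rate

for _ in range(500):                   # the for loop over iterations
    y_hat = sigmoid(w @ X + b)         # forward pass, shape (1, m)
    # Logistic-loss gradient: sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
    # collapses the chain rule to (y_hat - y).
    dz = y_hat - y
    dw = dz @ X.T / m                  # averaged over the m examples
    db = dz.mean()
    w -= alpha * dw                    # gradient descent update
    b -= alpha * db

train_accuracy = ((sigmoid(w @ X + b) > 0.5) == y).mean()
```

On this separable toy data the loop drives the training accuracy close to 1; on real images you would of course evaluate on held-out data instead.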
Every circle has a bias; every edge has a weight, because ultimately you can rewrite this operation like that, right? Every weight corresponds to an edge. So that's another way to count; we're going to use it a little further on. [00:11:21] So we're starting with not too many parameters, actually, and one thing we notice is that the number of parameters of our model depends on the size of the input. At some point we probably won't want that, so we're going to change it later. [00:11:35] So, two equations that I want you to remember. The first one is: neuron = linear + activation. This is the vocabulary we will use in neural networks: we define a neuron as an operation that has two parts, one linear part and one activation part. And it's exactly that here; this is actually a neuron.
We have a linear part, wx + b, and then we take the output of this linear part and put it into an activation that, in this case, is the sigmoid function; it can be other functions. Okay, so this is the first equation, not too hard. [00:12:17] The second equation that I want to set now is: model = architecture + parameters. What does that mean? It means that here we're trying to train a logistic regression; in order to be able to use it, we need an architecture, which is the following one-neuron neural network, and the parameters w and b. [00:12:48] So basically, when people in industry say "we've shipped a model", what they're saying is that they found the right parameters for the right architecture. They have two files, and these two files are predicting a bunch of things: one parameter file and one architecture file. [00:13:05] The architecture will be modified a lot today; we will add neurons all over.
The parameters will always be called w and b, but they will become bigger and bigger, because we have more data that we want to be able to understand. You can guess that it's going to be hard to understand what a cat is with only that many parameters; we want more parameters. Any questions so far? [00:13:32] So this was just to set up the problem with logistic regression. Let's set a new goal, following the first goal we set. The second goal would be: find cat, lion, iguana. [00:13:58] A little different than before: the only thing we've changed is that we now want to detect three types of animals. If there's a cat in the image, I want to know there is a cat; if there is an iguana in the image, I want to know there is an iguana; if there's a lion in the image, I want to know it as well. So how would you modify the network we previously had in order to take this into account? Yeah, good idea: put two more circles, so two more neurons, and do the same thing. [00:14:32] So we have our picture here with the cat.
So the cat image, of size 64 by 64 by 3: we flatten it into x₁ through xₙ, let's say, where n represents 64 × 64 × 3. And what I will do is use 3 neurons that all compute the same kind of thing; they're all connected to all these inputs. I connect all my inputs x₁ to xₙ to each of these neurons, and I will use a specific set of notations here: [00:15:43] ŷ₁ = a₁^[1] = sigmoid(w₁^[1]x + b₁^[1]), ŷ₂ = a₂^[1] = sigmoid(w₂^[1]x + b₂^[1]), and similarly ŷ₃ = a₃^[1] = sigmoid(w₃^[1]x + b₃^[1]). [00:16:11] So I'm introducing a few notations here, and we will get used to them, don't worry; just write this down and we're going to go over it. The square brackets here represent what we will later call a layer. If you look at this network, it looks like there is one layer here, one layer in which neurons don't communicate with each other.
We could add to it, and we will later on: more neurons, in other layers. We will then denote with square brackets the index of the layer; the subscript index on the a is the number identifying the neuron inside the layer. So here we have one layer: we have a₁, a₂, and a₃, with square bracket [1] to identify the layer. Does that make sense? And then we have our ŷ which, instead of being a single number as before, is now a vector of size 3. [00:17:08] So how many parameters does this network have? [00:17:28] Okay, how did you come up with that? Yeah, correct: we have three times what we had before, because we added two more neurons and they all have their own set of parameters; this edge is a separate edge from that one, so we have to replicate the parameters for each of them. So w₁^[1] would be the equivalent of what we had for the cat alone, but we have to add two more parameter vectors and biases.
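In matrix form, stacking the three neurons' weight row-vectors gives one matrix; a shape-checking sketch with made-up random values (not code from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n = 64 * 64 * 3                  # 12,288 inputs from the flattened image
rng = np.random.default_rng(2)
x = rng.random((n, 1))           # flattened image as a column vector

# Layer of 3 neurons: stack the three (1, n) weight row-vectors into a
# (3, n) matrix, and the three biases into a (3, 1) vector.
W1 = rng.standard_normal((3, n)) * 0.01
b1 = np.zeros((3, 1))

# y_hat is now a vector of size 3 (cat, lion, iguana), each entry an
# independent sigmoid: the neurons don't communicate with each other.
y_hat = sigmoid(W1 @ x + b1)     # shape (3, 1)

# Parameter count: one weight per edge plus one bias per neuron,
# i.e. three times the single-neuron count of 12,289.
num_params = W1.size + b1.size   # 3 * 12288 + 3 = 36867
```

The edges-plus-biases counting rule from the board is exactly `W1.size + b1.size` here.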
[00:17:58] So, another question: when you had to train this logistic regression, what dataset did you need? [00:18:13] Can someone try to describe the dataset? Yeah, correct: we need images, and labels with them, where an image is labeled one for cat or zero for no cat. So it's a binary classification with images and labels. [00:18:32] Now, what do you think the dataset to train this new network should be? Yes, that's a good idea. So, just to repeat: a label for an image that has a cat would probably be a vector with a one and two zeros, where the first entry represents the presence of a cat, the second the presence of a lion, and the third the presence of an iguana. [00:19:16] So let's assume I use this scheme to label my dataset. I train this network using the same techniques as before: initialize all my weights and biases with some starting value, optimize a loss function using gradient descent, and then use ŷ to predict.
[00:19:37] What do you think this neuron is going to be responsible for, if you had to describe the responsibilities of this neuron? Yes, this one, yeah: lion. And this one: iguana. [00:20:02] That's a good question, we're going to talk about that now: can an image contain different animals or not? So, going back to what you said: because we decided to label our dataset like that, after training, this neuron is not really going to be there to detect cats. If we had changed the labeling scheme, and said that the second entry corresponds to the presence of the cat, then after training you would find that this neuron is responsible for detecting the cat. So the network is going to evolve depending on the way you label your dataset. [00:20:37] Now, do you think that this network can still be robust to different animals in the same picture?
So this cat now has a friend that is a lion. Okay, I have no idea how to draw a lion, but let's say there is a lion here, and because there is a lion, I will add a one here. Do you think this network is robust to this type of labeling? [00:21:13] Hmm, "it should be, the neurons aren't talking to each other": that's a good answer, actually. [00:21:31] Another answer, and that's a good intuition: what the network sees is just (1, 1, 0) and an image. It doesn't see that the first entry corresponds to the cat and the second to the lion. So this is a property of neural networks: it's the fact that you don't need to tell them everything; if you have enough data, they're going to figure it out. [00:21:52] So, because you will also have cats with iguanas, cats alone, lions with iguanas, lions alone, ultimately this neuron will understand what it's looking for.
And it will understand that this entry corresponds to the lion; it just needs a lot of data. So yes, it's going to be robust, and for the reason you mentioned: because the three neurons aren't communicating together, we can totally train them independently from each other. And in fact, the sigmoid here doesn't depend on the sigmoid there, and doesn't depend on the same weights, which means we can have (1, 1, 1) as an output. [00:22:31] Yes, question? You could think about it as three logistic regressions; we wouldn't call that a neural network yet, it's not ready yet, but it's like three logistic regressions side by side. [00:22:51] Now, following up on that, yeah, go for it. The question is: w and b are related to what? Oh yeah, so usually you would have theta transpose x, which is the sum of θᵢxᵢ, correct?
of theta_i x_i, plus theta_0 times 1. [00:23:21] If I split it like that, theta_0 corresponds to b, and the theta_i's correspond to the w_i's. Makes sense? One more question. [00:23:45] Good question; that's the next thing we're going to see. The question is a follow-up on this: are there cases where we have a constraint that there is only one possible outcome? It means there is no "cat and lion"; there's either a cat or a lion. There is no "iguana and lion"; there's either an iguana or a lion. [00:24:05] Think about health care: there are many models that are made to detect whether a skin disease is present based on microscopic cell images. Usually there is no overlap between diseases; it means you want to classify a specific disease among a large number of diseases. [00:24:27] So this model would still work, but it would not be optimal, because it takes longer to train.
Maybe one disease is super, super rare, and one of the neurons is never going to be trained. [00:24:36] Let's say you're working in a zoo where there is only one iguana and there are thousands of lions and thousands of cats: this neuron will almost never train, you know; it would be super hard to train this one. [00:24:48] So you want to start with another model, where you put in the constraint that, okay, there is only one disease that we want to predict, and let the model learn with all the neurons learning together, by creating interaction between them. [00:25:01] Have you guys heard of softmax? Yes, some of you, I see that. Okay, so let's look at softmax a little bit together. [00:25:11] So we set a new goal now, which is that we add a constraint: a unique animal on an image, so at most one animal on an image. [00:25:35] So I'm going to modify the network a little bit. We have our cat, and there is no lion on the image. We flatten it, and now I'm going to use the same scheme with the
three neurons a1, a2, a3, but as an output, what I'm going to use is an exponential, the softmax function. [00:26:13] So let me be more precise; let me actually introduce another notation to make it easier. As you know, the neuron is a linear part plus an activation, so we're going to introduce a notation for the linear part: I'm going to introduce z_1^[1] to represent the linear part of the first neuron, and z_2^[1] to represent the linear part of the second neuron. [00:26:43] So now one neuron has two parts: one which computes z, and one which computes a equals sigmoid of z. Now I'm going to remove all the activations and replace them with these, and I'm going to use this specific formula. [00:27:24] So this, if you recall, is exactly the softmax formula. Okay, so now the network we have... can you guys see it? It's too small? Too small, okay. [00:27:59] I'm going to just write this formula bigger, and then you can figure out the others, I
guess: a_3^[1] equals the exponential of z_3^[1], divided by the sum over k from 1 to 3 of the exponential of z_k^[1]. [00:28:20] So here is the formula for the third one; if you are doing it for the first one, you just change this 3 into a 1, and for the second one, into a 2. [00:28:28] So why is this formula interesting, and why is it not robust to the earlier labeling scheme anymore? It's because the sum of the outputs of this network has to sum up to 1. You can try it: if you sum the three outputs, you get the same thing in the numerator and in the denominator, and you get 1. That makes sense? [00:28:49] So instead of getting a probabilistic output for each of y-hat-1, y-hat-2, y-hat-3, we get a probability distribution over all the classes. It means we cannot get 0.7, 0.6, 0.1, telling us roughly that there is probably a cat and a lion but no iguana; we have to sum these to one.
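The formula just described can be sketched in a few lines. This is a generic illustration, not the lecture's own code; the function name and the example values are mine, and it checks the property he points out: the three outputs always sum to 1.

```python
import numpy as np

def softmax(z):
    # a_k = exp(z_k) / sum_j exp(z_j); subtracting the max is a standard
    # numerical-stability trick that leaves the result unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, -1.0, 0.5])  # hypothetical linear parts z_1, z_2, z_3
a = softmax(z)                  # a probability distribution over cat/lion/iguana
```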
So it means that if there is no cat and no lion, there is very likely an iguana: the three probabilities are dependent on each other. [00:29:24] And for this one, we have to label the following way: 1 0 0 for a cat, 0 1 0 for a lion, or 0 0 1 for an iguana. So this is called a softmax multi-class network. [00:30:04] You assume there is at least one of the three classes; otherwise, you have to add a fourth output that will represent the absence of an animal. But this way, you assume there is always one of these three animals in every picture. [00:30:23] And how many parameters does the network have? The same as the second one: we still have three neurons, and although I didn't write it, this z_1 is equal to w_1 x plus b_1, z_2 the same, z_3 the same, so there are 3n + 3 parameters. [00:30:46] So one question that we didn't talk about is how we train these parameters, these 3n + 3 parameters. How do we train them? Do you think this scheme will
work or not? What's wrong with this scheme? What's wrong with the loss function, specifically? [00:31:15] There are only two outcomes: in this loss function, y hat is a probability, a number between 0 and 1, while y is either 0 or 1, so it cannot match this labeling. So we need to modify the loss function. [00:31:36] Let's call it loss 3N. What I'm going to do is just sum it up over the three neurons. [00:32:05] Does this make sense? So I'm just doing this loss three times, once for each of the neurons: we have exactly three times this, and we sum them together. And if you train with this loss function, you should be able to train the three neurons that you have. [00:32:24] And again, talking about scarcity of one of the classes: if there are not many iguanas, then the third term of this sum is not going to help this neuron train towards detecting an iguana; it's going to push it towards detecting "no iguana" instead.
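As a sketch, the summed loss described above (one binary logistic loss per neuron, added together) might look like this; the function name `loss_3n` and the example values are my own, not from the lecture:

```python
import numpy as np

def loss_3n(y, y_hat, eps=1e-12):
    # Sum of three independent binary cross-entropy terms, one per neuron;
    # y holds the three 0/1 labels, y_hat the three sigmoid outputs.
    y = np.asarray(y, dtype=float)
    y_hat = np.clip(np.asarray(y_hat, dtype=float), eps, 1 - eps)
    return float(-np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))

y_true = [1, 1, 0]                       # "cat and lion, no iguana" labeling
good = loss_3n(y_true, [0.9, 0.8, 0.1])  # predictions close to the labels
bad = loss_3n(y_true, [0.1, 0.2, 0.9])   # predictions far from the labels
```

Training decreases this sum, which trains all three neurons at once, exactly because each label contributes its own term.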
any question on the last function [00:32:47] Juana any question on the last function does this one make sense yeah yeah [00:33:02] does this one make sense yeah yeah usually that's what will happen is that [00:33:03] usually that's what will happen is that the output of this network once it's [00:33:06] the output of this network once it's trained is going to be a probability [00:33:07] trained is going to be a probability distribution you will pick the maximum [00:33:09] distribution you will pick the maximum of those and you will set it one and the [00:33:11] of those and you will set it one and the others to zero as your prediction one [00:33:17] others to zero as your prediction one more question yeah [00:33:28] if you use the two one if you use this [00:33:31] if you use the two one if you use this labeling skin-like one one zero for this [00:33:34] labeling skin-like one one zero for this network what do you think it will happen [00:33:40] it will probably not work and the reason [00:33:43] it will probably not work and the reason is this sum is equal to two there's some [00:33:47] is this sum is equal to two there's some of these entries while the sum of this [00:33:49] of these entries while the sum of this entry is equal to one so you will never [00:33:51] entry is equal to one so you will never be able to match the output to the input [00:33:54] be able to match the output to the input to the label it makes sense [00:33:56] to the label it makes sense so what the network is probably going to [00:33:58] so what the network is probably going to do is it's probably going to send this [00:34:00] do is it's probably going to send this one to one half this one to one half and [00:34:02] one to one half this one to one half and this one to zero probably which is not [00:34:04] this one to zero probably which is not what you want okay let's talk about the [00:34:09] what you want okay let's talk about the last function for this softmax [00:34:11] last function 
for this softmax regression. [00:34:22] Because, you know, what's interesting about this loss is, if I take the derivative of loss 3N with respect to w_2, do you think it's going to be harder than the earlier derivative, or no? [00:34:41] It's going to be exactly the same, because only one of these three terms depends on w_2; it means the derivatives of the two others are zero, so we're at exactly the same complexity during the derivation. [00:34:52] But this one: do you think, if you try to compute... let's say we define a loss function that corresponds roughly to that. If you try to compute the derivative of the loss with respect to w_2, it will become much more complex, because this number, the output here that directly impacts the loss function, not only depends on the parameters of w_2; it also depends on the parameters of w_1 and w_3. And same for this output: this output also depends on
the parameters of w_2, does it make sense, because of this denominator. [00:35:29] So the softmax regression needs a different loss function and a different derivative. The loss function we'll define is a very common one in deep learning; it's called the softmax cross-entropy loss. [00:35:50] I'm not going into the details of where it comes from, but you can get the intuition why. [00:36:14] So it, surprisingly, looks like the binary cross-entropy, the logistic loss function; the only difference is that we sum it up over all the classes. [00:36:30] Now, we will take the derivative of something that looks like that later, but I'd say you can try it at home on this one; it would be a good exercise. [00:36:46] So this cross-entropy loss is very likely to be used in classification problems that are multi-class. Okay, so this was the first part, on logistic regression types of networks, and I think we're ready now, with the notation that we introduced, to jump on to neural networks.
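A sketch of the softmax cross-entropy loss he names: sum over all classes of minus y_k log(y-hat_k), so with a one-hot label only the true class's term is nonzero. The function names and the example values here are mine:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(y_onehot, y_hat, eps=1e-12):
    # Like the binary cross-entropy, but summed over all the classes
    # (and without the (1 - y) term).
    return float(-np.sum(np.asarray(y_onehot) * np.log(np.clip(y_hat, eps, None))))

z = np.array([2.0, 0.1, -1.0])       # hypothetical linear parts z_1, z_2, z_3
y = [1, 0, 0]                        # one-hot label: "cat"
loss = cross_entropy(y, softmax(z))  # smaller when the "cat" output is larger
```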
Any question on this first part before we move on? [00:37:15] So one question I would have for you: let's say that instead of trying to predict whether there is a cat or no cat, we were trying to predict the age of the cat based on the image. What would you change in this network? Instead of predicting one or zero, you want to predict the age of the cat; what are the things you would change? [00:37:43] Yes. Okay, so I repeat: you basically make several output nodes, where each of them corresponds to one age of cat. So would you use this network or the third one? Would you use the three-neuron network or the softmax regression? The third one. Why? You have a unique age; you cannot have two ages, right? So we would use the softmax one, because we want a probability distribution over the ages. [00:38:23] Okay, that makes sense; that's a good approach. There is also another
[00:38:31] approach, which is using regression directly to predict an age. An age can be between 0 and plus infinity... not plus infinity, 0 and a certain number. [00:38:44] So let's say you want to do a regression; how would you modify your network? Change the sigmoid: the sigmoid puts the output between 0 and 1, and we don't want this to happen, so I'd say we will change the sigmoid into... what function would you change the sigmoid into? [00:39:09] Yes. So the second one you said was to get a positive, unbounded type of output. Okay, so let's go with linear; you mentioned linear. We could just use a linear function in place of the sigmoid, but then this becomes a linear regression; the whole network becomes a linear regression. [00:39:30] Another one that is very common in deep learning is called the ReLU function. It's a function that is almost linear, but for every input that is negative, it's equal to zero. Because we cannot have
a negative age, it makes sense to use this one. [00:39:46] Okay, so this is called a rectified linear unit, ReLU; it's a very common one in deep learning. Now, what else would you change? [00:39:56] We talked about linear regression; do you remember the loss function you were using in linear regression? What was it? It was probably one of these two: y hat minus y, just a comparison between the label y and the prediction y hat, or the L2 loss, y hat minus y in L2 norm. [00:40:20] So that's what we would use: we would modify our loss function to fit the regression type of problem. And the reason we would use this loss for a regression task, instead of the one we have for classification, is because, in optimization, the shape of this loss is much easier to optimize for a regression task than it is for a classification task, and vice versa. I'm not going to go into the details of that, but that's the intuition. [00:40:46] Okay, let's go have fun with neural networks.
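The two changes described for the age-regression version, a ReLU in place of the sigmoid at the output and an L2 loss, can be sketched as follows; all names and values here are mine, for illustration only:

```python
import numpy as np

def relu(z):
    # Rectified linear unit: identity for positive inputs, zero for negative ones,
    # so a predicted age can never be negative.
    return np.maximum(0.0, z)

def l2_loss(y_hat, y):
    # Squared-error loss for the regression version of the problem.
    return float((y_hat - y) ** 2)

age_pred = relu(-0.7)  # a hypothetical negative linear output, clamped to 0
```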
[00:41:10] So we stick to our first goal: given an image, tell us if there is a cat or no cat; this is one, this is zero. [00:41:28] But now we're going to make the network a little more complex; we're going to add some parameters. So I get my picture of the cat... the cat is moving, okay. [00:41:45] And what I'm going to do is put more neurons than before, maybe something like that. [00:42:35] So, using the same notation, you see that my square bracket here is 2, indicating that there is a layer here which is the second layer, while this one is the first layer and this one is the third layer. [00:42:56] Everybody's up to speed with the notations? Cool. [00:43:04] So now, notice that when you make a choice of architecture, you have to be careful of one thing: the output layer has to have the same number of neurons as the number of classes you want for a classification, and one for a regression. [00:43:27] So how many
parameters does this network have? Can someone quickly give me the thought process? How much here? [00:43:41] Yeah, like 3n plus 3, let's say. [00:43:59] Yeah, correct. So here you would have 3n weights plus three biases; here you would have two times three weights plus two biases, because you have three neurons connected to two neurons; and here you would have two times one weights plus one bias. This is the total number of parameters. [00:44:18] So you see that we didn't add too many parameters; most of the parameters are still in the input layer. [00:44:28] Let's define some vocabulary. The first word is "layer". A layer denotes neurons that are not connected to each other: these two neurons are not connected to each other, and these three neurons are not connected to each other. We call such a cluster of neurons a layer, and this network has three layers. [00:44:44] We would use "input layer" to denote the first layer, and "output layer" to denote the third layer, because it directly sees the output, and we would call the second layer a hidden layer.
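The count walked through above (3n + 3 for the first layer, then 2 times 3 plus 2, then 2 times 1 plus 1) is just "weights = fan-in times fan-out, plus one bias per neuron", applied layer by layer. A small sketch, with a made-up input size n:

```python
def count_params(layer_sizes):
    # layer_sizes = [n_inputs, layer_1, layer_2, ...]; each layer contributes
    # fan_in * fan_out weights plus fan_out biases.
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]))

n = 4                               # hypothetical flattened-input size
total = count_params([n, 3, 2, 1])  # the 3-2-1 network on the board
```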
[00:44:59] The reason we call it hidden is because the input and the output are hidden from this layer: the only thing this layer sees as input is what the previous layer gives it. So it's an abstraction of the inputs, but it's not the input. Does that make sense? And similarly, it doesn't see the output; it just gives what it understood to the last neuron, which will compare the output to the ground truth. [00:45:28] So now, why are neural networks interesting, and why do we call this a hidden layer? It's because, if you train this network on cat classification with a lot of images of cats, you would notice that the first layers are going to understand the fundamental concepts of the image, which are the edges: this neuron is going to be able to detect this type of edge, this neuron is probably going to detect some other type of edge, and this neuron maybe this type of edge. Then what's going to
happen is that these neurons are going to communicate what they found in the image to the next layer's neurons, [00:46:05] and this neuron is going to use the edges that these guys found to figure out, oh, there are ears, while this one is going to figure out, oh, there is a mouth, and so on, if you have several neurons. And they're going to communicate what they understood to the output neuron, which is going to reconstruct the face of the cat based on what it received, and be able to tell whether there is a cat or not. [00:46:29] So the reason it's called a hidden layer is because we don't really know what it's going to figure out, but with enough data, it should understand very complex information about the data. The deeper you go, the more complex the information the neurons are able to understand. [00:46:45] Let me give you another example, which is a house price prediction example.
[00:47:12] So let's assume that our inputs are: number of bedrooms, size of the house, zip code, and wealth of the neighborhood. [00:47:30] Let's say that what we will build is a network that has three neurons in the first layer and one neuron in the output layer. [00:47:42] So what's interesting is that, as a human, if you were to build this network and, like, hand-engineer it, you would say that, okay, zip code and wealth are able to tell us about the school quality in the neighborhood, the quality of the school that is next to the house; probably, as a human, you would say these are good features to predict that. [00:48:13] The zip code is going to tell us whether the neighborhood is walkable or not, probably. [00:48:26] The size and the number of bedrooms are going to tell us the size of the family that can fit in this house. And these three are probably better information than these
in order to finally predict the price. So that's a way to hand-engineer it, as a human, in order to give human knowledge to the network to figure out the price. In practice, what we do here is that we use a fully connected layer. [00:49:02] Fully connected — what does it mean? It means that we connect every input to the first layer, every output of the first layer to the input of the next layer, and so on; all the neurons from one layer to the next are connected with each other. What we're saying is that we will let the network figure these out: we will let the neurons of the first layer figure out what's interesting for the second layer to make the price prediction. So we will not tell these to the network; instead, we will fully connect the network and let it figure out what are the
interesting features — and oftentimes the network is going to be better than humans at finding the features that are representative. Sometimes you may hear neural networks referred to as black-box models. The reason is that we will not understand what a given edge corresponds to; it's hard to figure out that this neuron is detecting a weighted average of the input features. Does it make sense? Another term you might hear is end-to-end learning. The reason we talk about end-to-end learning is that we have an input and a ground truth, and we don't constrain the network in the middle — we let it learn whatever it has to learn. We call it end-to-end learning because we're just training based on the input and the output. [00:51:14] Let's delve more into the math of this network — the neural network that we have here, which has an input layer, a hidden layer, and an output layer. Let's try to
write down the equations that take the input and forward-propagate it through to the output. We first have z1, the linear part of the first layer, which is computed as w1 times x plus b1. Then this z1 is given to an activation — let's say it's sigmoid — so a1 is sigmoid of z1. z2 is then the linear part of the second layer, which takes the output of the previous layer, multiplies it by its weights, and adds its bias. The second activation takes the sigmoid of z2. Finally we have the third layer, which multiplies its weights with the output of the layer preceding it and adds its bias; and finally we have the third activation, which is simply the sigmoid. [00:52:39] So what is interesting to notice, between these equations and the equations that we wrote here, is that we put everything in matrices.
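These forward-propagation equations can be sketched in NumPy — a minimal illustration only, assuming the 4-input housing example with layer sizes 3, 2, 1, sigmoid activations everywhere, and randomly initialized weights:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)

n = 4                                  # bedrooms, size, zip code, wealth
x = rng.random((n, 1))                 # one example, a column vector

# Randomly initialized parameters for a 4 -> 3 -> 2 -> 1 network
W1, b1 = rng.standard_normal((3, n)), np.zeros((3, 1))
W2, b2 = rng.standard_normal((2, 3)), np.zeros((2, 1))
W3, b3 = rng.standard_normal((1, 2)), np.zeros((1, 1))

# Forward propagation: z = Wx + b, a = sigmoid(z), layer by layer
z1 = W1 @ x + b1; a1 = sigmoid(z1)     # shape (3, 1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)    # shape (2, 1)
z3 = W3 @ a2 + b3; a3 = sigmoid(z3)    # shape (1, 1) -- the prediction
```

Each matrix line stands in for all the neurons of one layer at once, which is exactly the point being made about putting everything in matrices.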
It means that for the three neurons I have here, I wrote a single equation, and likewise for the neurons in the second layer — one equation summarizes them — but the shapes of these things are going to be vectors. So let's go over the shapes; let's try to define them. z1 is going to be w1 times x: x is n by 1, and w1 has to be 3 by n because it connects three neurons to the input, so z1 has to be 3 by 1. It makes sense, because we have three neurons. Now let's go deeper. a1 is just the sigmoid of z1, so it doesn't change the shape; it keeps the 3 by 1. z2 — we know it has to be 2 by 1, because there are two neurons in the second layer, and it helps us figure out what w2 would be: we know a1 is 3 by 1, which means w2 has to be 2 by 3. And if you count the edges between the first and the second layer here, you will find 6 edges — 2 times 3. a2: same shape as z2. z3: 1 by 1. a3: 1 by 1. w3: it has to be 1 by 2
because a2 is 2 by 1. It's the same for the b's: each matches the number of neurons, so 3 by 1, 2 by 1, and finally 1 by 1. So I think it's usually very helpful, even when coding these types of equations, to know all the shapes that are involved. Are you guys totally OK with the shapes? Super easy to figure out? OK, cool. So now, what is interesting is that we will try to vectorize the code even more. Does someone remember the difference between stochastic gradient descent and gradient descent? What's the difference? [00:55:13] Exactly. So stochastic gradient descent is: update the weights and the biases after you see every example — so the direction of the gradient is quite noisy; it doesn't represent the entire batch very well. While gradient descent, or batch gradient descent, is: update after you've seen the whole batch of examples — the gradient is much more precise; it points in the direction you want to go.
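That distinction can be sketched on a toy problem — a hypothetical one-parameter least-squares fit, not the network above — where the only difference between the two loops is how many examples each update sees:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random(100)
y = 3 * X + rng.normal(0, 0.1, 100)    # noisy line, true slope = 3

def grad(w, xb, yb):
    # dL/dw for the squared loss L = mean((w*x - y)^2)
    return np.mean(2 * (w * xb - yb) * xb)

alpha = 0.1

# Batch gradient descent: one precise update per pass over ALL examples
w_batch = 0.0
for _ in range(200):
    w_batch -= alpha * grad(w_batch, X, y)

# Stochastic gradient descent: one noisy update per single example
w_sgd = 0.0
for _ in range(2):                     # two passes over the data
    for xi, yi in zip(X, y):
        w_sgd -= alpha * grad(w_sgd, xi, yi)
```

Both end up near the true slope of 3; the stochastic version just wanders there through noisier steps, one example at a time.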
So what we're trying to do now is to write down these equations if, instead of giving one single cat image, we had given a bunch of images that each either have a cat or not. So what happens for an input batch of examples? [00:56:22] Now our X is not a single column vector anymore; it's a matrix, with the first image corresponding to x(1), the second image corresponding to x(2), and so on, until the m-th image corresponding to x(m). And I'm introducing a new notation, the parenthesized superscript, corresponding to the index of the example. [00:56:55] So: square brackets for the layer, round brackets for the index of the example we're talking about. Just to give more context on what we're trying to do: we know that this is a bunch of operations — we just have a network with input, hidden, and output layers, but we could have a network with a thousand layers. The more layers we have, the more computation, and it quickly
goes up. So what we want to do is to be able to parallelize our code — our computation — as much as possible, by giving batches of inputs and parallelizing these equations. So let's see how these equations are modified when we give the network a batch of m inputs. [00:57:41] I will use capital letters to denote the equivalent of the lowercase letters, but for a batch of inputs. So Z1, as an example, would be w1 — let's use the same w1 — times X plus b1. Let's analyze what Z1 would look like. We know that for every input example of the batch we will get one z1, so it should look like this. [00:58:29] Then we have to figure out what the shapes in this equation have to be in order to end up with this. We know that z1 was 3 by 1; it means capital Z1 has to be 3 by m, because each of these column vectors is 3 by 1 and we have m of them — because for each input we forward-propagate
through the network, and we get these equations: for the first cat image we get these equations, for the second cat image we again get equations like that, and so on. So what is the shape of X? We have it above; we know that it's n by m. What is the shape of w1? It didn't change — w1 doesn't change. It's not because I give a thousand inputs to my network that there are going to be more parameters; the number of parameters stays the same even if I give more inputs. And so this has to be 3 by n in order to match. Now, the interesting thing is that there is an algebraic problem here. What is the algebraic problem? We said that the number of parameters doesn't change; it means that W has the same shape as it had before, and b should have the same shape as it had before, right? It should be 3 by 1. So what's the problem with this equation? Exactly: we're summing a 3 by
m matrix to a 3 by 1 vector. This is not possible — it doesn't work, it doesn't match. When you do summations or subtractions, you need the two terms to be the same shape, because you do an element-wise addition or an element-wise subtraction of them. So what's the trick that is used here? It's a technique called broadcasting. [01:00:41] Broadcasting is the fact that we don't want to change the number of parameters — it should stay the same — but we still want this operation to be writable in a parallel version. We still want to write this equation, because we want to parallelize our code, but we don't want to add more parameters; that doesn't make sense. So what we're going to do is create a vector b-tilde-1, which is going to be b1 repeated three times — sorry, repeated m times. [01:01:23] So we just keep the same number of parameters, but repeat them, in order to be able to write the code in parallel.
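The repeated-bias trick looks like this in NumPy (illustrative numbers): adding the 3-by-1 vector b gives the same result as adding the explicitly repeated b-tilde.

```python
import numpy as np

m = 5                                      # batch size (illustrative)
Z = np.arange(3 * m).reshape(3, m).astype(float)  # stands in for W1 @ X
b = np.array([[10.0], [20.0], [30.0]])     # the 3-by-1 parameter vector b1

# b-tilde: b repeated m times into a 3-by-m matrix
b_tilde = np.tile(b, (1, m))

# Broadcasting does the repetition implicitly: same result, without
# storing (or learning) any extra parameters
assert np.array_equal(Z + b, Z + b_tilde)
```

Only three bias values exist either way; the repetition is purely a bookkeeping device so the batched equation type-checks.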
This is called broadcasting, and what is convenient — for those of you who do the homeworks in MATLAB or Python... MATLAB? OK, so no MATLAB; Python? Python. So in Python there is a package that is often used to code these equations: it's NumPy — some people call it "numpee," not sure. So NumPy — basically, Numerical Python — will directly do the broadcasting. It means that if you sum this 3 by m matrix with a 3 by 1 parameter vector, it's going to automatically reproduce the parameter vector m times so that the equation works. That's called broadcasting — does it make sense? So, because we're using this technique, we're able to rewrite all these equations with capital letters. Do you want to do it together, or do you want to do it on your own? Who wants to do it on their own? OK, so do it on your own:
rewrite these with capital letters and figure out the shapes. I think you can do it at home — we're not going to do it here — but make sure you understand all the shapes. Yeah? [01:03:05] So the question is: how is this different from principal component analysis? This is a supervised learning algorithm that will be used to predict the price of a house. Principal component analysis doesn't predict anything: it takes an input matrix X, normalizes it, computes the covariance matrix, and then figures out what the principal components are by doing the eigenvalue decomposition. The outcome of PCA is that you know the most important features of your data set X are going to be these features. Here, we're not looking at the features; we're only looking at the output — that's what is important to us. [01:03:57] So the question is: can you explain why the first layer would see the edges — is there any intuition behind it? It's not always going to see the edges, but
it's often going to see edges. Because, in order to detect a human face — let's say you train an algorithm to find out whose face it is, so it has to understand faces very well — you need the network to be complex enough to understand very detailed features of the face. And usually what this neuron sees as input are pixels; it means every edge here is the multiplication of a weight by a pixel. So it sees pixels. It cannot understand the face as a whole, because it sees only pixels — very granular information for it. So it's going to check if pixels nearby have the same color, and understand that there is an edge there, OK? But it's too complicated to understand the whole face in the first layer. However, if it understands a little more than pixel-level information, it can give that to the next neuron. That neuron will receive more than pixel
information — it will receive something a little more complex, like edges — and then it will use this information to build on top of it and build up the features of the face. So what I'm trying to sum up is that these first neurons only see the pixels, so they're not able to build more than the edges; that's the maximum thing they can build. And it's a complex topic — interpretation of neural networks is a very highly researched topic, a big research topic — so nobody has figured out exactly how all the neurons evolve. Yeah, one more question and then we move on. [01:05:50] So the question is: how do you decide how many neurons per layer, how many layers — what's the architecture of the neural network? There are two things to take into consideration, I would say. First, nobody knows the right answer, so you have to test it. You guys talked about the training set,
validation set, and test set. So what we would do is try, say, 10 different architectures, train the network on each of them, look at the validation-set accuracy of all of them, and decide which one seems to be the best. That's how we figure out the right network size. On top of that, using experience is often valuable. So if you give me a problem, I always try to gauge how complex the problem is. Take cat classification: do you think it's easier or harder than day-and-night classification? Day-and-night classification is: I give you an image and ask you to predict if it was taken during the day or during the night; on the other hand, you want to know whether there is a cat in the image or not. Which one is easier, which one is harder? [01:06:54] Who thinks cat classification is harder? OK — I think cat classification seems harder. Why? Because there are many breeds of cats; cats can look like different
things — there are not many breeds of nights. One thing that might be challenging in day-and-night classification is if you also want it to work indoors: you know, maybe there is a tiny window there, and I'm able to tell that it's daytime, but for a network to understand that, you would need a lot more data than if you only wanted it to work outdoors. So these problems all have their own complexity, and based on their complexity, I think the network should be deeper: the more complex the problem usually is, the more data you need in order to figure out the output, and the deeper the network should be. That's an intuition, I think. OK, let's move on, guys, because I think we have about 12 more minutes. [01:07:57] OK, let's try to write the loss function for this problem. So now that we have our network, we have written these propagation equations, and I will call it the forward-propagation phase: going forward
means going from the input to the output. Later on, when we derive these equations, we will call that backward propagation, because we're starting from the loss and going backwards. So let's talk about the optimization problem: optimizing w1, w2, w3, b1, b2, and b3. We have a lot of stuff to optimize, right? We have to find the right values for these — and remember, model equals architecture plus parameters: we have our architecture, so if we have our parameters, we're done. So in order to do that, we have to define an objective function, sometimes called a loss, sometimes a cost function. Usually we would call it a loss if there is only one example in the batch, and a cost if there are multiple examples in the batch. So let's define the cost function. The cost function J depends on y-hat and y — OK, and y-hat is a3. [01:09:54] It depends
on y-hat and y, and we will set it to be the sum of the loss functions L(i); and I will normalize it — it's not mandatory, but normalize it — with one over m. So what this means is that we're going for batch gradient descent: we want to compute the loss function for the whole batch, parallelize our code, and then calculate the cost function, which will then be differentiated to give us the direction of the gradient — that is, the average direction of all the derivatives with respect to the whole input batch. And L(i) will be the loss function corresponding to one parameter — sorry, not parameter, one input: what's the error on this one specific input? And it will be the logistic loss. [01:11:10] You've already seen these equations, I believe.
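The cost just described — per-example logistic losses summed and normalized by 1/m — can be sketched as follows (the labels and outputs here are made-up illustrative numbers; in the lecture's notation, Y_hat would be the batch of a3 values):

```python
import numpy as np

def logistic_loss(y_hat, y):
    # Per-example logistic loss: -[y*log(y_hat) + (1-y)*log(1-y_hat)]
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cost(Y_hat, Y):
    # Cost J: the per-example losses summed, normalized by 1/m
    m = Y.shape[1]
    return np.sum(logistic_loss(Y_hat, Y)) / m

Y = np.array([[1.0, 0.0, 1.0]])          # ground-truth labels, batch of m = 3
Y_hat = np.array([[0.9, 0.2, 0.6]])      # network outputs for the batch
J = cost(Y_hat, Y)
```

Confident, correct predictions (y_hat near y) contribute almost nothing to J; confident, wrong ones blow it up — which is what makes it a sensible objective to drive toward zero.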
[01:11:38] Which one is the hardest? Who thinks J is the hardest? Who thinks it doesn't matter? It doesn't matter, because differentiation is a linear operation: you can just take the derivative inside the sum, and you'll see that if you know the derivative of L, you just have to take the sum over it. So instead of computing our derivatives on J, we will compute them on L, but it's totally equivalent; there's just one more step at the end. [01:12:15] Okay, so now we defined our loss function. Super. The next step is to optimize, so we have to compute a lot of derivatives, and that's called backward propagation. [01:12:51] So the question is, why is it called backward propagation? It's because what we want to do, ultimately, is this: for every l = 1 to 3,

W[l] := W[l] - alpha * dJ/dW[l]
b[l] := b[l] - alpha * dJ/db[l]

[01:13:29] We want to do that for every parameter in layers 1, 2, and 3, so it means we have to compute all these derivatives: the derivative of the cost with respect to W[1], W[2], W[3], b[1], b[2], b[3]. You've done it with logistic regression; we're going to do it with a neural network, and you're going to understand why it's called backward propagation. [01:13:53] Which derivative do you want to start with: the derivative with respect to W[1], W[2], or W[3]? (We'll do the biases later.) Do you think W[1] is a good idea? I don't want to do W[1]; I think we should do W[3]. And the reason is: if you look at this loss function, do you think the relation between W[3] and the loss function is easier to understand, or the relation between W[1] and the loss function? It's the relation between W[3] and this loss function.
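In code, the update rule above amounts to the following sketch (parameter shapes chosen to match a small 3-2-1 network; the gradients here are placeholders standing in for what backpropagation will produce):

```python
import numpy as np

def gradient_descent_step(params, grads, alpha):
    # W[l] := W[l] - alpha * dJ/dW[l] and b[l] := b[l] - alpha * dJ/db[l]
    # for every layer l = 1, 2, 3.
    for l in (1, 2, 3):
        params[f"W{l}"] -= alpha * grads[f"dW{l}"]
        params[f"b{l}"] -= alpha * grads[f"db{l}"]
    return params

params = {"W1": np.ones((3, 4)), "b1": np.zeros((3, 1)),
          "W2": np.ones((2, 3)), "b2": np.zeros((2, 1)),
          "W3": np.ones((1, 2)), "b3": np.zeros((1, 1))}
# Placeholder gradients; in reality these come out of backward propagation.
grads = {"d" + k: 0.5 * np.ones_like(v) for k, v in params.items()}
params = gradient_descent_step(params, grads, alpha=0.1)
```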
[01:14:35] Because W[3] happens much later in the network. If you want to understand how much we should move W[1] in order to make the loss move, it's much more complicated than answering how much W[3] should move to move the loss, because there are many more connections on the path if you want to compute it for W[1]. [01:14:53] So that's why we call it backward propagation: we will start with the top layer, the one that's closest to the loss function, and compute the derivative of J with respect to W[3]. [01:15:09] And once we've computed this derivative, which we are going to do next week, once we've computed this number, we will be able to compute the next one very easily. Why very easily? Because we can use the chain rule of calculus. So let's see how it works; I'm just going to give you a one-minute pitch on backprop, but we'll do it next week together. [01:15:44] If we had to compute this derivative, what I would do is separate it into several derivatives that are easier. I will separate it into the derivative of J with respect to something, times the derivative of that something with respect to W[3]. And the question is, what should this something be? I look at my equations: I know that J depends on y hat, I know that y hat depends on z[3] (y hat is the same thing as a[3]), and I also know that z[3] depends on W[3], and the derivative of z[3] with respect to W[3] is super easy; it's just a[2] transpose. So I can say this derivative is the same as:

dJ/dW[3] = dJ/da[3] * da[3]/dz[3] * dz[3]/dW[3]

[01:16:52] So you see: same derivative, calculated in a different way, and I know each of these factors is pretty easy to compute. That's why we call it backpropagation: because we use the chain rule to compute the derivative with respect to W[3]. [01:17:06] And then, when I want to do it for W[2], I'm going to insert the derivative with respect to z[3], times the derivative of z[3] with respect to a[2], times the derivative of a[2] with respect to z[2], times the derivative of z[2] with respect to W[2]:

dJ/dW[2] = dJ/dz[3] * dz[3]/da[2] * da[2]/dz[2] * dz[2]/dW[2]

Does this make sense, that this thing here is the same as before? It means that if I want to compute the derivative with respect to W[2], I don't need to compute the first part anymore; I already did it for W[3]. I just need to compute the remaining factors, which are easy ones. [01:17:53] And so on: if I want to compute the derivative of J with respect to W[1], I'm not going to decompose the whole thing again. I'm just going to take the derivative of J with respect to z[2].
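To see that the chain-rule factorization really equals the original derivative, here is a scalar toy version of the three-layer network (every weight is a single number and every activation is a sigmoid; this is an illustrative sketch, not the lecture's full matrix derivation). With the logistic loss and a sigmoid output, the first two factors collapse to dL/dz3 = a3 - y, so dL/dw3 = (a3 - y) * a2, which we can check against a numerical derivative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, w1, b1, w2, b2, w3, b3):
    # Scalar forward pass: z[l] = w[l]*a[l-1] + b[l], a[l] = sigmoid(z[l]).
    a1 = sigmoid(w1 * x + b1)
    a2 = sigmoid(w2 * a1 + b2)
    a3 = sigmoid(w3 * a2 + b3)
    return a1, a2, a3

def loss(a3, y):
    return -(y * np.log(a3) + (1 - y) * np.log(1 - a3))

x, y = 0.5, 1.0
w1, b1, w2, b2, w3, b3 = 0.3, 0.1, -0.4, 0.2, 0.7, -0.1

# Chain rule: dL/dw3 = (dL/da3) * (da3/dz3) * (dz3/dw3) = (a3 - y) * a2
a1, a2, a3 = forward(x, w1, b1, w2, b2, w3, b3)
dL_dw3 = (a3 - y) * a2

# Numerical derivative of the same quantity (central difference)
h = 1e-6
numeric = (loss(forward(x, w1, b1, w2, b2, w3 + h, b3)[2], y)
           - loss(forward(x, w1, b1, w2, b2, w3 - h, b3)[2], y)) / (2 * h)
```

The same caching idea applies to w2 and w1: once dL/dz3 is known, the deeper derivatives only add cheap local factors.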
[01:18:07] That derivative of J with respect to z[2] is equal to the whole previous expression, and then I'm going to multiply it by the derivative of z[2] with respect to a[1], times the derivative of a[1] with respect to z[1], times the derivative of z[1] with respect to W[1]:

dJ/dW[1] = dJ/dz[2] * dz[2]/da[1] * da[1]/dz[1] * dz[1]/dW[1]

And again, the first factor I know already; I computed it previously, just not for this one. So what's interesting about it is that I'm not going to redo the work I did: I'm just going to store the right values while backpropagating and continue differentiating. [01:18:43] One thing you need to notice, though, is that you need the forward propagation equations in order to remember which path to take in your chain rule. This derivative of J with respect to W[3]: I cannot reuse it as it is, because W[3] is not connected to the previous layer. If you look at the equations, a[2] doesn't depend on W[3]; it depends on z[3]... sorry, my bad,
no, sorry, what I wanted to say [01:19:14] is that z[2] is connected to W[2], but a[1] is not connected to W[2]. So you want to choose the path you're going through in the proper way, so that none of the factors in these derivatives is ill-defined: you cannot compute the derivative of W[2] with respect to a[1]. You cannot compute that; you don't know it. [01:19:50] Okay, so I think we're done for today. One thing I'd like you to do, if you have time, is to think about the things that can be tweaked in a neural network. When you build a neural network, you are not done: you have to tweak it. You have to tweak the activations, you have to tweak the loss function; there are many things you can tweak, and that's what we're going to see next week. Okay, thanks. ================================================================================ LECTURE 012 ================================================================================ Lecture 12 - Backprop & Improving Neural Networks | Stanford CS229: Machine Learning (Autumn 2018) Source: https://www.youtube.com/watch?v=zUazLXZZA2U --- Transcript [00:00:04] Hi everyone, welcome to the
second lecture on deep learning for CS229. A quick announcement before we start: there is a Piazza post, number 695, which is the mid-quarter survey for CS229, so fill it in when you have time. [00:00:23] Okay, so let's get back to deep learning. Last week together we saw what a neural network is. We started by defining logistic regression from a neural network perspective: we said that logistic regression can be viewed as a one-neuron neural network, where there is a linear part and an activation part, which was the sigmoid in that case. We've seen that the sigmoid is a common activation function for classification tasks, because it maps a number between minus infinity and plus infinity into the zero-one interval, which can be interpreted as a probability. [00:01:05] And then we introduced the neural network: we started to stack some neurons inside a layer, and then to stack layers on top of each other. And we said that the more we stack layers, the more parameters we have, and the more parameters we have, the more our network is able to capture the complexity of our data, because it becomes more flexible. [00:01:28] So we stopped at the point where we did a forward propagation: we had an example during training that was forward propagated through the network, we got the output, then we computed the cost function, which compares this output to the ground truth, and we were in the process of backpropagating the error, to tell our parameters how they should move in order to detect cats more properly. Does that make sense, all this part? [00:01:51] So today we're going to continue that. We're in the second part, neural networks: we're going to derive backpropagation with the chain rule, and after that we're going to talk about how to improve our neural networks.
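The recap above, a flattened input pushed through three stacked sigmoid layers (3 neurons, then 2, then 1), can be sketched like this (tiny input size and random initialization chosen purely for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params):
    # Layer l: z[l] = W[l] a[l-1] + b[l], a[l] = sigmoid(z[l]), with a[0] = x.
    a = x
    for l in (1, 2, 3):
        a = sigmoid(params[f"W{l}"] @ a + params[f"b{l}"])
    return a  # a[3] = y_hat, a number in (0, 1)

rng = np.random.default_rng(0)
n = 4  # flattened image size (tiny, for illustration)
params = {"W1": rng.normal(size=(3, n)), "b1": np.zeros((3, 1)),
          "W2": rng.normal(size=(2, 3)), "b2": np.zeros((2, 1)),
          "W3": rng.normal(size=(1, 2)), "b3": np.zeros((1, 1))}
y_hat = forward(rng.normal(size=(n, 1)), params)
```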
[00:02:03] Because in practice, it's not because you designed your neural network that it's going to work: there are a lot of hacks and tricks that you need to know in order to make a neural network work. Okay, let's go. [00:02:19] So the first thing we talked about: in order to define our optimization problem and find the right parameters, we need to define a cost function, and usually we said we use the letter J to denote the cost function. Here, when I talk about the cost function, I'm talking about a batch of examples: it means I'm forward propagating m examples at a time. Do you remember why we do that? What's the reason we use a batch instead of a single example? Vectorization: we want to use what our GPU can do and parallelize the computation. [00:02:55] So that's what we do: we have m examples that get forward propagated through the network, and each of them has a loss function associated with it; the average of the loss functions over the batch gives us the cost function. [00:03:13] And we had defined this loss function together, L(i). Just as a reminder, we're still in this network where we had a cat: remember, x1 through xn, the cat was flattened, the RGB matrix into one vector, and then there was a neural network with three neurons, then two neurons, then one neuron, fully connected. [00:03:56] So now we're here: we take m images of cats or non-cats, forward propagate everything through the network, compute a loss function for each of them, average them, and get the cost function. And our loss function was the binary cross-entropy, also called the logistic loss. It was the following:

L(i) = -[ y(i) log y_hat(i) + (1 - y(i)) log(1 - y_hat(i)) ]

[00:04:30] So let me circle this one; it's an important one.
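The circled loss in code, with the leading minus sign made explicit (a sketch; y_hat must lie strictly between 0 and 1 so the logarithms are defined):

```python
import numpy as np

def logistic_loss(y_hat, y):
    # Binary cross-entropy: L = -[ y*log(y_hat) + (1 - y)*log(1 - y_hat) ]
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# A confident correct prediction is barely penalized;
# a confident wrong one is penalized heavily.
low = logistic_loss(0.99, 1)   # true cat, predicted cat
high = logistic_loss(0.01, 1)  # true cat, predicted non-cat
```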
[00:04:36] And what we said is that this network has many parameters: the first layer has W[1], b[1], the second layer has W[2], b[2], and the third layer has W[3], b[3], where the square brackets denote the layer. And we have to train all these parameters. [00:05:02] One thing we noticed is that, because we want to make good use of the chain rule, we're going to start by computing the derivatives of these guys, W[3] and b[3], then come back and do W[2] and b[2], and then back again W[1] and b[1], in order to use our gradient descent update formulas, where:

W[l] := W[l] - alpha * dJ/dW[l]

for any layer l between 1 and 3, and the same for b. [00:05:42] Okay, so let's try to do it. This is the first number we want to compute, dJ/dW[3]. And remember, the reason we want to compute the derivative of the cost with respect to W[3] first is that the relationship between W[3] and the cost is easier than the relationship between W[1] and the cost, because W[1] has many more connections going through the network before ending up in the cost computation. [00:06:13] One thing we should notice before starting this calculation is that the derivative is linear: if I take the derivative of J, I can just take the derivative of L and it's the same thing, I just need to add the summation afterwards, because differentiation is a linear operation. Does that make sense to everyone? So instead of computing the derivative of J, I'm going to compute the derivative of L, and then I will add the summation; it just makes our notation easier. [00:06:45] So I'm taking the derivative of the loss of one example propagated through the network, with respect to W[3]. Let's do the calculation together. I have minus y(i) times the derivative with respect to W[3] of... what? Remember that y hat was equal to sigma(W[3] x + b), or rather sigma(W[3] a[2] + b[3]), because a[2], the output of the second layer, is the input to the third layer. So I write it down here: log sigma(W[3] a[2] + b[3]); sorry, I had forgotten the logarithm. [00:08:00] Okay, so we have this term, and then we have the second term, which is plus (1 - y(i)) times the derivative with respect to W[3] of log(1 - sigma(W[3] a[2] + b[3])). Altogether:

dL/dW[3] = -[ y(i) * d/dW[3] log sigma(W[3] a[2] + b[3]) + (1 - y(i)) * d/dW[3] log(1 - sigma(W[3] a[2] + b[3])) ]

[00:08:31] Just a reminder, the reason we have this is that we wrote the forward propagation in the previous class: we had z[3], which took a[2] as input and computed the linear part, and the sigmoid is the activation function used in the last neuron. [00:08:50] So let's try to compute this derivative. The derivative of the log: log'(x) = 1/x. So I take 1 over sigma(W[3] a[2] + b[3]), and I know this thing can be written a[3], so I will just write 1/a[3] instead of writing the sigmoid again, times the derivative of a[3] with respect to W[3]. Remember: if we take the derivative of log sigma(...) with respect to W, what we have is 1 over the sigmoid, times the derivative of the sigmoid with respect to W[3]. Does that make sense? That's what we're using here. [00:10:03] And the derivative of the sigmoid is actually pretty easy to compute:

sigma'(x) = sigma(x) * (1 - sigma(x))

So taking the derivative is going to give me a[3] times (1 - a[3]). There is still one step, because there is a composition of three functions here.
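The identity sigma'(x) = sigma(x)(1 - sigma(x)) used in this step is easy to verify numerically (a quick sketch):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 101)
analytic = sigmoid(x) * (1 - sigmoid(x))               # sigma'(x) = sigma(x)(1 - sigma(x))
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)  # central difference
```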
[00:10:33] There's a logarithm, there's a sigmoid, and there's also a linear function, W a[2] + b. So I also need to take the derivative of the linear part with respect to W[3], because to differentiate sigma(W[3] a[2] + b[3]) with respect to W[3], I need to go inside and take the derivative of what's inside. So this will give me a[3] times (1 - a[3]) times the derivative of the linear part with respect to W[3], and the derivative of the linear part with respect to W[3] is equal to a[2] transpose. [00:11:42] One thing you may want to check, when I'm trying to compute this derivative: why is there a transpose that comes out? How do you come up with that? You look at the shapes. [00:12:12] What's the shape of W[3]? Someone remembers? One by two. Why one by two? Because it's connecting two neurons to one neuron, so it has to be 1 by 2. And to convince yourself, you can write out your forward propagation, do the shape analysis, and find out that it's a 1 by 2 matrix. [00:12:53] How about this thing, z[3]: what's the shape of that? A scalar, so it's 1 by 1. How do you know? Because z[3] is the linear part of the last neuron, and a[3], we know, is y hat, a scalar between zero and one; so z[3] has to be a scalar as well, because taking the sigmoid should not change the shape. [00:13:16] So now the question is: what's the shape of this entire derivative? The shape of this entire thing should be the shape of W[3], because you're taking the derivative of a scalar with respect to something higher-dimensional,
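The shape analysis being developed here can be checked mechanically (toy numbers; the shapes follow the lecture's two-neurons-into-one-neuron last layer):

```python
import numpy as np

W3 = np.array([[0.5, -0.3]])   # 1x2: connects two neurons to one neuron
a2 = np.array([[0.2], [0.7]])  # 2x1: output of the previous layer
b3 = np.array([[0.1]])         # 1x1

z3 = W3 @ a2 + b3  # linear part of the last neuron: a 1x1 scalar
dz3_dW3 = a2.T     # d z3 / d W3 = a2 transpose, shape 1x2, same as W3
```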
[00:13:37] dimensional matrix or vector here called a row vector then it means that the [00:13:40] a row vector then it means that the shape of this has to be the same shape [00:13:41] shape of this has to be the same shape of w3 so 1 by 2 and you know that when [00:13:45] of w3 so 1 by 2 and you know that when you take this simple derivative in in [00:13:48] you take this simple derivative in in real like in with scalars not with high [00:13:51] real like in with scalars not with high dimensional you know that this is an [00:13:53] dimensional you know that this is an easy derivative it just should it should [00:13:55] easy derivative it just should it should give you a 2 right but in higher [00:13:58] give you a 2 right but in higher dimensions sometimes you have transposed [00:14:00] dimensions sometimes you have transposed that come up and [00:14:01] that come up and you know that the answer is a to [00:14:03] you know that the answer is a to transpose is because you know that a 2 [00:14:05] transpose is because you know that a 2 is a 2 by 1 matrix so this is not [00:14:09] is a 2 by 1 matrix so this is not possible it's not possible to get a 2 [00:14:13] possible it's not possible to get a 2 because otherwise it wouldn't match the [00:14:14] because otherwise it wouldn't match the derivative that you're calculating so it [00:14:16] derivative that you're calculating so it has to be a 2 transpose so either you [00:14:18] has to be a 2 transpose so either you you learn the formula by heart or you [00:14:21] you learn the formula by heart or you you learn how to analyze shapes ok any [00:14:25] you learn how to analyze shapes ok any questions on that so that's why it's a 2 [00:14:31] questions on that so that's why it's a 2 transpose now minus y I so I'm I'm on [00:14:42] transpose now minus y I so I'm I'm on this one now the second term of the of [00:14:45] this one now the second term of the of the derivative and I take the derivative [00:14:47] the derivative 
[00:14:47] And I take the derivative of this: I get 1 over (1 - a3) — a3 denotes the sigmoid, so I'm just copying this back, using the fact that the derivative of the logarithm is 1/x. Then I multiply this by the derivative of (1 - a3) with respect to W3. I know there's a minus sign that needs to come out, so I write a -1 down here; I also have the derivative of the sigmoid with respect to what's inside it, which is a3 times (1 - a3). And what's the last term? The last term is simply the one we just talked about: the derivative of what's inside the sigmoid with respect to W3, so it's a2 transpose again.

[00:15:43] So now I will just simplify. I know this scalar cancels with this one, and this one cancels with that one. I'm going to copy back all the results: y^(i) times (1 - a3) times a2 transpose, plus (1 - y^(i)) times a3 times a2 transpose with a minus — I'm taking the minus and putting it in front. And quickly looking at that, I see that some of the terms will cancel out: the -y^(i) a3 a2-transpose piece cancels with the +y^(i) a3 a2-transpose piece — the terms multiplying this number cancel with the terms multiplying that number. Continuing, this gives me y^(i) times a2 transpose from this part, minus a3 times a2 transpose. I can factor this, because I have the same term a2 transpose in both, and it finally gives (y^(i) - a3) times a2 transpose. So it doesn't look that bad, actually. When we take the derivative of something kind of ugly, we expect something ugly to come out, but this doesn't seem too bad.

[00:17:45] Any questions on that? I'll let you write it down quickly, and then we're going to move to the rest. So once I get this result, I can just write down the cost derivative with respect to W3: I just need to take the summation of this thing, (y^(i) - a3) times a2 transpose, and I have a minus sign coming up front. So that's my derivative. So we're done with that, and we can just take this formula, plug it back into our gradient descent update rule, and update W3.

[00:18:35] Now, you can do the same thing as we just did with b3 — it's going to be a similar difficulty. We're going to do it with W2 now; think about how that backpropagates to W2. So now it's W2's turn: we want to compute the derivative of L, the loss, with respect to W of the second layer. The question is how I'm going to get this one without too much work. I'm not going to start over here; as we said last time, I'm going to use the chain rule of calculus, and try to decompose this derivative into several derivatives.
[00:19:22] I know that y-hat is the first thing that is connected to the loss function — the output neuron is directly connected to the loss function — so I'm going to take the derivative of the loss function with respect to y-hat, also called a3. That's the easiest one I can calculate. I also know that a3, which is the output activation of the last neuron, is connected to the linear part of the last neuron, which is z3, so I can take the derivative of a3 with respect to z3. Do you remember what this is going to be? The derivative of a3 with respect to z3 is the derivative of the sigmoid. I know that a3 equals sigmoid of z3, so this derivative is very simple: it's just a3 times (1 - a3).

[00:20:16] So I'm going to continue. I know that z3 is equal to what? It's equal to W3 a2 plus b3. Which path do I need to take in order to backpropagate? I don't want to take the derivative with respect to W3, because I would get stuck; I don't want to take the derivative with respect to b3, because I would get stuck. I will take the derivative with respect to a2, because a2 will be connected to z2, z2 will be connected to a1, and I can backpropagate along this path. So I'm going to take the derivative of z3 with respect to a2 to have my error backpropagate, and so on: I know that a2 is equal to sigma of z2, so I'm just going to do the same there, and I know that this derivative is going to be easy as well. And finally, I also know that z2 is connected to W2, so I'm going to take the derivative of z2 with respect to W2.

[00:21:16] What I want you to get is the thought process of this chain rule. Why don't we take the derivative with respect to W3 or b3? Because we would get stuck. We want the error to backpropagate, and in order for the error to backpropagate, we have to go through variables that are connected to each other. Does it make sense?

[00:21:39] So now the question is: how can we use the derivative we already have in order to compute the derivative with respect to W2? Can someone tell me how we can use the results from the earlier calculation in order not to do it again? [Student: "You cache it."] So there's another discussion about caching, which is correct — in order to get this result very quickly, we will use a cache — but what I want here is for you to tell me whether that result appears somewhere here. [Student: "The first three terms."] This one, this one, and this one — is it the first two terms or the first three terms? It's the first two terms here, but good intuition. So that result is actually the first two terms here; we just calculated it.
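The path just traced, written out as one chain-rule product (board notation, with a3 standing for y-hat; the annotated factors are the ones the lecture has already evaluated):

```latex
\frac{\partial L}{\partial W_2}
  = \frac{\partial L}{\partial a_3}
    \cdot \underbrace{\frac{\partial a_3}{\partial z_3}}_{a_3(1-a_3)}
    \cdot \frac{\partial z_3}{\partial a_2}
    \cdot \underbrace{\frac{\partial a_2}{\partial z_2}}_{a_2(1-a_2)}
    \cdot \frac{\partial z_2}{\partial W_2}
```

Routing through a2 rather than W3 or b3 is what keeps the error flowing backward: each factor links two variables that are directly connected in the network.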
[00:22:44] Well, how do we know that? It's not easy to see. One thing we know, based on what we've written very big on this board, is that the derivative of z3 — because this is it, right — the derivative of z3 with respect to W3 is a2 transpose. So I could write here that this thing is the derivative of z3 with respect to W3, correct? And I know that, because I wanted to compute the derivative of the loss with respect to W3, I could have written the derivative of the loss with respect to W3 as the derivative of the loss with respect to z3, times the derivative of z3 with respect to W3, correct? And I know that this second factor is a2 transpose, so it means that this remaining thing is the derivative of the loss with respect to z3. Does it make sense? So I got my decomposition of the derivative: if we had wanted to use the chain rule from here on, we could have just separated it into two terms and taken the derivative from there.
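The factor singled out here is dL/dz3, and matching it against (y - a3) a2-transpose says it should equal a3 - y. A quick numerical sanity check of that identity for the sigmoid-plus-cross-entropy pair, with made-up toy numbers:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(z3, y):
    # per-example cross-entropy, written in terms of the pre-activation z3
    a3 = sigmoid(z3)
    return -(y * np.log(a3) + (1 - y) * np.log(1 - a3))

z3, y = 0.7, 1.0                 # toy values
analytic = sigmoid(z3) - y       # the claimed dL/dz3 = a3 - y

eps = 1e-6                       # central finite difference on the loss
numeric = (loss(z3 + eps, y) - loss(z3 - eps, y)) / (2 * eps)
assert abs(analytic - numeric) < 1e-8
```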
[00:23:55] Okay. So I know the result of this thing: I know that it's basically (a3 - y) times a2 transpose — I just flipped it because of the minus sign.

[00:24:19] Okay, now tell me, what is this term? Let's go back — yeah, the sigmoid. I'm just going to write it: a2 times (1 - a2), if that makes sense — sigma times (1 minus sigma). And what is this term? Oh, sorry, my bad, that's not the right one — this one. This one is: a2 is sigmoid of z2, so this result comes from that term. What about this term? Is it W3 or — I heard "transpose" — how do we know if it's W3 or W3 transpose? So let's look at the shape of this. What's z3? It's 1 by 1, a scalar; it's the linear part of the last neuron. What's the shape of that? It's 2 by 1 — we have two neurons in the layer. And W3, we said, was a 1 by 2 matrix, so we have to transpose it: the result of that is W3 transpose.

[00:25:51] And how about the last term? Same as here, one layer before — someone said a1 transpose. Yep. [A student points out a stray transpose on the board.] Oh — yeah, you're correct, thank you, that's what you meant. This one was from dz3/dW3; we didn't end up using that, because we would get stuck, so there's no transpose needed here. Thanks.

[00:26:52] Any other questions or remarks? So that's cool — let's write down our derivative cleanly on the board. We have the derivative of our loss function with respect to W2, which seems to be equal to (a3 - y) from the first term; the second term seems to be equal to W3 transpose; then we have a term which is a2 times (1 - a2); and finally we have another term, which is a1 transpose. [00:28:01] So — are we done or not?
The thing is, there are two ways to compute derivatives: either you go very rigorously and do what we did here for W2, or you try to do a chain-rule analysis and fit the terms together. The problem is that this result is not completely correct — there is a shape problem. It means that when we took our derivatives, we should have flipped some of the terms. We won't have time to go into the details in this lecture, because we have other things to see, but there is a section note on the website, I think, which details the other method — the more rigorous one — like that, for all the derivatives. What we're going to see is how you can use the chain rule plus shape analysis to come up with the result very quickly.

[00:28:50] Okay, so let's analyze the shapes of all of that. We know that the first term is a scalar, so 1 by 1. We know that the second term is the transpose of a 1 by 2, so it's 2 by 1. And we know that this thing here, a2 times (1 - a2), is 2 by 1 — it's an element-wise product. And this one is a1 transpose — a 3 by 1, transposed, so it's 1 by 3. So there seems to be a problem here: there is no match between these two operations, for example. So the question is, how can we put everything together? If we do it very rigorously, we know how to put it together; if you're used to doing the chain rule, you can quickly do it — after some experience you will be able to fit all of these together. The important thing to know is that there is an element-wise product here, which is this one: every time you take the derivative of the sigmoid, it's going to end up being an element-wise product, and that's the case whatever activation function you're using.

[00:30:09] So the right result is this one. Here I have my element-wise product of a 2 by 1 with a 2 by 1, so it gives me a 2 by 1 column vector, and then I need something that is 1 by 1 and something that is 1 by 3. How do I know what I need to have? I know that the shape of this thing, the derivative with respect to W2, needs to be 2 by 3 — W2 is connecting three neurons to two neurons, so W2 has to be 2 by 3. In order to end up with that, I know that (a3 - y) has to come here, and a1 transpose comes again at the end, and here I get my correct answer.

[00:31:10] Don't worry if it's the first time you're doing the chain rule and it's going quickly — don't worry, read the lecture notes with the rigorous way of taking the derivative; it will make more sense. But I feel that, in practice, we usually don't compute these chain rules by hand anymore, because programming frameworks do it for us. It's important, though, to know at least how the chain rule decomposes, and also how to compute these derivatives, if you read research papers. Any questions on that?
Going back to the cache that was mentioned — why is the cache so important? That was your question as well, right? [Answering a question:] Yeah, it has to be — when you take the derivative of the sigmoid, you take the derivative with respect to every entry of the matrix, which gives you an element-wise product.

[00:32:09] So, going back to the cache: one thing is, it seems that during backpropagation there are a lot of terms that appear that were computed during forward propagation — a1 transpose, a2, a3 — all of these we have from the forward propagation. So if we don't cache anything, we have to recompute them. It means I'm going backwards, but then I realize, oh, I actually need a2, so I have to go forward again to get a2; I go backwards, I need a1, so I need to forward-propagate my x again to get a1. I don't want to do that. So in order to avoid it, when I do my forward propagation, I keep in memory almost all the values that I'm getting.
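A sketch of a forward pass that stores such a cache. Sizes follow the lecture's 3-2-1 layer structure; the input size of 2 is made up for the sketch, and the dictionary layout is just one way to organize it:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params, n_layers=3):
    """Forward pass that keeps everything backprop will need."""
    cache = {"a0": x}
    for l in range(1, n_layers + 1):
        z = params["W%d" % l] @ cache["a%d" % (l - 1)] + params["b%d" % l]
        cache["z%d" % l] = z              # linear parts z1..z3
        cache["a%d" % l] = sigmoid(z)     # activations a1..a3
    return cache["a%d" % n_layers], cache

rng = np.random.default_rng(0)
n_in = 2  # input dimension (made up for this sketch)
params = {
    "W1": rng.standard_normal((3, n_in)), "b1": rng.standard_normal((3, 1)),
    "W2": rng.standard_normal((2, 3)),    "b2": rng.standard_normal((2, 1)),
    "W3": rng.standard_normal((1, 2)),    "b3": rng.standard_normal((1, 1)),
}
x = rng.standard_normal((n_in, 1))
y_hat, cache = forward(x, params)
# backprop now reads a1, a2, a3 and z1..z3 (plus the W's from params)
# from the cache instead of re-running the forward pass each time
assert y_hat.shape == (1, 1) and cache["a2"].shape == (2, 1)
```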
including the [00:32:50] values that I'm getting including the W's because as you see to compute the [00:32:52] W's because as you see to compute the derivative of loss with respect W to we [00:32:54] derivative of loss with respect W to we need W 3 but also the activation or [00:32:59] need W 3 but also the activation or linear variables so I'm going to save [00:33:02] linear variables so I'm going to save them in my in my network during the for [00:33:06] them in my in my network during the for propagation in order to use it during [00:33:07] propagation in order to use it during the backward propagation that make sense [00:33:10] the backward propagation that make sense and again it's all for computation [00:33:14] and again it's all for computation efficiency it has some memory cost [00:33:22] okay so that was the backpropagation [00:33:25] okay so that was the backpropagation and now I can use my formula of the cost [00:33:30] and now I can use my formula of the cost with respect to the last function and I [00:33:37] with respect to the last function and I know that this is going to be my update [00:33:43] this is going to be used in order to [00:33:45] this is going to be used in order to update w2 and I will do the same for w1 [00:33:48] update w2 and I will do the same for w1 then you guys can do it at home if you [00:33:51] then you guys can do it at home if you want to make sure you understood take [00:33:52] want to make sure you understood take the derivative with respect to w1 okay [00:34:02] the derivative with respect to w1 okay so let's move on to the next part which [00:34:05] so let's move on to the next part which is improving your neural network so in [00:34:16] is improving your neural network so in practice when you when you do this [00:34:18] practice when you when you do this process of training for propagation [00:34:20] process of training for propagation backward propagation updates you don't [00:34:23] backward propagation updates you 
[00:34:26] In order to get a good network, you need to improve it: you need to use a bunch of techniques that will make your network work in practice. The first trick is to use different activation functions.

[00:34:45] Together we've seen one activation function, which was the sigmoid, and we remember the graph of the sigmoid: it's taking a number between minus infinity and plus infinity and casting it between zero and one. We know that the formula is sigmoid(z) = 1 / (1 + e^(-z)), and we also know that the derivative of the sigmoid is sigmoid(z) times (1 - sigmoid(z)).

[00:35:19] Another very common activation function is ReLU — we talked quickly about it last time: ReLU(z) is equal to 0 if z is less than zero, and z if z is positive. The graph of ReLU looks something like this.

[00:35:51] And finally, another one we were using commonly as well is tanh, the hyperbolic tangent: tanh(z) equals (e^z - e^(-z)) / (e^z + e^(-z)). The derivative of tanh is 1 - tanh²(z), and the graph looks kind of like the sigmoid, but it goes between minus one and plus one.

[00:36:40] So, one question, now that I've given you three activation functions: can you guess why we would use one instead of the other, and which one has more benefits? When I talk about activation functions, I'm talking about the functions that you will put in these neurons after the linear parts. What do you think is the main advantage of the sigmoid? [Student answer.] Yeah — you use it for classification; it gives you a probability. What's the main disadvantage of the sigmoid? [Student: "It's easy."] That should be an advantage — that should be a benefit. [Another answer.] Yeah, correct: at high activations — if you're at high z's or low z's — your gradient is very close to zero.
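The three activations and their derivatives from the board, as numpy one-liners — a sketch; evaluating the derivatives at a few points previews the saturation point just raised:

```python
import numpy as np

def sigmoid(z):   return 1.0 / (1.0 + np.exp(-z))
def d_sigmoid(z): s = sigmoid(z); return s * (1.0 - s)

def d_tanh(z):    return 1.0 - np.tanh(z) ** 2   # tanh itself is np.tanh

def relu(z):      return np.maximum(0.0, z)
def d_relu(z):    return (z > 0).astype(float)   # indicator 1{z > 0}

z = np.array([-10.0, 0.0, 10.0])
print(d_sigmoid(z))  # tiny at large |z|: the sigmoid saturates
print(d_tanh(z))     # saturates the same way, just over (-1, 1)
print(d_relu(z))     # exactly 1 for any positive z, 0 otherwise
```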
[00:37:51] So look here: based on this graph we know that if z is very big, our gradient is going to be very small; the slope of this graph is very, very small, it's almost flat. Same for z's that are very low in the negatives, right? What's the problem with having low gradients? When I'm back-propagating, if the z I calculated was big, the gradient is going to be very small, and it will be super hard to update my parameters that are early in the network, because the gradient is just going to vanish. Does that make sense? So sigmoid is one of these activations which works very well in the linear regime but has trouble in the saturating regimes, because the network doesn't update the parameters properly; it goes very, very slowly. We're going to talk about that a little more. How about tanh? Very similar, right: high z's and low z's
[00:38:51] lead to saturation of a tanh activation. ReLU, on the other hand, doesn't have this problem. If z is very big in the positives there is no saturation; the gradient just passes through, and the gradient is one when we're here, right, the slope is equal to one. So it's actually just routing the gradient to some entry; it's not multiplying it by anything when you back-propagate. So you know these terms here, the a3 * (1 - a3) or a2 * (1 - a2) factors: if we use ReLU activations, we replace these with the derivative of ReLU, and the derivative of ReLU can be written as the indicator function of z being positive. You've seen indicator functions; this is equal to 1 if z is positive, 0 otherwise. Okay, so we will see why we use ReLU mostly. Yeah. You remember the house prediction
[00:40:05] example: in that case, if you only predict the price of a house based on some features, you would use ReLU, because you know that the output should be a positive number between 0 and plus infinity; it doesn't make sense to use tanh or sigmoid. Yeah, it doesn't really matter, I think. If I want my output to be between 0 and 1, I would use sigmoid; if I want my output to be between minus 1 and 1, I would use tanh. So you know, there are some tasks where the output is kind of a reward or a negative reward that you want to get, like in reinforcement learning; you would use tanh as an output activation, because minus 1 looks like a negative reward, plus 1 looks like a positive reward, and you want to decide what the reward should be. Good question: why do we consider these functions? We can actually consider any function apart from the identity function, so let's see why. Thanks for the
transition. [00:41:13] So why do we need activation functions? Let's assume that we have a network which is the same as before, so our network is three neurons casting into two neurons casting into one neuron, and we're trying to use activations all equal to identity functions, which means z is mapped to z. Let's try to derive the forward propagation: y_hat = a3 = z3 = W3 * a2 + b3. I know that a2 is equal to z2, because there is no activation, and z2 is equal to W2 * a1 + b2, so I can substitute here: W3 * (W2 * a1 + b2) + b3. I can continue: I know that a1 is equal to z1, and I know that z1 is W1 * x + b1. [00:43:25] In the end you get y_hat = W * x + b, where W equals W3 times W2 times W1, and b equals W3 times W2 times b1, plus W3 times b2, plus b3. So what's the insight here? It's that we need activation functions. The reason is, if you don't use activation functions, no matter how deep your network is, it's going to be equivalent to a linear
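To check the algebra, here is a small NumPy sketch of the 3-2-1 network above with identity activations (the specific weight values are made up), showing that the layer-by-layer computation matches the single collapsed linear map:

```python
import numpy as np

# Identity activations: y_hat = W3(W2(W1 x + b1) + b2) + b3 collapses to
# W x + b, with W = W3 W2 W1 and b = W3 W2 b1 + W3 b2 + b3.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=(3, 1))  # 2 inputs -> 3 neurons
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=(2, 1))  # 3 -> 2 neurons
W3, b3 = rng.normal(size=(1, 2)), rng.normal(size=(1, 1))  # 2 -> 1 neuron
x = rng.normal(size=(2, 1))

deep = W3 @ (W2 @ (W1 @ x + b1) + b2) + b3   # layer by layer
W = W3 @ W2 @ W1                             # collapsed weight
b = W3 @ W2 @ b1 + W3 @ b2 + b3              # collapsed bias
flat = W @ x + b
print(np.allclose(deep, flat))  # True: the deep network is just one linear map
```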
regression. [00:44:07] So the complexity of the network comes from the activation function. The reason we can understand it: if we're trying to detect cats, what we're trying to do is train a network that will mimic the formula for detecting cats. We don't know this formula, so we want to mimic it using a lot of parameters. If we just have a linear regression, we cannot mimic this, because we're going to look pixel by pixel and assign a weight to a certain pixel; if I give a new example, it's not gonna work anymore. Yeah, so I think that goes back to your question as well. So this is why we need activation functions. And then the question was: can we use different activation functions, and how do we put them inside a layer or inside neurons? There are more activation functions; I think in CS230 we go over a
few more, but not today. These have been designed with experience, so these are the ones that work better and let our networks train; there are plenty of other activation functions that have been tested. Usually you would use the same activation function inside every layer. It's for training; it doesn't have any special reason, I think. But when you have a network like that, you would call this layer a ReLU layer, meaning it's a fully connected layer with ReLU activation; this one a sigmoid layer, meaning it's a fully connected layer with the sigmoid activation; and the last one is sigmoid. I think people have tried a lot of things, putting different activations in different neurons in a layer, in different layers, and the consensus was using one activation in a layer, and also using one of
these three activations. Yeah, so if someone comes up with a better activation that is obviously helping train our models on different data sets, people would adopt it, but right now these are the ones that work better. [00:46:24] You know, last time we talked about hyperparameters a little bit; these are all hyperparameters, so in practice you're not going to choose these randomly. You're going to try a bunch of them and choose the ones that seem to help your model train. There are a lot of experimental results in deep learning, and we don't really understand fully why certain activations work better than others. Okay, let's move on. [00:47:14] Okay, let's go over initialization techniques. Let me use this board. So another trick that you can use in order to help your network train is initialization methods and normalization methods. [00:48:07] Earlier we talked about the fact that if z is too big, or z is too low in the negative
numbers, [00:48:13] it will lead to saturation of the network, so in order to avoid that you can use normalization of the input. Assume that you have a network where the data is two-dimensional: x1, x2 is your two-dimensional input. You can assume that (x1, x2) is distributed like this thing; if I plot x1 against x2 for a lot of data, I will get that type of graph. The problem is, when I do my W * x + b to compute my z1, if the x's are very big it will lead to very big z's, which will lead to saturated activations. In order to avoid that, one method is to compute the mean of this data using mu = (1/m) * sum over i of x^(i), where m is the number of examples you have in the training set. That just gives you the mean for x1 and the mean for x2. Then you would compute the operation x := x - mu, and you will get that type of plot if you re-plot the transformed data, let's say x1 tilde, x2 tilde. So here it's a little
better, but it's still not good. In order to solve the problem fully, you're going to compute sigma squared, which is basically the standard deviation squared, so the variance of the data, and then you will divide by sigma. [00:50:17] So you would do that, and you would make the transformation x := x / sigma, and it will give you a graph that is centered and standardized. So you usually prefer to work with centered data. Yeah, sorry? Oh yeah, yeah, sorry. Great. So if we subtract the mean of x1 and x2, it should look like this but be centered, okay, and then if you standardize it, it looks like something like that. So why is it better? Because if you look at your loss function now: before, the loss function would look like something like this, [00:51:33] and after normalizing the input, it may look like something like this. So what's the
difference between these two loss functions; why is this one easier to train? It's because if you have a starting point that is here, let's say, your gradient descent algorithm is going to go towards approximately the steepest slope. So you're going to go there, and then this one is going to go there, and then you're going to go there, and then you're going to go there, like that, and so on, until you end up at the right point. But the steepest slope in this loss contour is always pointing towards the middle, so if you start somewhere, you will directly go towards the minimum of your loss function. So that's why it's usually helpful to normalize. This is one method, and in practice the way you initialize your weights is very important. Yeah? [00:52:38] Yes, exactly. So here I used a very simple case, but you would divide element-wise by the sigma here, okay; like every entry of your matrix, you would divide it by the sigma.
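A toy numeric version of the contour argument (the quadratic losses, step sizes, and tolerance below are invented for the demo): gradient descent on an elongated bowl needs far more steps than on a round one, because the step size is limited by the steep direction while the shallow direction crawls.

```python
import numpy as np

def gd_steps(curvatures, lr, tol=1e-3, max_iter=10000):
    # Gradient descent on f(w) = 0.5 * sum(c_i * w_i^2); the gradient is c * w.
    w = np.array([1.0, 1.0])
    for k in range(max_iter):
        if np.linalg.norm(w) < tol:
            return k
        w = w - lr * curvatures * w
    return max_iter

# Elongated contours (like unnormalized inputs): small step forced by the steep axis.
steps_elongated = gd_steps(np.array([1.0, 100.0]), lr=0.01)
# Round contours (like normalized inputs): one well-sized step heads straight in.
steps_round = gd_steps(np.array([1.0, 1.0]), lr=1.0)
print(steps_elongated, steps_round)
```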
[00:53:01] One other thing that is important to notice: this sigma and mu are computed over the training set. You have a training set; you compute the mean of the training set and the standard deviation of the training set, and this sigma and mu have to be used on the test set as well. It means, now that you want to test your algorithm on the test set, you should not compute the mean of the test set and the standard deviation of the test set and normalize your test input through the network. Instead, you should use the mu and the sigma that were computed on the train set, because your network is used to seeing this type of transformation as an input. You want the distribution of the input at the first layer to be always the same, no matter if it's the train or the test set. [00:53:48] Yeah, likely this leads to fewer iterations. Okay, we have a lot to see, so I will
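A short sketch of this normalization recipe (the data here is made up; the key point is that mu and sigma come from the training set only):

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = rng.normal(loc=5.0, scale=3.0, size=(1000, 2))  # made-up train data
x_test = rng.normal(loc=5.0, scale=3.0, size=(200, 2))    # made-up test data

mu = x_train.mean(axis=0)     # per-feature mean, computed on the train set only
sigma = x_train.std(axis=0)   # per-feature standard deviation, train set only

x_train_norm = (x_train - mu) / sigma   # element-wise, as described
x_test_norm = (x_test - mu) / sigma     # reuse the SAME train statistics
```

The test set is deliberately transformed with the train-set mu and sigma, so the first layer sees the same input distribution at train and test time.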
skip a few questions. [00:54:03] So let's delve a little more into vanishing and exploding gradients. In order to get an intuition of why we have this vanishing or exploding gradient problem, we can consider a network which is very, very deep and has a two-dimensional input, and so on. Let's say we have ten layers in total, 10 layers plus an output layer. Assume all the activations are identity functions, and assume that the biases are equal to 0. If you compute y_hat, the output of the network, with respect to the input, you know that y_hat would be equal to W^[L], where capital L denotes the last layer, times a^[L-1], plus b^[L]; but b^[L] is 0, so we can remove it: W^[L] * a^[L-1]. You know that a^[L-1] is W^[L-1] times a^[L-2], because the activation is an identity function, and so on. You can go back
and you will get that y_hat = W^[L] * W^[L-1] * ... * W^[1] * x. You get something like that, right? So now let's consider two cases. Consider the case where the W matrices are a little bigger than the identity, a little larger than the identity matrix in terms of values; let's say every W^[l], and these are two-by-two matrices, right, is the matrix [[1.5, 0], [0, 1.5]]. What's the consequence? The consequence is that this whole product here is going to be equal to [[1.5^L, 0], [0, 1.5^L]], and it will make the value of y_hat explode, just because this number is a tiny little bit more than one. Same phenomenon if we had 0.5 instead of 1.5 here: the multiplicative value of all these matrices will be 0.5^L
here, [00:57:11] and y_hat will always be very close to zero. So you see, the issue with vanishing and exploding gradients is that all the terms get multiplied by each other, and if you end up with numbers that are smaller than 1, you will get a totally vanished gradient when you go back; if you have values that are a little bigger than 1, you will get an exploding gradient. We did it as a forward propagation equation, but we could have done exactly the same analysis with the derivatives, assuming the derivatives of the weight matrices are a little lower than the identity or a little higher than the identity. So we want to avoid that, and one way, which is not perfect, to avoid this is to initialize your weights properly: initialize them into the right range of values. So you agree that we would prefer the weights to be around 1, as close as possible to 1.
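The 1.5-versus-0.5 argument in numbers, as a minimal sketch with L = 10 as in the example:

```python
import numpy as np

L = 10
W_big = 1.5 * np.eye(2)     # a little larger than the identity
W_small = 0.5 * np.eye(2)   # a little smaller than the identity
x = np.ones((2, 1))

y_big = np.linalg.matrix_power(W_big, L) @ x      # entries are 1.5**10, about 57.7
y_small = np.linalg.matrix_power(W_small, L) @ x  # entries are 0.5**10, about 0.001
print(y_big.ravel(), y_small.ravel())
```

Even at only ten layers the output is already off by orders of magnitude in both directions; at greater depth the effect compounds exponentially.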
If they're very close to 1, we can probably avoid the vanishing and exploding gradient problem. [00:58:17] So let's look at the initialization problem. The first thing to look at is the example of one neuron. If you consider this neuron here, which has a bunch of inputs and outputs an activation a, you know that the equation inside the neuron is a equals some function of z, let's say sigmoid of z, and you know that z is equal to w1 * x1 + w2 * x2 + ... + wn * xn, so it's a dot product between the w's and the x's. The interesting thing to notice is that we have n terms here. In order for z to not explode, we would like all of these terms to be small. If the w's are too big, then this sum will explode with the size of the input of the layer. So instead, if we have a large n, meaning the input is very large, what we want is very small wi's: the larger n,
the smaller wi has to be. So based on this intuition, it seems that it would be a good idea to initialize the wi's with something that is close to 1/n. We have n terms; the more terms we have, the more likely z is going to be big. But if our initialization says the more terms you have, the smaller the value of the weights, we should be able to keep z in a certain range that is appropriate to avoid vanishing and exploding gradients. So this seems to be a possible initialization scheme. [01:00:22] In practice, I'm going to write a few initialization schemes that we're not going to prove; if you're interested in seeing more proofs of that, you can take CS230, where we prove these initialization schemes. I'll take down the board. [01:00:48] So there are a few initializations that are commonly used, and again, this is very practical; people have been testing a lot of initializations, but they ended up using those. One is
to initialize the weights, and I'm writing the code for those of you who know NumPy, I'm not going to compile it here: W = np.random.randn(shape) * np.sqrt(1 / n^[l-1]), with whatever shape you're using, element-wise times the square root of one over n^[l-1]. So what does that mean? It means I will look at the number of inputs; I'm writing n^[l-1] here, and the l-1 means I'm looking at how many inputs are coming into my layer, assuming we're at layer l. I'm going to initialize the weights of this layer proportionally to the number of inputs that are coming in, so the intuition is very similar to what we described there. This initialization has been shown to work very well for sigmoid activations. What's interesting is, if you use ReLU, it's been observed that putting a two here instead of a one makes the network train better, and again, it's very practical.
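The board's line of NumPy, written out (the layer sizes are made up for the demo; n_prev plays the role of n^[l-1]):

```python
import numpy as np

np.random.seed(0)
n_prev, n_curr = 500, 100   # made-up layer sizes; n_prev is n^[l-1]

# sqrt(1/n) scaling, the variant that works well with sigmoid activations
W_sigmoid_layer = np.random.randn(n_curr, n_prev) * np.sqrt(1.0 / n_prev)

# sqrt(2/n) scaling, the "put a two instead of a one" variant for ReLU
W_relu_layer = np.random.randn(n_curr, n_prev) * np.sqrt(2.0 / n_prev)

# More incoming units means smaller initial weights, keeping z = w . x in range.
print(W_sigmoid_layer.std(), W_relu_layer.std())
```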
And again, it's very practical; this is one of the areas where we need more theory, but a lot of observations have been made so far. If you guys want to take that as a project and see why it happens, it would be interesting. [01:02:42] Okay, and finally there is a more common one, called Xavier initialization, which proposes to initialize the weights using the square root of 1 over n^[l-1], for tanh. And there is another one, which I believe is the Glorot initialization, that recommends initializing the weights of a layer using the following formula. Quickly, the intuition behind this last one, which is very often used: we're doing the same thing, but also for the backpropagated gradients. The weights are going to multiply the backpropagated gradient, so we also need to look at how many inputs we have during backpropagation: n^[l] is the number of inputs you have during backward propagation, and n^[l-1] is the number of inputs you have during forward propagation, so we take an average of those. [01:04:16] And the reason we have a random function here is that if you don't initialize your weights randomly, you end up with a problem called the symmetry problem, where every neuron learns roughly the same thing. To avoid that, you make the neurons start at different places and let them evolve as independently from each other as possible. [01:04:37] So now we have two choices: either we go over regularization or over optimization. How much have you talked about regularization so far? L1, L2, early stopping, all that? Everybody remembers it a little bit, so let's go over optimization, I guess, and then we'll do some regularization depending on the time we have.
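For reference, the schemes above can be sketched in NumPy. The √(1/n^[l-1]) form and the ReLU variant (a 2 instead of a 1) are as described in the lecture; the exact formula for the last scheme was written on the board and is not in the transcript, so the standard Glorot form √(2/(n^[l-1]+n^[l])), which matches the "average the forward and backward fan" description, is an assumption here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_prev, n_curr = 300, 100        # n^[l-1] inputs into layer l, n^[l] units in layer l
shape = (n_curr, n_prev)

# sqrt(1 / n^[l-1]): shown to work well with sigmoid activations
w_sigmoid = rng.standard_normal(shape) * np.sqrt(1.0 / n_prev)

# He-style variant: a 2 instead of a 1, observed to train better with ReLU
w_relu = rng.standard_normal(shape) * np.sqrt(2.0 / n_prev)

# fan-averaging form (assumed Glorot formula, not in the transcript):
# accounts for the number of inputs in both the forward and the backward pass
w_avg = rng.standard_normal(shape) * np.sqrt(2.0 / (n_prev + n_curr))

print(w_sigmoid.std(), w_relu.std(), w_avg.std())
```

The random draw itself is what breaks the symmetry problem described above; the multiplier only sets the scale.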
[01:05:11] So I believe so far you've seen gradient descent and stochastic gradient descent as two possible optimization algorithms. In practice, there is a trade-off between these two, called mini-batch gradient descent. What is the trade-off? Batch gradient descent is cool because you can use vectorization: you can take a batch of inputs and forward propagate them all at once using vectorized code. Stochastic gradient descent's advantage is that the updates are very quick. Imagine you have a data set with 1 million images and you want to do batch gradient descent; you know how long it's going to take to do one update? Very long. We don't want that, because maybe we don't need to go over the full data set to get a good update; maybe an update based on a thousand examples already gives us the right direction for the gradient. It won't be as good as one based on a million examples, which would be a very good approximation, but that's why most people use mini-batch gradient descent: you get a trade-off between stochasticity and vectorization. [01:06:15] In terms of notation, I'm going to call X the matrix (x^(1), x^(2), ..., x^(m)), and capital Y the same matrix with the y's. We have m training examples, and I'm going to split them into batches. I'll call the first batch X^{1}, and so on up to X^{T}. X^{1} might contain x^(1) through x^(1000), assuming a batch of a thousand examples; X^{2} then contains x^(1001) through x^(2000), and so on. That's the notation for a batch when I use curly brackets; same for Y. [01:07:33] So in terms of the algorithm, how does mini-batch gradient descent work? We're going to iterate.
For t from 1 to however many iterations you want to do: select a batch (X^{t}, Y^{t}), forward propagate the batch, and backpropagate the batch. [01:08:23] By forward propagation, I mean you send the whole batch through the network, compute the loss function for every example of the batch, sum them together, and compute the cost function over the entire batch, which is the average of the loss functions. Assuming the batch is of size 1,000, the cost is the average of the 1,000 per-example losses. [01:09:00] And after the backpropagation, of course, you update W^[l] and b^[l] for all l, for all the layers; that's the usual update equation. [01:09:30] In terms of graphs, what you're likely to see is that for batch gradient descent, your cost function J decreases smoothly when you plot it against the number of iterations. On the other hand, with mini-batch gradient descent you're most likely to see something that is also decreasing as a trend, but jagged, because the gradient is approximated and doesn't necessarily go straight toward the lowest point of the loss function. The smaller the batch, the more stochasticity, so the more noise you'll see on your cost-function graph. [01:10:29] And if we plot the loss function again, as a top view of the contours, assuming we're in two dimensions, stochastic gradient descent and batch gradient descent trace different paths: there seem to be fewer iterations with the red algorithm, but each of its iterations is much heavier to compute, while each of the green iterations is very, very quick. This is the trade-off.
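The loop described above can be put into code. This is a minimal sketch of my own on a toy least-squares model (the lecture's actual network and cost are not in the transcript); it splits m examples into batches of 1,000, forward propagates each batch, averages the gradient over the batch, and updates after every batch.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 5000, 10
X = rng.standard_normal((m, d))
true_w = rng.standard_normal(d)
y = X @ true_w + 0.1 * rng.standard_normal(m)   # noisy linear data (assumed toy setup)

w = np.zeros(d)
alpha, batch_size = 0.1, 1000

for t in range(50):                  # "for t = 1 to however many iterations you want"
    idx = rng.permutation(m)         # shuffle, then walk through the batches
    for start in range(0, m, batch_size):
        b = idx[start:start + batch_size]       # select a batch X^{t}, Y^{t}
        pred = X[b] @ w                         # forward propagate the batch
        grad = X[b].T @ (pred - y[b]) / len(b)  # gradient of the batch-averaged cost
        w -= alpha * grad                       # update after each batch

cost = np.mean((X @ w - y) ** 2) / 2
print(cost)
```

Each update uses only 1,000 examples, so it is cheap and vectorized, yet the cost still decreases, noisily, toward the same minimum a full-batch method would reach.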
Now there is another algorithm I want to go over, called the momentum algorithm, sometimes called gradient descent plus momentum. [01:11:32] So what's the intuition behind momentum? Let's look at this loss contour plot; I'm drawing an extreme case just to illustrate the intuition. Assume you have a loss that is very extended in one direction: this direction is very elongated and the other one is smaller, and you're starting at a point like this one. The gradient descent algorithm by itself is going to follow a path that is orthogonal to the current contour of the loss: it goes there, and then there, and then there, and so on, oscillating. What you would like is to move faster along the horizontal axis and slower along the vertical one: on this axis you'd like smaller updates, and on this axis larger updates. If that happened, we would probably end up at the minimum much quicker than we currently do. [01:12:58] To do that, we're going to use a technique called momentum, which looks at the past gradients, that is, the past updates. Assume we're somewhere here. Gradient descent doesn't look at its past at all: it computes the forward propagation, computes the backprop, looks at the direction, and goes in that direction. What momentum says is: look at the past updates you did, and take them into account to find the right way to go. If you take an average of the past updates, you average an update going up with the update after it going down; the average on the vertical axis is going to be small, because one went up and one went down, but on the horizontal axis both went in the same direction, so the update doesn't change much on that axis. So you're most likely to follow a much more direct path if you use momentum. [01:14:01] Does the intuition make sense? That's why we want to use momentum. For those of you who do physics, you can think of it like physical momentum, inertia: if you launch a rocket and want to turn it quickly, it's not going to move, because the rocket has a certain mass and a certain momentum; you cannot change its direction very noisily. [01:14:39] So let's look at the implementation of momentum gradient descent; I believe we're almost done, right? Yeah, okay, so let's look at it quickly. Gradient descent was W := W − α ∂L/∂W. What we're going to do is use another variable, called the velocity, which is a running average of the previous velocity and the current weight update. Instead of the update being the derivative directly, we update the velocity: the velocity is a variable that tracks the direction we should take, considering the current update and also the past updates, weighted by a factor β. The interesting point is that in terms of implementation it's one more line of code, and in terms of memory it's just one additional variable, and yet it has a big impact on the optimization. [01:15:54] There are many more optimization algorithms that we're not going to see together today; in CS 230 we teach RMSprop and Adam.
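The velocity update described above, as a small sketch (my own illustration; β = 0.9 and the elongated quadratic loss are assumed, chosen to match the contour intuition): one extra variable, one extra line of code.

```python
import numpy as np

def grad(w):
    # gradient of an elongated quadratic loss 0.5 * (25*w0^2 + w1^2):
    # steep in one direction, shallow in the other, like the contour plot
    return np.array([25.0 * w[0], 1.0 * w[1]])

w = np.array([1.0, 1.0])
v = np.zeros(2)          # the velocity: the one additional variable
alpha, beta = 0.02, 0.9

for _ in range(200):
    v = beta * v + (1 - beta) * grad(w)  # running average of past updates
    w = w - alpha * v                    # step along the smoothed direction

print(w)  # close to the minimum at the origin
```

The averaging damps the oscillation along the steep axis (successive gradients cancel there) while the consistent component along the shallow axis survives, which is exactly the behavior sketched on the board.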
Those are probably the optimizers used the most in deep learning, and the reason is that if you come up with an optimization algorithm, you still have to prove that it works very well on a wide variety of applications before researchers adopt it for their research. Adam brings momentum into the optimization algorithm. Okay, thanks guys, and that's all for deep learning in CS229 so far.

================================================================================
LECTURE 013
================================================================================
Lecture 13 - Debugging ML Models and Error Analysis | Stanford CS229: Machine Learning (Autumn 2018)
Source: https://www.youtube.com/watch?v=ORrStCArmP4
---
Transcript

[00:00:04] Okay, happy Halloween. What I want to do today is share with you advice for applying machine learning; we've probably alluded to this before. I think over the last several weeks you've learned a lot about the mechanics of how to build different learning algorithms, everything from logistic regression to SVMs, random forests, and neural networks, and what I want to do today is share with you some principles for helping you become efficient at how you apply all of these things to solve whatever application problem you might want to work on. [00:00:44] A lot of today's material is not that mathematical, but it's also some of the hardest material in this class to understand. It turns out that when you give advice on how to apply a learning algorithm, such as "don't waste lots of time collecting data unless you have confidence it's useful to actually spend all that time," people easily agree: of course you shouldn't waste time collecting lots of data unless you have some confidence it's actually a good use of your time. That's a very easy thing to agree with. But the hard thing is when you go home today and you're actually
you're actually working on your class project right to [00:01:24] working on your class project right to apply the principles we talked about [00:01:26] apply the principles we talked about today when you're actually on the ground [00:01:27] today when you're actually on the ground talking to your teammates saying alright [00:01:29] talking to your teammates saying alright do we collect more data for our class [00:01:31] do we collect more data for our class project now or not to make the right [00:01:33] project now or not to make the right judgment call for that to map the [00:01:35] judgment call for that to map the concepts you learn today so when you're [00:01:37] concepts you learn today so when you're actually in the hot seat you know making [00:01:39] actually in the hot seat you know making a decision do we go and spend another [00:01:41] a decision do we go and spend another two days scraping data off the internet [00:01:43] two days scraping data off the internet or do goons tune this out tune this [00:01:45] or do goons tune this out tune this parameters algorithm to actually make [00:01:46] parameters algorithm to actually make those decisions it's actually it it [00:01:49] those decisions it's actually it it often takes a lot of careful thinking to [00:01:52] often takes a lot of careful thinking to make the mapping from the principles we [00:01:54] make the mapping from the principles we talked about today and they prepare all [00:01:55] talked about today and they prepare all of you go yep that makes sense but they [00:01:57] of you go yep that makes sense but they actually do that when you're in the hot [00:01:59] actually do that when you're in the hot seat making the decisions that that's [00:02:01] seat making the decisions that that's something that [00:02:02] something that we often take take some careful thought [00:02:04] we often take take some careful thought I guess and I think you know for a long [00:02:09] I guess and I think you know 
for a long time [00:02:10] time positive machine learning have been an [00:02:12] positive machine learning have been an art right we're you know we'll go [00:02:14] art right we're you know we'll go through these people that have been [00:02:15] through these people that have been doing it for 30 years and we say hey my [00:02:17] doing it for 30 years and we say hey my learning algorithm doesn't work you know [00:02:20] learning algorithm doesn't work you know what do we do now and then they would [00:02:22] what do we do now and then they would have some judgment or you go people to [00:02:24] have some judgment or you go people to ask me and for some reason because we've [00:02:26] ask me and for some reason because we've done it for a long time we'll say oh [00:02:28] done it for a long time we'll say oh yeah get more dates or tune that [00:02:29] yeah get more dates or tune that parameter or try a new network of big [00:02:31] parameter or try a new network of big hidden units and for some reason that [00:02:33] hidden units and for some reason that work and what I hope to do today is turn [00:02:36] work and what I hope to do today is turn that black magic that hot that that arts [00:02:39] that black magic that hot that that arts into much of a science so that you can [00:02:40] into much of a science so that you can much more systematic make these [00:02:42] much more systematic make these decisions yourself rather than talk to [00:02:44] decisions yourself rather than talk to someone there's done this for 30 years [00:02:47] someone there's done this for 30 years then that for some reason is able to [00:02:49] then that for some reason is able to give you the good recommendations even [00:02:51] give you the good recommendations even if you know but turn it from more of a [00:02:55] if you know but turn it from more of a black art into a more of a systematic [00:02:57] black art into a more of a systematic engineering discipline um and and just a 
[00:03:00] And just one note: some of what I'll say today is not the best approach for developing novel machine learning research. If your main goal is to write research papers, some of what I'll say will apply and some will not; I'll come back to that later. Most of today is focused on how you build stuff that works, how you build applications that work. [00:03:23] So the three key ideas you'll see today: first is diagnostics for debugging learning algorithms. One thing you might not know, or, if you've worked on a class project, maybe you know this already, is that when you implement a learning algorithm for the first time, it almost never works, right, not the first time. I still remember a weekend about a year ago when I implemented softmax regression on my laptop, and it worked the first time. Even to this day I remember that feeling of surprise: I knew there had to be a bug, I went in to try to find the bug, and there wasn't one. It is so rare that a learning algorithm works the first time that I still remember it a year later. [00:04:10] So a lot of the workflow of developing learning algorithms actually feels like a debugging workflow, and my hope is that you become systematic at that. The two key ideas here are error analysis and ablative analysis: how to analyze the errors of your learning algorithm to understand what's not working, which is error analysis, and how to understand what is working, which is ablative analysis. And then finally, some philosophies on how to get started on a machine learning project, such as your class project. [00:04:41] Okay, so let's start by discussing debugging learning algorithms. What happens all the time is: you have an idea for a machine learning application, you implement something, and then it doesn't work as well as you hoped, and the key question is what you do next. Whenever I work on a machine learning algorithm, that's actually most of my workflow: we usually have something implemented that's just not working that well, and your ability to decide what to do next has a huge impact on your efficiency. [00:05:13] When I was an undergrad at Carnegie Mellon University, I had a friend who would debug their code like this: they'd write a piece of code, and, as always when we write code initially, there were a bunch of syntax errors, and their debugging strategy was to delete every single line of code that generated a syntax error, because that was a good way to get rid of them. That wasn't a good strategy. In machine learning as well, there are good and less good debugging strategies.
[00:05:42] well they're good and less good debugging strategies right um so let's [00:05:45] debugging strategies right um so let's not so motivating example uh let's say [00:05:48] not so motivating example uh let's say building an anti spam classifier and [00:05:51] building an anti spam classifier and let's say you've carefully chosen a [00:05:54] let's say you've carefully chosen a small set of hundred words to use as [00:05:55] small set of hundred words to use as features so instead of all using you [00:05:57] features so instead of all using you know ten thousand or fifty thousand [00:05:58] know ten thousand or fifty thousand words you've chosen a hundred words that [00:06:01] words you've chosen a hundred words that you think could be most relevant to [00:06:04] you think could be most relevant to anti-spam and let's say you start off [00:06:08] anti-spam and let's say you start off implementing logistic recognization I [00:06:11] implementing logistic recognization I think when talked about this is also you [00:06:13] think when talked about this is also you know there's a frequencies in Beijing in [00:06:14] know there's a frequencies in Beijing in school but you can think of [00:06:15] school but you can think of basing logistic regression where you [00:06:18] basing logistic regression where you have the maximum likelihood term on the [00:06:21] have the maximum likelihood term on the left and then that second term is the [00:06:23] left and then that second term is the regularization term right so that's so [00:06:26] regularization term right so that's so that's Bayesian logistic regression if [00:06:29] that's Bayesian logistic regression if you are Bayesian or which is regression [00:06:31] you are Bayesian or which is regression with regularization if you're you know [00:06:34] with regularization if you're you know using frequency statistics and let's say [00:06:37] using frequency statistics and let's say that they're just regression with 
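The objective just described, a maximum-likelihood term plus a regularization term, can be sketched numerically. This is a minimal illustration, not code from the lecture; all the names (`sigmoid`, `reg_log_likelihood`, `lam`) are my own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reg_log_likelihood(theta, X, y, lam):
    """L2-regularized (Bayesian) logistic regression objective, as a sketch:
    sum_i log p(y_i | x_i; theta)  -  lam * ||theta||^2,
    with labels y in {0, 1}. Maximizing it trades off fitting the data
    against keeping the parameters small."""
    p = sigmoid(X @ theta)
    log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return log_lik - lam * np.dot(theta, theta)
```

Note that at `theta = 0` the classifier predicts 0.5 for everything, so the objective is `m * log(1/2)` regardless of `lam`, since the penalty vanishes at zero.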
[00:06:39] And let's say that logistic regression with regularization, or Bayesian logistic regression, gets twenty percent test error, which is unacceptably high: you're making one in five mistakes on your spam filter. So what do you do next? Um, now, when you implement an algorithm like this, what many teams will do is try improving the algorithm in different ways. So what many teams will do is say, oh yeah, I remember, you know, we like big data, more data always helps, so let's get some more data and hope that solves the problem. So some teams will say: let's get more training examples. And it's actually true, you know, more data pretty much never hurts, it almost always helps, but the key question is how much. Or you could try using a smaller set of features: maybe your features are only somewhat relevant, so let's get rid of some features. Or you could try having a larger set of features: if the current features seem too small or insufficient, add more features. Or you might want other designs of features: you know, instead of just using features from the email body, you could use features from the email header. The email header has not just the from, to, and subject, but also routing information about the set of servers on the internet that the email took to get to you. Or you could try running gradient descent for more iterations; that, you know, never hurts. Or instead of gradient descent, let's switch to Newton's method. Or let's try a different value for lambda. Or let's say, you know, forget about Bayesian logistic regression or logistic regression with regularization, let's use a totally different algorithm, an SVM or something, right.

[00:08:27] So what happens in a lot of teams is that someone will pick one of these ideas kind of at random. It depends on, you know, what they happened to read the night before, or their experience on the last project. And sometimes your project leader will pick one of these and just say, let's try that, and you spend a few days or a few weeks trying that, and it may or may not be the best thing. So I think that in my teams' machine learning workflow: first, if you actually sit down, you and a few others, and brainstorm a list of the things you could try, you're actually already ahead of a lot of teams, because a lot of teams will kind of just go by gut feeling, right, or the most opinionated person will pick one of these things at random and do that. But if you brainstorm a list of things and then try to evaluate the different options, you're already ahead of many teams.
[00:09:27] And, you know, I think that unless you analyze these different options, it's hard to know which of these is actually the best option, right. So, um, the most common diagnostic I end up using in developing learning algorithms is a bias versus variance diagnostic. And I think we've talked about bias and variance already: if a classifier is highly biased, then it tends to underfit the data. So high bias is... well, you guys remember this, right? If you have a data set like this, a highly biased classifier may be much too simple, a high variance classifier may be much too complex, and something in between, you know, trades off bias and variance appropriately, right. So that's bias and variance. And so it turns out that one of the most common diagnostics, used in pretty much every single machine learning project, is a bias versus variance diagnostic, to understand how much of your learning algorithm's problem comes from bias and how much of it comes from variance.

[00:10:42] And, you know, I've had former PhD students that learned about bias and variance while doing their PhD, and then sometimes, even a couple of years after they graduated from Stanford and worked, you know, on more problems, they'd actually tell me that their understanding of bias and variance has continued to deepen, right, for many years. So it's one of those concepts that, if you can systematically apply it, makes you much more efficient, and this is really maybe the single most useful tool: understanding bias and variance for debugging learning algorithms. And so what I'm going to describe is a workflow where you would run some diagnostics to figure out what the problem is, and then try to fix whatever the problem is.

[00:11:28] And so, just to summarize this example: you know, your test error turns out to be high, and you suspect the problem is either high variance or high bias. And so it turns out that there's a diagnostic that lets you look at your algorithm's performance and try to figure out how much of the problem is variance and how much of the problem is bias. Oh, and I'm going to say test error, but while you're developing, you should really be doing this with a dev set, or development set, rather than a test set, right. But let me explain this diagnostic in greater detail.

[00:12:05] So it turns out that if you have a classifier with very high variance, then, looking at the performance on the test set, or actually, the better practice is to use the holdout cross-validation set, that is, the development set, you'll see that your classifier has a much lower error on the training set than on the development set. But in contrast, if you have high bias, then the training error and the dev set error will both be high.

[00:12:41] So let me illustrate this with a picture. This is a learning curve, and what that means is: on the horizontal axis, you're going to vary the number of training examples, right. And when I talked about bias and variance, I had a plot where the horizontal axis was the degree of the polynomial: first order, second order, third order, fourth order polynomial. In this plot, the horizontal axis is different: it's the number of training examples. And so it turns out that when you train a learning algorithm, you know, the more data you have, usually the better your development set error, the better your test set error; the test error usually goes down when you increase the number of training examples.
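The rule of thumb for reading bias versus variance off the two errors (training error much lower than dev error: variance; both high: bias) can be written down as a tiny helper. This is my own sketch, not something from the lecture, and the `gap_tol` threshold of 0.02 is an arbitrary illustrative choice:

```python
def diagnose(train_err, dev_err, target_err, gap_tol=0.02):
    """Rough bias/variance readout from the errors being compared.
    High bias: not even the training error reaches the target.
    High variance: a large gap between training and dev error."""
    problems = []
    if train_err > target_err:
        problems.append("high bias")       # not fitting even the training set
    if dev_err - train_err > gap_tol:
        problems.append("high variance")   # big train/dev gap
    return problems if problems else ["looks ok"]
```

For the spam example, `diagnose(0.02, 0.20, 0.05)` would report a variance problem, while `diagnose(0.15, 0.16, 0.05)` would report a bias problem.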
[00:13:24] The other thing, um... let's say that you're hoping to achieve a certain level of desired performance. You know, for business reasons, you'd like your spam classifier to achieve a certain desired level of performance, and sometimes the desired level of performance is to do about as well as a human can; that's a common business objective, depending on your application, but sometimes it could be different, right. So your product manager, you know, tells you, or if you're leading the product, you decide, that you need to hit a certain target level of performance in order to have a very useful spam filter.

[00:14:02] So the other plot to add to this, which will help you analyze bias versus variance, is a plot of the training error. Now, one property of the training error is that it increases as the training set size increases. Because if you have only one example, right, let's say you're building a spam classifier and you have only one training example, then any algorithm, you know, can fit one training example perfectly. And so if your training set size is very small, the training set error is usually zero, right, because if you have five training examples, probably you can fit all five examples perfectly, and it's only if you have a bigger training set that it becomes harder for the learning algorithm to fit your training data that well. Well, in the linear regression case: if you have one example, yeah, you can fit a straight line to the data; if you have two examples, you can fit pretty much any model to the data and have zero training error. It's only if you have a very large training set that a classifier like logistic regression or linear regression may have a harder time fitting all of your training examples.
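The shape of these two curves is easy to reproduce on a toy problem. Below is a sketch, entirely illustrative and not from the lecture: a straight-line fit to noisy linear data, with training and dev error tabulated as the training set grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(m):
    # toy 1-D problem, y = 2x + noise, as a stand-in for a real task
    x = rng.uniform(-1.0, 1.0, m)
    return x, 2.0 * x + 0.3 * rng.standard_normal(m)

x_dev, y_dev = make_data(2000)  # fixed holdout / development set

def learning_curve(sizes, degree=1):
    """For each training set size m, fit a polynomial of the given degree
    and record (m, training error, dev error) as mean squared errors."""
    rows = []
    for m in sizes:
        x_tr, y_tr = make_data(m)
        coef = np.polyfit(x_tr, y_tr, degree)
        tr = np.mean((np.polyval(coef, x_tr) - y_tr) ** 2)
        dv = np.mean((np.polyval(coef, x_dev) - y_dev) ** 2)
        rows.append((m, tr, dv))
    return rows
```

With two examples, the line passes through both points and the training error is essentially zero; as m grows, training error climbs toward the noise level while dev error falls toward it, which is exactly the pair of curves drawn on the board.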
[00:15:08] that's why training error or average training error averaged over your [00:15:09] training error averaged over your training set generally increases as you [00:15:13] training set generally increases as you increase the training set size so um now [00:15:18] increase the training set size so um now there are two characteristics of this [00:15:21] there are two characteristics of this plot that suggest that if you plot the [00:15:24] plot that suggest that if you plot the learning curves if you see this dis [00:15:27] learning curves if you see this dis pattern to suggest that theorem has a [00:15:30] pattern to suggest that theorem has a large bias problem right and the two [00:15:33] large bias problem right and the two properties written in the bottom one the [00:15:35] properties written in the bottom one the weaker signal the one that's harder to [00:15:37] weaker signal the one that's harder to rely on is that the development set [00:15:40] rely on is that the development set error or the test set error is still [00:15:41] error or the test set error is still decreasing as you increase the training [00:15:43] decreasing as you increase the training set size so the green curve is still you [00:15:45] set size so the green curve is still you know still looks like it's going down [00:15:46] know still looks like it's going down and so this suggests that if you [00:15:49] and so this suggests that if you increase the training set size and [00:15:50] increase the training set size and extrapolate further to the right that [00:15:52] extrapolate further to the right that the curve would keep on going down this [00:15:55] the curve would keep on going down this turns out to be a weaker signal because [00:15:57] turns out to be a weaker signal because sometimes we look at the curve like that [00:15:59] sometimes we look at the curve like that is actually quite hard to tell you have [00:16:02] is actually quite hard to tell you have to extrapolate to the 
right so if you [00:16:05] to extrapolate to the right so if you double the training set size how much [00:16:07] double the training set size how much further would the green curve go down [00:16:08] further would the green curve go down then she kind of hard to tell so I find [00:16:10] then she kind of hard to tell so I find this a useful signal but sometimes it's [00:16:12] this a useful signal but sometimes it's been hard to judge you know exactly [00:16:14] been hard to judge you know exactly where the curve will go of you [00:16:16] where the curve will go of you extrapolate to the right um the stronger [00:16:19] extrapolate to the right um the stronger signal is actually the second one the [00:16:21] signal is actually the second one the fact that there's a huge gap between [00:16:22] fact that there's a huge gap between your training error and your test set [00:16:24] your training error and your test set error or your training or your jeff's [00:16:26] error or your training or your jeff's that there would better thing to look at [00:16:27] that there would better thing to look at is actually a stronger signal that um [00:16:30] is actually a stronger signal that um this particular learning algorithm has [00:16:32] this particular learning algorithm has has high variance right because as you [00:16:37] has high variance right because as you increase the training set size you find [00:16:40] increase the training set size you find that the gap between training test error [00:16:44] that the gap between training test error usually closes usually reduces and so [00:16:47] usually closes usually reduces and so there's no a lot of room for making your [00:16:52] there's no a lot of room for making your test set error become closer to your [00:16:54] test set error become closer to your training [00:16:55] training and so that's if you see a learning [00:16:57] and so that's if you see a learning curve like there's a strong side that um [00:16:59] curve like 
[00:16:57] And so if you see a learning curve like this, that's a strong sign that you have a variance problem. Okay, now let's look at what the learning curve will look like if you have a bias problem. So this is a typical learning curve for high bias: that's your dev set error, your development set error from holdout cross-validation, or, say, your test error; you're hoping to hit a level of performance like that; and your training error looks like that. And so one sign that you have a high bias problem is that the algorithm is not doing that well even on the training set, right. Even on the training set, you know, you're not achieving your desired level of performance, and it's like... imagine looking at it from the outside: this algorithm has seen these examples, and even on the examples it has seen, it's not doing as well as you were hoping. So clearly the algorithm is not fitting the data well enough, and this is a sign that you have a high bias problem: maybe the features you're learning from, or the model, are too simple.

[00:18:06] And the other signal is that there's a very small gap between the training and test error, right. And you can imagine, when you see a plot like this, no matter how much more data you get, go ahead and extrapolate to the right as far as you want: no matter how much more data you get, no matter how far you extrapolate to the right of this plot, the blue curve, the training error, is never going to come back down to hit the desired level of performance. And because the test set error is, you know, generally higher than your training set error, no matter how much more data you have, no matter how far you extrapolate to the right, the error is never going to come down to your desired level of performance.
[00:18:50] So if you get a training error and a test error curve that look like this, you kind of know that, while getting more training data may help, right, the green curve could come down a little bit if you get more training data, the act of getting more training data by itself will never get you to where you want to go.

[00:19:14] Okay, so let's work through this example. For each of the first four bullets here, each of the first four ideas fixes either a high variance or a high bias problem, right. So let's go through them, and let me ask, for the first one: do you think it helps you fix high bias or high variance? [Student response] Cool, all right, high variance, right. And will anyone say why? [Student response] Yes, right. I guess if you're fitting a very high order polynomial that wiggles like this, then having more data will make it, if anything, at least oscillate less crazily, even if you fit a high order polynomial. And if you look at the high variance curve... wow, it's not advancing, my slides are stuck for some reason.

[00:20:44] Right, so this is the high variance plot, and if you have a learning algorithm with high variance, then hopefully, you know, if you extrapolate to the right, there is some hope that the green curve will keep on coming down. So getting more training data, if you have high variance, which is if you're in this situation, looks like it could help, so this is worth trying. Can't guarantee it will work, but it's worth trying.

[00:21:21] Oh, I see, yes, sorry. [Student question] This is good. So let's see: the curves will look like this assuming that your training data is i.i.d., right; the training, dev, and test sets are all drawn from the same distribution. There is learning theory that suggests that in most cases the learning curve should decay as one over the square root of m; that's the rate at which it should decay, until it asymptotes to some Bayes error. That's what the learning theory says, if that makes sense. And sometimes a learning algorithm's error doesn't go to zero, right, because sometimes the data is just ambiguous. I guess, yeah, my PhD students and I, we do a lot of work on healthcare, and sometimes you look at an x-ray and it's just blurry, and you try to make a diagnosis, right, and you just can't. Or I have students working on predicting patient mortality, what's the chance of someone dying in the next year or so, and sometimes, looking at a patient's medical record, you just can't tell, right, whether they will pass away in the next year or so. Or you look at an x-ray and you just can't tell whether there is a tumor or not, because it's just blurry. So a learning algorithm's error doesn't always decay to zero.
[00:22:37] Says that as m increases, the error should decay roughly at a rate of 1 over the square root of m, down to that baseline error, which is called the Bayes error, which is the best that you could possibly hope anything could do, given how blurry the images are, given how ambiguous the data is. Right, all right. [00:22:55] Okay, so: trying a smaller set of features, that fixes a high-variance problem, right. And one concrete example would be: if you have this data set and you're fitting, you know, a tenth-order polynomial, and the curve oscillates all over the place, that's high variance. You could say, well, maybe I don't need a tenth-order polynomial. [00:23:28] Maybe you say: maybe I don't need my features to be all of these things, up to the tenth power. Maybe, if this is too high variance, get rid of a lot of the features.
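A quick sketch of that fix (my own toy numbers, not the lecture's): data generated from a noisy quadratic, fit with too few, the right number, and far too many polynomial features. The degree-1 fit is high bias; the degree-10 fit drives training error down but test error up, which is exactly the high-variance pattern that shrinking the feature set repairs.

```python
import numpy as np

# Bias/variance via feature-set size (illustrative numbers, not from lecture).
rng = np.random.default_rng(1)

def make(n):
    x = rng.uniform(-1, 1, n)
    y = 1.0 - 2.0 * x + 3.0 * x**2 + rng.normal(0, 0.3, n)  # truth: quadratic
    return x, y

x_tr, y_tr = make(20)        # small training set
x_te, y_te = make(2000)      # held-out set

def fit_mse(degree):
    coef = np.polyfit(x_tr, y_tr, degree)       # least-squares polynomial fit
    tr = float(np.mean((np.polyval(coef, x_tr) - y_tr) ** 2))
    te = float(np.mean((np.polyval(coef, x_te) - y_te) ** 2))
    return tr, te

train1, test1 = fit_mse(1)     # too few features: high bias (bad on both sets)
train2, test2 = fit_mse(2)     # right feature set
train10, test10 = fit_mse(10)  # too many features: high variance (train << test)
print((train1, test1), (train2, test2), (train10, test10))
```

Going the other way, from degree 1 to degree 2, is the "add more features to fix high bias" move discussed next.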
Just use a much smaller number of features, right; so that fixes high variance. And then, conversely, if you use a larger set of features, that fixes high bias, right? [00:23:52] Cool. Right, so if you're fitting a straight line to the data and it's not doing that well, then maybe actually add a quadratic term, just add more features, right; so that fixes high bias. And adding email header features: yes, generally I would try this if I were trying to reduce bias, right. [00:24:15] And so, in the workflow of how you develop a learning algorithm, here's what I would recommend. One of the things about building learning algorithms is that, for a new application problem, it's difficult to know in advance whether you're going to run into a high-bias or a high-variance problem. It is actually very difficult to know in advance what's going to go wrong with your learning algorithm. And so the advice I tend to give is: if
you work on a new application, implement a quick-and-dirty learning algorithm; have a quick-and-dirty implementation of something so you can run your learning algorithm, you know, sort of logistic regression. Start with something simple, then run this bias-variance type of analysis to see what went wrong, and then use that to decide what to do next, whether you go to more complex algorithms or whatever you try next. [00:25:21] The one exception to this is if you're working on a domain in which you have a lot of experience, right? So, for example, you know, I've done a lot of work on speech recognition, so because I've done that work I kind of have a sense of how much data is needed; for a new application, then, I might just build something more complicated from the get-go. Or if you're working on, say, face recognition, and
because you've read all the research papers, you have a sense of how much data is needed, then maybe it's worth trying something more complicated from the start, because you're building on a body of knowledge. But if you're working on a brand-new application, one that maybe, you know, no one in the published academic literature has worked on, or where you don't totally trust the published results to be representative of your problem, then I would usually recommend that you build a quick-and-dirty implementation, look at the bias and variance of the algorithm, and then use that to better decide what to try next. [00:26:17] So I think bias and variance is really, like, the single most powerful tool I know of, you know, for analyzing the performance of learning algorithms; I do this pretty much in every single machine learning application. There's one other pattern that I see quite often, which addresses the
second set of diagnostics, which is: is your optimization algorithm working? [00:26:47] So let me explain this with a motivating example. It turns out that when you implement a learning algorithm, you often have a few guesses for what's wrong, and if you can systematically test whether a hypothesis is right before you spend a lot of work trying to fix it, then you can be much more efficient. So let me explain that with a concrete example, so you understand those words I just said; maybe they're a little bit abstract. [00:27:15] Which is: let's say that, you know, you tune your logistic regression algorithm for a while, and let's say logistic regression gets two percent error on spam email and two percent error on non-spam, right? And it's okay to have two percent error on spam email, maybe, right; you know, so you have to read a little bit of spam email, it's like,
you know, that's okay. But two percent error on non-spam is just not really acceptable, because you're losing one in fifty important emails. [00:27:44] And let's say that, you know, your teammate also trains an SVM, and they find that an SVM using a linear kernel gets ten percent error on spam but 0.01 percent error on non-spam, right? Maybe not great, but for this purpose of illustration let's say this is acceptable. [00:28:05] But it turns out logistic regression is more computationally efficient, and it may be easier to update, right: as you get more examples, you can run a few more iterations of gradient descent. And let's say you want to ship a logistic regression implementation rather than an SVM implementation. So what do you do? [00:28:27] It turns out that one common question you have when training your learning algorithm is, you often wonder: is your optimization algorithm converging?
Right, so, you know, is gradient ascent converging? And so one thing you might do is draw a plot of the training optimization objective, J of theta, or the log likelihood, or whatever you're maximizing, versus the number of iterations. [00:28:54] And often the plot will look like that, right: the curve is kind of going up, but not that fast. And if you train it twice as long, or even ten times as long, will that help? Right, and again, training the algorithm for more iterations, you know, pretty much never hurts; if you regularize the algorithm properly, training the algorithm longer almost always helps, right, pretty much never hurts. [00:29:22] But is the right thing to do to go and burn another 48 hours of, you know, CPU or GPU cycles just to train this thing longer, in the hope it works better? Maybe, maybe not.
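One way to make that judgment less of a guess (a minimal sketch of my own, not CS229 starter code): record J(theta) at every iteration of gradient ascent, and ask whether the objective's recent gain still clears some threshold before burning more compute.

```python
import numpy as np

# Monitor the training objective J(theta) (here: logistic log-likelihood)
# across gradient-ascent iterations on small synthetic data.
rng = np.random.default_rng(2)
X = np.c_[np.ones(100), rng.normal(size=(100, 2))]   # intercept + 2 features
true_theta = np.array([0.5, 2.0, -1.0])
y = (rng.random(100) < 1.0 / (1.0 + np.exp(-X @ true_theta))).astype(float)

def log_likelihood(theta):
    z = X @ theta
    return float(np.sum(y * z - np.log1p(np.exp(z))))

theta = np.zeros(3)
history = []
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    theta += 0.01 * X.T @ (y - p)                    # batch gradient-ascent step
    history.append(log_likelihood(theta))

def still_improving(hist, window=50, tol=1e-4):
    # Has J gained more than tol * |J| over the last `window` iterations?
    return (hist[-1] - hist[-window]) > tol * abs(hist[-1])

print("train longer?", still_improving(history))
```

The window size and tolerance here are arbitrary placeholders; the point is only that "should I train longer?" becomes a measurement instead of a hunch.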
So is there a systematic way to tell, a better way to tell, whether you should invest a lot more time in running the optimization algorithm? Sometimes it's just hard to tell, right. [00:29:50] Now, the other question that you sometimes wonder about: a lot of the iteration of developing learning algorithms is looking at what the learning algorithm is doing and asking yourself, what are my guesses for what could be wrong? And maybe one of your guesses is: well, maybe I'm optimizing the wrong cost function, right? So here's what I mean. [00:30:11] What you care about is this weighted accuracy criterion, you know, sort of a sum over your dev set or test set, with weights on different examples, of whether it gets each one right, where the weights are higher for non-spam than spam, because you really want to make sure you label non-spam email correctly, right? So maybe that's the weighted accuracy criterion you care about.
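That criterion can be written as a(theta) = sum_i w_i * 1{prediction_i == y_i} / sum_i w_i. A small sketch, where the specific weights (10x on non-spam) are my own illustrative choice, not a number from the lecture:

```python
import numpy as np

# Weighted accuracy a(theta): heavier weights on non-spam examples, because
# mislabeling a real email costs much more than letting one spam through.
def weighted_accuracy(y_true, y_pred, w):
    y_true, y_pred, w = map(np.asarray, (y_true, y_pred, w))
    return float(np.sum(w * (y_true == y_pred)) / np.sum(w))

y_true = np.array([1, 1, 0, 0, 0])       # 1 = spam, 0 = non-spam
y_pred = np.array([1, 0, 0, 0, 1])       # one missed spam, one mislabeled ham
w = np.where(y_true == 0, 10.0, 1.0)     # non-spam errors weighted 10x

print(weighted_accuracy(y_true, y_pred, w))   # 21/32 = 0.65625
```

Note how the single mislabeled non-spam email dominates the score: plain accuracy here would be 3/5, but the weighted criterion punishes the lost real email ten times harder.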
But for logistic regression, you're maximizing this cost function, right: the log likelihood minus this regularization term. So you're optimizing J of theta, when what you actually care about is a of theta. So maybe you're optimizing the wrong cost function. [00:30:57] And one way to change the cost function would be to fiddle with the parameter lambda, right; that's one way to change the definition of J of theta. Another way to change J of theta is to just totally change the cost function you're maximizing, like change it to the SVM objective, right; and then part of that also means choosing the appropriate value for C, okay. [00:31:19] And so there's a second diagnostic which I end up using, which should help you tell: is the problem your optimization algorithm, in other words, is gradient ascent not converging, or is the problem that you're just optimizing the wrong function?
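In symbols, the objective being maximized is roughly J(theta) = sum_i log p(y_i | x_i; theta) - lambda * ||theta||^2, as opposed to the weighted accuracy a(theta) you actually care about. A minimal sketch of J for logistic regression (the lambda value is an arbitrary placeholder of mine):

```python
import numpy as np

# Regularized log-likelihood J(theta) for logistic regression:
#   J = sum_i [ y_i * z_i - log(1 + e^{z_i}) ] - lambda * ||theta||^2,
# where z_i = theta^T x_i. The lambda value is an arbitrary choice.
def J(theta, X, y, lam=0.1):
    z = X @ theta
    log_lik = np.sum(y * z - np.log1p(np.exp(z)))
    return float(log_lik - lam * np.dot(theta, theta))

X = np.array([[1.0, 0.0], [1.0, 2.0], [1.0, -1.0]])   # tiny toy design matrix
y = np.array([0.0, 1.0, 0.0])
print(J(np.zeros(2), X, y))   # at theta = 0 this is 3 * log(1/2) ~= -2.079
```

Fiddling with `lam` changes the definition of J, which is the first of the two ways of changing the cost function mentioned above.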
Right. [00:31:41] And we'll see two examples of this; here's the first example, okay. And here's the diagnostic that can help you figure that out. So just to summarize this scenario, this running example: we have the SVM outperforming logistic regression, but you want to deploy logistic regression. [00:32:01] Let theta-SVM be the parameters learned by an SVM; and instead of writing the SVM parameters as w and b, I'm just going to write the linear SVM (it's a linear kernel) using the logistic regression parameterization, right, so you have a linear set of parameters. And let theta-BLR be the parameters learned by Bayesian logistic regression, which is just, yeah, regularized logistic regression, basically just logistic regression. [00:32:26] So you care about weighted accuracy, and the SVM outperforms Bayesian logistic regression, okay? So this is a one-slide summary of where we are in this
example. [00:32:41] So how can you tell if the problem is your optimization algorithm, meaning that you need to run gradient descent longer to actually maximize J of theta? And, sorry, right: this J of theta is what BLR tries to maximize. So there are two possible hypotheses we want to distinguish between. [00:33:02] One is that the learning algorithm is not actually finding the value of theta that maximizes J of theta, that for some reason gradient ascent is not converging; that would be a problem with the optimization algorithm. For the problem to be with the optimization algorithm means that if only we could have an algorithm that maximizes J of theta, we would do great, but for some reason gradient descent isn't doing well. That's one hypothesis. [00:33:32] The second hypothesis is that J of theta is
just the wrong function to be optimizing; it's just a bad choice of cost function. J of theta is too different from a of theta, so that maximizing J of theta doesn't give you, you know, a classifier that does well on a of theta, which is what you actually care about, okay? [00:33:51] And this is key to the problem setup, so I want to make sure people understand this: raise your hand if this makes sense. Most people, okay, cool, good. Any questions about this problem setup? [00:34:09] Oh, thank you: why not maximize a of theta directly? Because a of theta is non-differentiable; it has, you know, this indicator function in it, so we actually can't. It turns out maximizing a of theta explicitly is NP-hard; we just don't have great algorithms for trying to do that. [00:34:28] Okay, so it turns out there's a diagnostic you can use to distinguish between these two different problems, and here's the diagnostic, which
check the cost function that logistic [00:34:45] is check the cost function that logistic regression is trying to maximize so J [00:34:48] regression is trying to maximize so J and compute that cost function on the [00:34:52] and compute that cost function on the parameters found by the SVM and compute [00:34:56] parameters found by the SVM and compute that cost function on the parameters [00:34:58] that cost function on the parameters found by based on logistic regression [00:35:00] found by based on logistic regression and just see which which value is higher [00:35:02] and just see which which value is higher okay so there are two cases either this [00:35:09] okay so there are two cases either this is greater or this is less than equal to [00:35:12] is greater or this is less than equal to right there just two possible cases so [00:35:15] right there just two possible cases so what I'm going to do is go over case one [00:35:17] what I'm going to do is go over case one and case two corresponding to this [00:35:19] and case two corresponding to this greater than or it's less than equal [00:35:21] greater than or it's less than equal then and let's let's see what that [00:35:23] then and let's let's see what that implies so on the next slide I'm going [00:35:25] implies so on the next slide I'm going to copy over this equation right that's [00:35:28] to copy over this equation right that's that's just a fact that the SVM does [00:35:30] that's just a fact that the SVM does better then based on logistic regression [00:35:32] better then based on logistic regression on our problem so on the next I'm going [00:35:34] on our problem so on the next I'm going to copy over this first equation and [00:35:36] to copy over this first equation and then we're going to consider [00:35:38] then we're going to consider these two cases separately so great - [00:35:40] these two cases separately so great - that would be case one and less than [00:35:42] that would be case one and 
less than or equal to will be case two, okay? So let me copy over these two equations on the next slide. [00:35:46] Right, so that's the first equation that I just copied over here, and this is the greater-than case, case one, okay? So let's see how to interpret this. In case one, J of theta-SVM is greater than J of theta-BLR, meaning that whatever the SVM was doing, it found a value for theta, which we've written as theta-SVM, and theta-SVM has a higher value on the cost function J than theta-BLR. [00:36:31] But Bayesian logistic regression was trying to maximize J of theta, right? I mean, Bayesian logistic regression is just using gradient descent to try to maximize J of theta. And so, under case one, this shows that whatever the SVM was doing, whatever your buddy implementing the SVM did, they managed to find a value for theta that actually achieves a higher value of J of theta than your implementation of Bayesian
logistic regression. [00:37:00] So this means that theta-BLR fails to maximize the cost function J, and the problem is with the optimization algorithm, okay? So that's case one. [00:37:13] Case two: again, I'm just copying over the first equation, right, because this is just part of our problem setup; but in case two the second line now has a less-than-or-equal sign, okay? So let's see how to interpret this. If we look at the second equation, right, the less-than-or-equal-to sign, it looks like,
excuse me, it looks like Bayesian logistic regression did a better job than the SVM of maximizing J of theta, right? So, you know, you told Bayesian logistic regression to maximize J of theta, and by golly, it found it: it found the value of theta that achieves a higher value of J of theta than whatever your buddy did using an SVM implementation. So it actually did a good job of trying to find the value of theta that drives up J of theta as much as possible. [00:38:17] But if you look at these two equations in combination, what we have is that the SVM does worse on the cost function J, but it does better on the thing you actually care about, a of theta. So what these two equations in combination tell you is that having the best, the highest, value for J of theta does not correspond to having the best possible value for a of theta. [00:38:49] So that tells you that maximizing J of theta doesn't mean you're doing a good job on a of theta.
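The whole diagnostic can be sketched in a few lines (the parameter vectors below are hypothetical stand-ins for theta-SVM and theta-BLR, and J is the regularized log-likelihood from earlier): given that the SVM already wins on a(theta), comparing J at the two solutions picks between the two hypotheses.

```python
import numpy as np

# Given: a(theta_SVM) > a(theta_BLR). The diagnostic compares J at both.
def J(theta, X, y, lam=0.1):
    z = X @ theta
    return float(np.sum(y * z - np.log1p(np.exp(z))) - lam * np.dot(theta, theta))

def diagnose(j_svm, j_blr):
    if j_svm > j_blr:
        # Case 1: the SVM found a higher J than BLR's own optimizer did,
        # so gradient ascent is failing -> fix the optimization ALGORITHM.
        return "optimization algorithm"
    # Case 2: BLR maximizes J better yet still loses on a(theta), so J is
    # the wrong thing to maximize -> fix the optimization OBJECTIVE.
    return "optimization objective"

# Hypothetical numbers only, to exercise the comparison:
rng = np.random.default_rng(3)
X = np.c_[np.ones(50), rng.normal(size=(50, 2))]
y = (rng.random(50) < 0.5).astype(float)
theta_svm = np.array([0.1, 0.8, -0.5])   # stand-in for the SVM's parameters
theta_blr = np.array([0.0, 0.3, -0.2])   # stand-in for BLR's parameters
print(diagnose(J(theta_svm, X, y), J(theta_blr, X, y)))
```

The cheapness is the point: two evaluations of J settle which of the two expensive fixes is worth attempting.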
And therefore maybe J of theta is not such a good thing to be maximizing, because maximizing it doesn't actually give you the result you really care about. [00:39:02] So under case two, you can be convinced that J of theta is just not the best function to be maximizing, because getting a high value of J of theta doesn't get you a high value of what you actually care about; and so the problem is with the objective function of the maximization problem, and maybe we should just find a different function to maximize, okay? So, um, any questions about this? [00:39:54] Yeah, let me come back to that; yeah, it's a complicated answer. All right, actually, let's do this first. So, all right: for these four bullets, does it fix the optimization algorithm, or does it fix the optimization objective? First one: does it fix the optimization algorithm, or
Cool. Second one... oh, I don't know what's wrong with this thing, it's so strange. Okay, right: does it fix the optimization algorithm, or fix the optimization objective? [00:40:33] The algorithm, right. So Newton's method still looks at the same cost function J of theta, but in some cases it just optimizes it much more efficiently. [00:40:42] Um, this is a funny one. Usually you fiddle with lambda to trade off bias and variance, right? This is one way to change the optimization objective, although usually you change lambda just to trade off bias and variance rather than for this, right? And then trying to use an SVM would be one way to totally change the optimization objective. Okay. [00:41:07] So, to answer the question from just now: when you find you have the wrong optimization objective, there isn't always an obvious thing to do. Sometimes you have to brainstorm a few ideas; there isn't always one obvious thing to try.
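To make the Newton's-method bullet concrete, here is a toy sketch (my own, not from the lecture): gradient ascent and Newton's method below maximize the same one-parameter logistic-regression log-likelihood J(theta); the data, step size, and iteration counts are all invented. Newton's method fixes the optimization algorithm, not the objective, in the sense that it reaches the same maximizer in far fewer steps.

```python
import numpy as np

# Toy one-parameter logistic regression. Both methods maximize the SAME
# objective J(theta); Newton's method just gets there in fewer iterations.
# Data, learning rate, and iteration counts are invented for illustration.
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])  # deliberately not separable

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad(theta):
    # dJ/dtheta of the log-likelihood sum(y*log(h) + (1-y)*log(1-h))
    return float(np.sum((y - sigmoid(theta * x)) * x))

def hess(theta):
    # d2J/dtheta2; always negative here, so J is concave
    p = sigmoid(theta * x)
    return float(-np.sum(p * (1.0 - p) * x * x))

theta_ga = 0.0                 # gradient ascent: many small steps
for _ in range(1000):
    theta_ga += 0.1 * grad(theta_ga)

theta_nt = 0.0                 # Newton's method: a handful of steps
for _ in range(10):
    theta_nt -= grad(theta_nt) / hess(theta_nt)

print(theta_ga, theta_nt)      # both land on the same maximizer
```

Changing lambda in a regularized objective, or swapping in an SVM's hinge loss, would instead change what J is; here only the route to the optimum changes.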
But at least it tells you that that category of things, trying out different optimization objectives, is worth exploring, right? All right. [00:41:33] So let's go through a more complex example that'll, you know, incorporate some of these. (What's wrong... I despair of my laptop and wonder why life is so strange. This is what I can do.) [00:41:55] All right, let's go through a more complex example that will illustrate some of these concepts we've been going through, and just let you see another example of these things. Oh, and I find that, um, one thing I've learned as a teacher, you know: one of the ways for you to become good at this, right, is to go, you know, work in a good AI group for five years, right? Because when you work in a good AI group for several years, then you have seen, you know, ten projects, and that lets you gain that experience.
But it turns out it takes, I don't know, depending on what AI group you work in... if you work on a different project every year, then in five years I guess you've worked on five projects or something, I actually don't know, or maybe ten projects or something. [00:42:42] But one of the reasons, um, the way I try to explain this, to actually give you specific scenarios like this, is that, you know, my PhD students and I actually spent many years working with the Stanford autonomous helicopter, and I'm trying to distill the key lessons down for you, so that you don't need to work on a project for years to gain this experience, but to give you some approximation to this knowledge in maybe twenty minutes. [00:43:07] The twenty minutes won't give you the depth of three years of experience, but I can at least summarize the key lessons.
That way you can learn from experience that others took years to develop. Um, all right. So, uh, [00:43:21] this helicopter sits in my office, but if you go to my office and, you know, grab this helicopter, and we ask you to write a piece of code to make it fly by itself, to use a learning algorithm to make it fly by itself, how do you go about doing so? So it turns out a good way to make a helicopter fly by itself is to do the following. [00:43:45] Um, step one is build a computer simulator for the helicopter. So, you know, an actual simulator, right, like a video-game simulator of a helicopter. Um, the advantage of using, you know, say, a video-game simulator of a helicopter is that you can crash it a lot in simulation, you know, which is cheap, whereas crashing a helicopter in real life is slightly dangerous and also more expensive. [00:44:10] So, step one: build a simulator of the helicopter.
Step two: choose the cost function, and for today I'm just using a relatively simple cost function, which is squared error. So you want the helicopter to fly at the position x desired, and your helicopter instead, you know, wanders off to some other place x, so let's use the squared error to penalize it, right? [00:44:34] When we talk about reinforcement learning towards the end of this quarter, we'll go through this same example again using reinforcement learning terminology, so you understand it at a slightly deeper level; we'll go over this exact same example after you've learned about reinforcement learning, but today we'll just go over a slightly simplified, very simplified, version. [00:44:52] And so, step three, run a reinforcement learning algorithm, and what the reinforcement learning algorithm does is try to minimize that cost function J of theta, and so you learn some set of parameters, theta subscript RL, for controlling the helicopter.
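As a rough illustration of that squared-error cost (a hypothetical sketch; the function name, the 2-D toy trajectory, and the hover target below are all made up, not the course's actual code), J here is just the mean squared distance between where the helicopter is and where you wanted it:

```python
import numpy as np

# Hedged sketch of a squared-error trajectory cost: penalize the actual
# positions x_t for wandering away from the desired position x_desired.
def j_squared_error(trajectory, x_desired):
    """Mean squared distance from x_desired over a (T, d) trajectory."""
    diffs = trajectory - x_desired          # x_t - x_desired at each step
    return float(np.mean(np.sum(diffs ** 2, axis=1)))

# Tiny made-up example: a 3-step trajectory in 2-D drifting off a hover point
x_desired = np.array([0.0, 10.0])
traj = np.array([[0.0, 10.0],
                 [0.5, 10.0],
                 [1.0, 9.0]])
print(j_squared_error(traj, x_desired))     # -> 0.75
```

A reinforcement learning algorithm would then search for the parameters theta_RL whose controller makes this number small.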
Right, and when we talk about reinforcement learning, you know, you'll see all this redone with proper reinforcement learning notation, where J is a reward function, theta RL is a control policy, and so on, but don't worry about that for now. [00:45:24] Um, so let's say you do this, and the resulting controller, right, the way it flies the helicopter, gives much worse performance than your human pilot. You know, the helicopter wobbles all over the place and doesn't quite stay where you were hoping it would. So what do you do next, right? [00:45:46] Well, here are some options, corresponding to the three steps above. You could work on improving your simulator. It turns out, even today... you know, we've had helicopters for, what, I don't know, I think we've had commercial helicopters since around 1950 or so, so the technology has been around for many decades now.
But the airflow around a helicopter is very complicated, and even today there are actually some details of how the air flows around the helicopter that, you know, the aerodynamics textbooks, written by the aero-astro people, explicitly say our current answers cannot fully explain. So a helicopter is incredibly complicated, and there's almost unlimited headroom for building better and more accurate simulators of a helicopter. So maybe you want to do that. [00:46:32] Or maybe you think the cost function is messed up. You know, maybe squared error isn't the best metric, right? And it turns out, you know, the way a helicopter works, a helicopter has a tail rotor that blows wind to one side, right? Because the main rotor spins in one direction; if it only had a main rotor, then the body would spin in the opposite direction, a kind of equal-and-opposite reaction. [00:46:56] But anyway, right: the main rotor spins in one direction, and if it only had the main rotor, the rotor on top, and it just spun that, then the body of the helicopter would spin in the opposite direction.
So that's why you need a tail rotor, to blow air off to one side so the helicopter doesn't spin in the opposite direction. But because of that, it turns out a helicopter staying in place is actually tilted slightly to one side: because the tail rotor blows air in one direction, it's pushing you off to one side, so you have to tilt the helicopter in the opposite direction, so that the main rotor also blows air slightly to one side, the tail rotor blows air to the other side, and you actually stay in place, right? [00:47:31] So a helicopter hovering is actually asymmetric; left and right are not the same. So because of this complication, maybe squared error isn't the best error measure, because, you know, your optimal orientation is actually not zero, right? So maybe you should modify the cost function.
so so maybe you should multiply the [00:47:51] so so so maybe you should multiply the cost function or maybe you want to [00:47:54] cost function or maybe you want to modify the reinforcement learning [00:47:56] modify the reinforcement learning algorithm because you secretly suspect [00:47:58] algorithm because you secretly suspect that your algorithm is not doing a great [00:48:01] that your algorithm is not doing a great job of minimizing that cost function [00:48:04] job of minimizing that cost function great that is not actually finding the [00:48:07] great that is not actually finding the value of theta that absolutely minimizes [00:48:09] value of theta that absolutely minimizes J of theta so it turns out that each one [00:48:15] J of theta so it turns out that each one of these topics can easily be a PhD [00:48:18] of these topics can easily be a PhD thesis and you could definitely work for [00:48:20] thesis and you could definitely work for six years on anyone [00:48:21] six years on anyone these topics and the problem is you know [00:48:26] these topics and the problem is you know so actually I actually know someone that [00:48:29] so actually I actually know someone that wrote a PhD thesis on write improving [00:48:32] wrote a PhD thesis on write improving helicopter simulator right but the [00:48:35] helicopter simulator right but the problem is maybe a helicopter simulator [00:48:37] problem is maybe a helicopter simulator is good enough and you can spend six [00:48:39] is good enough and you can spend six years improving your helicopter [00:48:42] years improving your helicopter simulator but will that actually get you [00:48:44] simulator but will that actually get you there is and you can write and you can [00:48:45] there is and you can write and you can write a PhD season together PhD doing [00:48:47] write a PhD season together PhD doing that maybe but if you go is not just a [00:48:49] that maybe but if you go is not just a very PhD thesis and 
Um, so what I'd like to do is describe to you a set of diagnostics that allows you to use this sort of logical, step-by-step reasoning to debug which of these three things is what you should actually be spending time on, right? [00:49:17] So is it possible for us to come up with a debugging process to reason logically, so as to select one of these things to work on with conviction, and then be relatively confident that it is a useful thing to work on? All right, so here's how we're going to do it. [00:49:35] So just to summarize the scenario, right: the controller given by theta RL flies poorly. Right, so this is how I would reason through a learning algorithm. So suppose, suppose all of these things were true, again corresponding to the three steps on the previous slide.
Suppose the helicopter simulator is accurate; suppose, you know, the learning algorithm correctly minimizes the cost function; and suppose J of theta is a good cost function. Right, if all of these things were true, then the learned parameters should fly well on the actual helicopter, right? [00:50:20] But it doesn't fly well on the helicopter, so one of these three things is false, and our job is to figure out, to identify, at least one of these three statements, one, two, or three, that is false, because that lets you sink your teeth into something to work on, right? [00:50:46] And I think, to make an analogy to more conventional software debugging: you have a big, complicated program, and for some reason your program crashes, you know, the code goes down or whatever. If you can isolate this big, complicated program down to the one component that crashes, then you can focus your attention on that component.
That component, you know, crashes for some reason, and you try to find the bug there, right? And so instead of trying to look over a huge codebase, if you can do a binary search, or try to isolate the problem to a smaller part of your codebase, then you can focus your debugging efforts on that part of the codebase, try to figure out why it crashes, and then fix that first. [00:51:25] And after you fix that, it might still crash; then there might be a second problem to work on. But at least you know that trying to fix the first bug seems like a worthwhile thing to do, right? [00:51:37] So what we're going to do is come up with, sort of by design, a set of diagnostics to isolate the problem to one of these three components. Okay, so the first step is: let's look at how well the algorithm flies in simulation, right? So what I said just now was, you ran the algorithm, and it resulted in a set of parameters that doesn't do well on your actual helicopter.
So the first thing I would do is just check how well this thing even does in simulation, right? And there are two possible cases. [00:52:12] If it flies well in simulation but doesn't do well in real life, that means something's wrong with the simulator, right? And it means it's actually worth working on the simulator, because, you know, if it's already working well in the simulator, I mean, what else could you expect the learning algorithm to do? Right, you know, you told the reinforcement learning algorithm to go and fly well in the simulator, because it's just training in simulation; it's already doing well in the simulator, so there's not much to improve on there, or at least it's hard to improve on that. [00:52:44] But if you found that the learned controller flies well in just the simulator and not in real life, then that means the simulator isn't matching real life well.
one simulator but not in real life [00:52:51] just one simulator but not in real life then this means that the simulator isn't [00:52:55] then this means that the simulator isn't matching real life well and so dish that [00:52:58] matching real life well and so dish that does strong evidence there's strong [00:53:00] does strong evidence there's strong grounds for you to spend some time to [00:53:02] grounds for you to spend some time to improve your simulator yeah yeah right [00:53:12] improve your simulator yeah yeah right is that it just repeats another camera [00:53:14] is that it just repeats another camera is it is ever the case that it flies [00:53:16] is it is ever the case that it flies values away to about one roll life I [00:53:18] values away to about one roll life I wish that happen [00:53:23] very rarely I I think if that happens I [00:53:27] very rarely I I think if that happens I would I would still work on improving [00:53:28] would I would still work on improving the simulator so there's actually once [00:53:32] the simulator so there's actually once an era where that happens it turns out [00:53:33] an era where that happens it turns out that when we train this helicopter in [00:53:39] that when we train this helicopter in the simulator or really any robot [00:53:40] the simulator or really any robot simulator we often add a lava noise to [00:53:42] simulator we often add a lava noise to the simulator because one lessons of [00:53:44] the simulator because one lessons of learn is that if your simulator is noisy [00:53:46] learn is that if your simulator is noisy customizers are always wrong right I [00:53:48] customizers are always wrong right I mean any digital simulation is only an [00:53:50] mean any digital simulation is only an approximation in real world so we tend [00:53:51] approximation in real world so we tend to have a lot of noise so all of our [00:53:53] to have a lot of noise so all of our simulators because we think that the 
And so we tend to throw a lot of noise into our simulators. And so one case where that does happen is when we find we threw too much noise at it in simulation, and then that might be a sign we should dial back the noise a bit. Yeah. All right, cool. [00:54:26] Oh, so yeah, so this first diagnostic tells you whether you should work on improving the simulation; I think if there's a big mismatch between simulation performance and real-world performance, that's a good sign that, you know, you should improve the simulation. [00:54:42] Second, um, this is actually very similar to the diagnostic we used on the spam example, you know, the one based on logistic regression versus the SVM. So what we're going to do is measure this equation.
And this is, again, very similar to our previous equation, which is: take the cost function, similar to the previous example, take the cost function J that reinforcement learning is trying to minimize, right, J of theta was a squared error, right? So take the cost function that reinforcement learning was trying to minimize, and see if the human achieves a better squared error than the reinforcement learning algorithm. [00:55:30] And just to be clear, you know, this human flies better; so let's measure the human's performance on this squared-error cost function and see which one does better. [00:55:40] So there are two cases: that equation will be either less than, or greater than or equal to. So case one is J of theta human is less than J of theta RL; that would be this case, and that tells you that the problem is with the reinforcement learning algorithm.
Right, somehow the human achieves a lower squared error, and so the learning algorithm is not finding the best possible squared error; there is some other controller, as evidenced by whatever the human is doing, that actually achieves a lower cost, right? [00:56:26] So in this case, we think the learning algorithm, the reinforcement learning algorithm, is not doing a good job minimizing J, and we should work on the reinforcement learning algorithm. [00:56:37] The other case would be if the sign of the inequality is the other way around, right? Now in this case, you can infer that the problem is in the cost function, because what happens here is the human is flying better than your reinforcement learning algorithm, but the human is achieving what looks like a worse cost than your reinforcement learning algorithm. So what this tells you is that
minimizing J of theta does [00:57:06] you is that minimizing J of theta does not correspond to flying well right your [00:57:09] not correspond to flying well right your learning algorithm achieves a better [00:57:10] learning algorithm achieves a better value for J of theta you know J of theta [00:57:13] value for J of theta you know J of theta are out is actually smaller than one of [00:57:15] are out is actually smaller than one of the human is doing so the reinforcement [00:57:17] the human is doing so the reinforcement learning algorithm as far as it knows [00:57:19] learning algorithm as far as it knows this doing a great job because it's [00:57:21] this doing a great job because it's finding a value of theta where J of [00:57:23] finding a value of theta where J of theta is really really small but in this [00:57:25] theta is really really small but in this last case you know that finding such a [00:57:31] last case you know that finding such a small value of J of theta doesn't [00:57:33] small value of J of theta doesn't correspond to flying well off because a [00:57:35] correspond to flying well off because a human doesn't achieve such a good value [00:57:37] human doesn't achieve such a good value in the cost function but the helicopter [00:57:38] in the cost function but the helicopter actually just looks better was flying in [00:57:40] actually just looks better was flying in a more satisfactory way and that tells [00:57:43] a more satisfactory way and that tells you that the squared error cost function [00:57:45] you that the squared error cost function is not the right cost function for what [00:57:49] is not the right cost function for what flying after it events right and so um [00:57:53] flying after it events right and so um through this set of Diagnostics you [00:57:58] through this set of Diagnostics you could decide which one of these three [00:58:00] could decide which one of these three things improving the simulator improving [00:58:03] 
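The comparison described here can be sketched in a few lines of Python. This is a hypothetical illustration, not code from the course: `j_squared_error` stands in for whatever cost J the reinforcement learning algorithm minimizes, and the trajectories are made-up one-dimensional stand-ins for real helicopter state sequences.

```python
import numpy as np

def j_squared_error(trajectory, target):
    # The cost J(theta) the RL algorithm is minimizing: mean squared
    # deviation of the flown trajectory from the desired trajectory.
    return float(np.mean((np.asarray(trajectory) - np.asarray(target)) ** 2))

def rl_diagnostic(j_human, j_rl):
    # Assumes we already know the human pilot visibly flies better.
    if j_human < j_rl:
        # Case 1: a better controller exists (the human's), so the RL
        # algorithm is failing to minimize J -- work on the RL algorithm.
        return "improve the RL algorithm"
    # Case 2: the RL controller achieves a lower (or equal) cost yet flies
    # worse, so minimizing J does not correspond to flying well -- work on
    # the cost function.
    return "improve the cost function"

# Toy example: a hypothetical hover target, with the human tracking it
# more tightly than the learned controller.
target = np.zeros(100)
human_traj = np.random.RandomState(0).normal(0.0, 0.1, 100)
rl_traj = np.random.RandomState(1).normal(0.0, 0.3, 100)
verdict = rl_diagnostic(j_squared_error(human_traj, target),
                        j_squared_error(rl_traj, target))
```

Here the human's tighter tracking yields a lower squared error, so the diagnostic points at the RL algorithm; flipping the two trajectories would point at the cost function instead.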
[00:58:11] And what actually happens in this particular project, and what often happens in machine learning applications, is you run this set of diagnostics, and this actually happened when we were working on this helicopter: we'd run the set of diagnostics, and one week we'd say, yep, the simulator is the problem, let's work on that, and we'd improve the simulator. After a couple weeks of work we'd run these diagnostics again and say, oh, it looks like the simulator is now good enough, and maybe there's a problem with the RL algorithm; then we'd work on that and improve it. And after a while we'd say, oh, that's also good enough, and the problem is in the cost function. And sometimes the location of the most acute problem shifts: after you've cleared out one set of problems, it might be the case that now the bottleneck is the simulator again. And so I often use this workflow to constantly drive prioritization for what to work on next.

[00:59:08] And to answer the question just now about how you find a new cost function: it turns out finding a new cost function is actually not that easy. One of my own former PhD students, Adam Coates, through this type of process, realized that finding a good cost function is actually really difficult, because if you want a helicopter to fly a maneuver, like fly at speed and make a banked turn, how do you mathematically define what an accurate banked turn means? It's really difficult to write down an equation that specifies what is a good way to fly like that, or what a good turn even is. So he wound up writing a research paper, one of the best application papers I've seen, on how to define a good cost function. It's actually pretty complicated. But the reason he did it, and it was a good use of his time, was that running diagnostics like these gave us confidence that this was actually a worthwhile problem, and that resulted in, you know, making real progress.

[01:00:07] Um, any questions about this? All right, cool. Let's not show this slide; that's fine, you guys saw some of these earlier.

[01:00:44] So in addition to these specific diagnostics of bias versus variance, and of the optimization algorithm versus the optimization objective... oh, sorry, when we cover RL I want to just go through that example one more time,
so you see everything we just saw again after you've learned about reinforcement learning later in this course. Okay.

[01:01:04] Now, in addition to these types of diagnostics for how to debug learning algorithms, there's one other set of tools you'll find very useful, which is error analysis tools. This is another way for you to figure out what's working and what's not working in the learning algorithm. So let's go through a motivating example. Let's say you're building, you know, a security system, so when someone walks in front of a door, you unlock the door or not based on whether or not that person is authorized to enter, right? And there are a lot of machine learning applications where it's not just one learning algorithm; instead you have a pipeline that strings together many different steps.

[01:01:55] So how do you build a face recognition algorithm to decide if someone approaching your front door is authorized, to unlock the door? Well, here's something you could do. You start with a camera image like this, and then you could do pre-processing to remove the background, so all that complicated colored background, let's get rid of that. And it turns out that when you have a camera against a static background, you can actually do this, up to a little bit of noise, relatively easily, because if you have a fixed camera that's just mounted, you know, on your doorframe, it always sees the same background. So you can just look at what pixels have changed, and keep only the pixels that have changed, right? Because this camera always sees that gray background and some brown bench in the back, just looking at what pixels have changed does the background removal. This is actually feasible: look at what pixels have changed, and keep the pixels that have changed relative to the background.

[01:02:55] And after getting rid of the background, you could run a face detection algorithm. Then, after detecting the face, it turns out, and I've actually worked on a bunch of face detection and face recognition systems, that for some of the leading face recognition systems, though it depends on the details, the appearance of the eyes is a very important cue for recognizing people. This is why, if you cover your eyes, it's much harder to recognize people; eyes are very distinctive.
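The pixel-differencing idea for a fixed camera can be sketched as below. This is a toy illustration under assumed conventions (grayscale frames as NumPy arrays, a stored reference image of the empty scene, an arbitrary change threshold), not the actual system from the story.

```python
import numpy as np

def remove_background(frame, reference, threshold=25):
    # Keep only the pixels that changed relative to the stored reference
    # image of the empty scene; zero out everything that looks unchanged.
    diff = np.abs(frame.astype(np.int32) - reference.astype(np.int32))
    return np.where(diff > threshold, frame, 0).astype(np.uint8)

# Toy 8x8 grayscale scene: a flat background, then a frame where a bright
# "person" block has appeared in front of it.
reference = np.full((8, 8), 100, dtype=np.uint8)
frame = reference.copy()
frame[2:6, 3:5] = 200            # the new foreground object
foreground = remove_background(frame, reference)
```

A waving tree violates the static-background assumption, which is exactly why real background-removal algorithms get so much more complicated, as the story below illustrates.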
[01:03:30] So people will segment out the eyes, segment out the nose and the mouth, and then feed these features into some other algorithm, say logistic regression, that then, you know, finally outputs a label saying whether this is a person you're authorized to open the door for. So many learning applications have a complicated pipeline like this, of different components that have to be strung together. And if you read the newspaper articles, or if you read research papers in machine learning, often the research papers will say, oh, we built a machine translation system where we trained on a gazillion sentences found on the internet, and it does great, and it's a pure end-to-end system: there's like one learning algorithm that sucks in an input, sucks in an English sentence, and spits out the French sentence or something. So that's like one learning algorithm.

[01:04:33] It turns out that for a lot of practical applications, if you don't have a gazillion examples, you end up designing much more complex machine learning pipelines like this, where it's not just one monolithic learning algorithm; instead there are many different smaller components. And I think that, you know, having a lot of data is great, right, I love having more data, but big data has also been a little bit overhyped, and there are a lot of things you can do with small datasets as well. In the teams I work with, we find that with a relatively small dataset you can often still get great results; my teams often get great results with 100 images, a hundred training examples or something. But when you have small data, it often takes more insightful design of machine learning pipelines like this.
[01:05:31] Now, when we have a machine learning pipeline like this, here's one thing you want to do. You build a pipeline like this and it doesn't work, right? There's this common workflow: you build something, it doesn't work, so you want to debug it. So in order to decide which part of the pipeline to work on, it's very useful if you can look at the error of your system and try to attribute the error to the different components, so you can decide which component to work on next, right?

[01:06:04] And here I'll tell you a true story about the pre-processing background removal step. Since you're getting rid of the background, it turns out there are a lot of details in how to do background removal. For example, the simple way to do it is to look at every pixel and just see which pixels have changed. But it turns out that if there's a tree in the background that waves a little bit, because the wind moves the tree and blows the leaves and branches around, then sometimes the background pixels do change a little bit. And so there are actually really complicated background removal algorithms that try to model, basically, the trees and the bushes moving around a little bit in the background, so that even though the pixels of the tree move around, that part of the background still gets removed. So for background removal there are simple versions, where you just look at each pixel and see how much it has changed, and there are incredibly complicated versions.

[01:06:57] So I actually know someone who was trying to work on a problem like this, and they decided to improve the background removal algorithm; this person actually, literally, wrote a PhD thesis on background removal. And I'm glad he got a PhD, but when I look at the problem he was actually trying to solve, I don't think it actually moved the needle, you know? So, you can still publish a paper, and it was technically innovative, I thought it was very good technical work, but if your goal is to build a better face recognition system, then I would carefully ask which components you should actually spend your time working on.

[01:07:49] So here's what you can do with error analysis. Say your overall system has eighty-five percent accuracy. What I would do is go into your dev set, your development set or hold-out cross-validation set, right, and for every one of the examples in the dev set, I would plug in the ground truth
for the background, meaning that rather than using some, you know, approximate heuristic algorithm for roughly cleaning out the background, which may or may not work that well, I would just use Photoshop, and for every example in the dev set I would give it the perfect background removal. So imagine that, instead of some noisy algorithm trying to remove the background, this step of the pipeline just had perfect performance, right? You can give it perfect performance on your dev set just by using Photoshop to tell it: this is the background, this is the foreground. And let's say that when you plug in this perfect background removal, the accuracy improves to eighty-five point one percent.

[01:08:57] Then you can keep going from left to right in this pipeline. Instead of using some learning algorithm to do face detection, just go in, and for the dev set, have the face detection algorithm cheat, right? Have it just memorize the right location of the face for each dev set example, so it gets a perfect result. So when I shade in these boxes, that means I'm giving that component a perfect result. So let's go in and, on the dev set, give it perfect face detection for every single example, then look at the final output and see how that changes the accuracy of the final output. And then do the same for these components: eye segmentation, nose segmentation, mouth segmentation. You do this one at a time, and then finally, for the logistic regression, if you give it the perfect output, your accuracy should be a hundred percent, right? So now what you can do is look at the sequence of steps and see which one gave you the biggest gain.

[01:10:06] And it looks like, in this example, when you gave it perfect face detection, the accuracy improved from eighty-five point one to ninety-one percent, so roughly a six percent improvement. That tells you that if only you could improve your face detection algorithm, maybe your overall system could get better by as much as six percent. So this gives you faith that maybe it's worth improving your face detection component. In contrast, this tells you that even if you had perfect background removal, it's only 0.1 percent better, so maybe don't spend too much time on that. And it looks like when you gave it perfect eye segmentation, it went up another four percent, so maybe that's another good project to prioritize, right? Um, and if you're in a team, one common structure would be to do this type of analysis.
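The bookkeeping behind this plug-in-ground-truth analysis is a simple difference table. Here is a sketch using the accuracies from the lecture where given (85%, 85.1%, 91%, and the roughly four-percent eye-segmentation gain), with an invented placeholder number for the nose/mouth stage:

```python
def component_gains(baseline, cumulative):
    # `cumulative` lists (component, accuracy) after plugging in ground
    # truth for that component and every component to its left in the
    # pipeline. The gain attributed to each component is the jump over
    # the previous accuracy in the sequence.
    gains, prev = {}, baseline
    for component, acc in cumulative:
        gains[component] = acc - prev
        prev = acc
    return gains

gains = component_gains(85.0, [
    ("background removal", 85.1),        # from the lecture
    ("face detection", 91.0),            # from the lecture
    ("eye segmentation", 95.0),          # lecture's "another four percent"
    ("nose/mouth segmentation", 97.0),   # placeholder
    ("logistic regression", 100.0),      # perfect final stage -> 100%
])
most_promising = max(gains, key=gains.get)
```

The component with the biggest single jump (here face detection, at 5.9 points) is the one most worth prioritizing.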
And then have some people work on face detection, some people work on eye segmentation; you can usually do a few things in parallel if you have a larger team. But at least this should give you a sense of the relative priority of the different things. Question?

[01:11:29] Yeah, right, so the question is: do you do this cumulatively, giving it perfect eye segmentation and then adding on top perfect nose segmentation, or do you give it perfect eye segmentation, then take that away and give it perfect nose segmentation instead? The way I presented it here, it's done cumulatively. And it turns out that once you give it perfect results in the later stages, maybe the earlier stages don't matter that much anymore, so that's one pattern. But it turns out you could do it either way: for the eyes, nose, and mouth, you could do it cumulatively or one at a time, and you'll probably get relatively similar results. No guarantee, you might get different results in terms of conclusions, but to the extent that you're worried that cumulative versus one-at-a-time might give you different results, I would just do it both ways. And I think this error analysis is not a hard mathematical rule, if that makes sense; it's not that you do this and then there's a formula that tells you, okay, work on face detection. I think this should be married with judgment about, you know, how hard you think it is to improve face detection versus eye segmentation. But this at least gives you a sense of prioritization, and it's worth doing it in multiple ways if you're concerned about a discrepancy between the cumulative and individual versions.
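The one-at-a-time variant just described swaps the cumulative table for per-component measurements against the same baseline. Again a hypothetical sketch with invented numbers:

```python
def individual_gains(baseline, solo):
    # `solo` lists (component, accuracy) when ONLY that component is given
    # ground truth and the rest of the pipeline is left untouched, so each
    # gain is measured in isolation against the same baseline.
    return {component: acc - baseline for component, acc in solo}

# Hypothetical per-component numbers; with interacting stages these need
# not match the cumulative analysis, which is why it can be worth running
# the analysis both ways and comparing conclusions.
gains = individual_gains(85.0, [
    ("background removal", 85.2),
    ("face detection", 90.5),
    ("eye segmentation", 88.0),
])
```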
versions. [01:12:57] Um, so when you have a complex machine learning pipeline, this type of error analysis helps you break down the error, to attribute the error to different components, which lets you focus your attention on what to work on. [01:13:15] Oh, all right, yeah: if you give it perfect face detection and then your error jumps up, what does that mean? Um, it's not impossible for that to happen; it would be quite rare. At a high level, what I would do is go in and try to figure out what's going on. Actually, I wouldn't ignore that. So this is another thing I see sometimes: a team discovers a weird phenomenon like that and they just ignore it and move on. I wouldn't do that. Whenever you find one of these weird things, I wouldn't gloss over it; I would go in and figure out what's going on. Does this make sense? It's like debugging software.
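The cumulative version of this error analysis can be sketched in code. This is a minimal illustration only: the function name, the stage names, and every accuracy number below are made up to mirror the face-recognition pipeline example, not measurements from the lecture.

```python
# Sketch of cumulative error analysis for a pipeline (illustrative only).
def cumulative_error_analysis(baseline_accuracy, accuracy_with_perfect):
    """Given overall accuracy after cumulatively plugging in ground truth
    for each stage, report each stage's marginal accuracy gain."""
    report = []
    prev = baseline_accuracy
    for stage, acc in accuracy_with_perfect:
        report.append((stage, acc - prev))  # gain from perfecting this stage
        prev = acc
    return report

# Hypothetical accuracies measured after making each stage
# (and all the stages before it) perfect:
measurements = [
    ("preprocessing",       85.1),
    ("face detection",      91.0),
    ("eye segmentation",    95.0),
    ("nose segmentation",   96.0),
    ("mouth segmentation",  97.0),
    ("final classifier",   100.0),
]
for stage, gain in cumulative_error_analysis(85.0, measurements):
    # A large marginal gain suggests that stage is worth working on.
    print(f"{stage:20s} +{gain:.1f}%")
```

As the lecture says, the big jumps in the table (here, the hypothetical face-detection stage) point at where to focus attention; the formula itself is just successive differences.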
If you're debugging a piece of software and whenever you move your mouse over some button, some random pixel color changes, you go, huh, that's weird. And then some people just ignore it and say, oh well, the user won't see this. [01:14:19] So what you're describing is quite rare but not impossible, and I don't have an easy recipe for how to figure out what's going on, but I would want to figure out what's going on. All right, so one last thing before we break. [01:14:35] So error analysis helps figure out the difference between where you are now, say 85% overall system accuracy, and 100%, right? So it tries to explain the difference between where you are and, you know, perfect performance. There's a different type of analysis, called ablative analysis, which figures out the difference between where you are and something much worse. So here's what I mean. [01:14:57] Um, so let's say that you built a good anti-spam
classifier by adding lots of clever features. So this is logistic regression, right, and, you know, spelling correction, because spammers try to misspell words to mess up the tokenizer, to make spammy words not look like spam. Sender host features: what machines did the email come from? Email header features. You could have a parser from NLP to parse the text, use a JavaScript parser to understand it, or even go and fetch the web pages that the email refers to and parse those. [01:15:39] And the question is, how much did these components really help? And it turns out, if you're writing a research paper, you know, sometimes your result is to say, hey, look, I built a great spam classifier, and that's okay, that's a nice result to have. But if you can explain to your reader, either in a research paper or in a class project report, like a term project, what actually made the difference, that conveys
a lot of insight as well. [01:16:02] So, um, say simple logistic regression without all of these clever features gets ninety-four percent performance, and with the addition of all these clever features you get ninety-nine percent accuracy. [01:16:19] So in ablative analysis, what you do is remove the components one at a time to see how it breaks, right? So just now we were adding to the system by making components perfect; with error analysis it's how it improves. Here we're going to remove things one at a time. (I did not mean to erase that; let me figure out what's going on with PowerPoint, all right.) Remove things one at a time to see how it breaks. So let's say you remove spelling correction, and with that set of features the accuracy goes down a bit. Then let's remove the sender host features, remove the email header features, and so on, until, when you've removed all of these features, you end up
there. [01:17:03] And again, you could do this cumulatively, or remove one and put it back, remove one and put it back; you could do it both ways and see if they give you slightly different insights. And so the conclusion from this particular analysis is that the biggest gap is from the text parser features, because when you removed those, the accuracy went down by four percent. So, you know, that's strong evidence. If you want to publish a paper, you can say, right, text parser features significantly improve spam filter accuracy, and that's a useful level of insight. [01:17:38] And then if you're working on a spam filter for many years, right, and there are really important applications out there where the same team will work on one for many years, this type of analysis gives you intuition about what's important and what's not, and helps you decide what to work on.
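The ablative loop itself can be sketched in a few lines. This is an invented stand-in, not the lecture's actual experiment: `toy_eval` fakes a train-and-evaluate step, and only the 94%-to-99% range and the roughly four-percent text-parser contribution echo the numbers mentioned above; the other contributions are made up.

```python
# Sketch of ablative analysis: start from the full system and remove
# feature groups one at a time (cumulatively), re-evaluating each time.
def ablative_analysis(feature_groups, train_and_eval):
    """Remove feature groups cumulatively; return accuracy after each removal."""
    remaining = list(feature_groups)
    results = []
    for group in feature_groups:
        remaining.remove(group)
        results.append((group, train_and_eval(remaining)))
    return results

# Pretend each feature group adds a fixed bump over a 94% baseline
# (illustrative numbers; in reality you would retrain the classifier).
CONTRIB = {"spelling correction": 0.5, "sender host": 0.25,
           "email header": 0.25, "text parser": 4.0}

def toy_eval(groups):
    return 94.0 + sum(CONTRIB[g] for g in groups)

for removed, acc in ablative_analysis(list(CONTRIB), toy_eval):
    print(f"after also removing {removed:20s} accuracy = {acc:.2f}%")
```

The component whose removal causes the largest drop (here the stand-in text parser features) is the one carrying the most weight, which is exactly the conclusion the lecture draws.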
Maybe you even double down on the text parser features; or maybe the sender host features are computationally expensive to compute, and this tells you that you can just get rid of them without too much harm. And also, if you're publishing a paper or writing a report, this gives much more insight into your results. Okay, all right. [01:18:12] Um, so that's it for error analysis and ablative analysis; I hope this will be useful for your class projects as well. Take one last question. Oh, right. [01:18:21] Oh yeah, there was no systematic way; if you had a systematic way, you'd do that. The other way, the non-cumulative way, would be to remove one component and then put it back, then remove another and put it back. So either way it works. [01:18:37] All right, let's break. The problem set is due tonight, a friendly reminder, and problem set three will be posted in the next, like, several tens of minutes. Okay, thanks everyone.
================================================================================ LECTURE 014 ================================================================================ Lecture 14 - Expectation-Maximization Algorithms | Stanford CS229: Machine Learning (Autumn 2018) Source: https://www.youtube.com/watch?v=rVfZHWTwXSA --- Transcript [00:00:03] All right, um, let's get started. So, um, let's see, a logistical reminder: the class midterm is this Wednesday, and the logistical details you can find at this Piazza post, right? So the midterm will start Wednesday evening; you'll have a fixed window to do it and then submit it online through Gradescope. And because of the midterm, there won't be a section this Friday, okay? [00:00:33] Oh, and the midterm will cover everything up to and including EM, which we'll spend most of today talking about. Don't look so stressed, it'll be fun. All right. [00:00:47] Um, so what I'd like to do today is start our foray into unsupervised learning. So far we've spent a lot of time on supervised learning algorithms, including advice on how to
apply them: algorithms in which you'd have, you know, positive examples and negative examples, and you run logistic regression or an SVM or something to find the line, find the decision boundary between them. [00:01:14] In unsupervised learning, you're given unlabeled data. So rather than being given data with x and y, you're given only x, and so your training set now looks like x(1), x(2), up through x(m), and you're asked to find something interesting about the data. [00:01:37] So the first unsupervised learning algorithm we'll talk about is clustering, in which, given a data set like this, hopefully we can have an algorithm that can figure out that the data set has two separate clusters. And so one of the most common uses of clustering is market segmentation: if you have a website, you know, selling things online, you have a huge database of many different users, and you can run clustering to
decide what the different market segments are. [00:02:05] Right, so there may be, you know, people of a certain age range and a certain gender, people of a different age range or a different level of education, people on the East Coast versus the West Coast versus elsewhere in the country; by clustering, you can group people into different groups, right? [00:02:22] So I want to show you an animation of really the most commonly used clustering algorithm, called k-means clustering. Let me show you an animation of what k-means does, and then we'll write out the math and how you can implement it. So, um, say you're given a data set like this. All of these are unlabeled examples, so they're just x's plotted here, and we want an algorithm to try to find maybe the two clusters here. [00:02:50] The first step of k-means is to pick two points, denoted by the two crosses, called cluster centroids, and the cluster centroids are your
best guess for where the centers of the two clusters you're trying to find are. [00:03:03] And then k-means is an iterative algorithm, and repeatedly you do two things. The first thing is to go through each of your training examples... oh, I'm sorry... oh, okay, thank you; let me know if it happens again. Okay, right, so you have two cluster centroids. So the first thing you do is go through each of your training examples, the green dots, and for each of them, you color it either red or blue depending on which is the closer cluster centroid. So here we've taken every dot and colored it, you know, red or blue depending on which cluster centroid it is closer to. [00:03:37] And then the second thing you do is look at all the blue dots and compute the average, right, just find the mean of all the blue dots, and move the blue cluster centroid there; and similarly, look at all
the red dots, look at only the red dots, and find their mean. (Oh, what's wrong with this? Oh, this thing is being very strange. All right, apparently if I keep moving my mouse it doesn't do that. All right, thank you.) [00:04:02] And then find the mean of all the red dots and move your red cluster centroid there. So let me do that, right: the cluster centroids move as follows, to the means of the red and the blue dots, since it's just the standard arithmetic mean. [00:04:18] And then you repeat again, where you look at each of the dots and color it either red or blue depending on which cluster centroid is closer. So we recolor every point based on, you know, what's closer, and that's the new set of colors. And then the second part of the algorithm was, again: look at the blue dots, find their mean; look at the red dots, find their mean; and then move the cluster centroids over, excuse me,
to that mean, okay. [00:04:50] And so it turns out, if you keep running the algorithm, nothing changes, so the algorithm has converged. So if you look at this picture and you repeatedly color each point red or blue depending on which cluster centroid is closer, nothing changes; and if you repeatedly look at each of the two clusters of colored points, compute their mean, and move the cluster centroid there, nothing changes. So this algorithm has converged, even if you keep on running these two steps, okay? [00:05:19] So, um, let's see, let's write down in math what we just did. [00:05:35] All right, so this is, um, a clustering algorithm, and specifically this is the k-means clustering algorithm. So your data set now does not come with any labels, and so in k-means, step one is: initialize the cluster centroids, right; I'm going to call them mu_1 up through mu_K, randomly. [00:06:13] So this was the step where you plopped down the red cross and the blue cross, and when I did it on the
PowerPoint, you know, I did it as if we'd just chosen these as random vectors. [00:06:23] In practice, the good way, actually the most common way, to select the random initial cluster centroids isn't quite what I showed: it's to actually pick K examples out of your training set and just set the cluster centroids to be equal to those K randomly chosen examples. Right, so in a low-dimensional space, like the 2D plot you can do in a diagram, it doesn't really matter, but when you work with very high-dimensional data sets, the more common way to initialize is to just pick, you know, K training examples and set the cluster centroids to be at exactly the locations of those examples; but in low-dimensional spaces, it doesn't make a big difference. [00:07:00] And then next, you repeat until convergence: one is
[00:08:17] So the two steps you alternate between: the first one is, set c(i) for every value of i. So for every example, set c(i) equal to, you know, either 1 or 2, depending on whether that example x(i) is closer to cluster centroid 1 or cluster centroid 2, right? So this is the step of taking a point and coloring it either red or blue, and we represent that by setting c(i) equal to 1 or 2, if you have two clusters, if K is equal to 2. [00:08:59] (Oh, the notes say L1 norm squared? The ones from this morning, sent out this morning? Oh, that's weird; it shouldn't be the L1 norm, and if it says the L1 norm, that's a mistake. And it turns out whether you use the L2 norm or the L2 norm squared, they give you the same answer, because the arg min is the same either way, but you'd usually do one of those. Oh, I see... okay, looks like the notes do say that, okay, cool.) [00:09:36] But by default, when we write that
norm, we actually mean the L2 norm, yeah. By default this is the L2 norm of x if it's unspecified; if it's the L1 norm, we usually write it with a subscript. So the L2 norm is more common, and with or without the square you get the same answer. Okay, thank you. [00:09:55] All right, so that's coloring the dots: painting each dot either red or blue. And then for the second step, for each cluster, take all the examples assigned to a certain cluster, right, assigned to cluster j, and set mu_j to be the average of all the points assigned to that cluster. Yeah? [00:10:31] (Oh, you know... all right, none of the black markers are working. Is this better? All right, let me try to use this one. Is this part unclear? If you can't see this part, oh, I'll write it out more clearly, sure. [00:11:00] Got it. Let there be light. All right, awesome, great, that was the easy request.)
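The two alternating steps just described can be sketched directly in code. This is a minimal illustration under my own assumptions (the function name, the random seed, and the convergence check are mine, not from the lecture), using NumPy and the initialize-at-K-random-training-examples scheme mentioned above:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means sketch: X is an (m, d) array of unlabeled examples."""
    rng = np.random.default_rng(seed)
    # Initialize centroids at k randomly chosen training examples.
    mu = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iters):
        # Step 1: c(i) = argmin_j ||x(i) - mu_j||^2  ("color each point")
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        c = dists.argmin(axis=1)
        # Step 2: mu_j = mean of the points assigned to cluster j
        new_mu = np.array([X[c == j].mean(axis=0) if np.any(c == j) else mu[j]
                           for j in range(k)])
        if np.allclose(new_mu, mu):  # nothing changes: converged
            break
        mu = new_mu
    return c, mu
```

Running this on two well-separated blobs of points recovers the two clusters regardless of which pair of examples the initialization happens to pick, mirroring the animation in the lecture.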
Okay, I'll let you look at it for another minute. All right, okay, thank you. [00:11:26] (Go for it... and this wasn't positive? Okay.) All right, now I can move it up. [00:11:44] All right, um, so it turns out that this algorithm can be proven to converge. Exactly why is written out in the lecture notes, but it turns out that if you write this as a cost function, [00:12:10] so the cost function for a certain set of assignments of examples to cluster centroids and for a certain set of positions of the cluster centroids, so c, these are the assignments, and mu, these are the centroids, [00:12:27] right, so this cost here is the sum over your training set of the squared distance between each point and the cluster centroid it is assigned to. So it turns out, I won't prove this, a little bit more detail is written out in the lecture notes, but it turns out that on every iteration, k-means will drive this cost
function down, and so, you know, beyond a certain point this cost function can't go any lower; look, it just can't go below zero, right? And so this shows that k-means must converge, or at least this function must converge, because there's a non-negative function that's going down on every iteration, so at some point it has to stop going down, and then you could declare k-means to have converged. [00:13:10] In practice, if you're running k-means on a very, very large dataset, then as you plot J against the number of iterations, J may go down, and, you know, just because of lack of compute or lack of patience, you might just stop running it after a while if it's going down too slowly. So that's sort of k-means in practice: maybe it hasn't totally converged, but you just cut it off and call it good enough. [00:13:33] Now, the most frequently asked question I get about k-means is how do you choose K?
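The cost function described above, sometimes called the distortion, can be written out explicitly (this is a transcription into symbols of the sum-of-squared-distances description given in the lecture):

```latex
J(c, \mu) \;=\; \sum_{i=1}^{m} \left\lVert x^{(i)} - \mu_{c^{(i)}} \right\rVert^{2}
```

Step 1 minimizes $J$ with respect to the assignments $c$ while holding the centroids $\mu$ fixed, and step 2 minimizes $J$ with respect to $\mu$ while holding $c$ fixed; since $J \ge 0$ and neither step can increase it, $J$ must converge, which is the convergence argument sketched above.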
It turns out that when I use k-means, I still usually choose K by hand. [00:13:47] And why? Because in unsupervised learning, sometimes it's just ambiguous how many clusters there are. [00:13:54] With this dataset, some of you will see two clusters and some of you will see four clusters, and it's just inherently ambiguous what the right number of clusters is. So there are some formulas you can find online, with criteria like AIC and BIC, for automatically choosing the number of clusters; in practice I tend not to use them, [00:14:22] because I usually look at the downstream application of what you actually want to use k-means for in order to make a decision on the number of clusters. So for example, if you're doing market segmentation because your marketers want to design different marketing campaigns for different groups of users, then your marketers might have the bandwidth to design four separate marketing campaigns but not a hundred marketing campaigns, so there'd be good reason to choose four clusters rather than a hundred clusters. [00:14:49] So often you look at the purpose of what you're doing this for. I think in the programming exercise in the homework you'll see an image compression exercise where you want to cluster colors into a smaller number of clusters; you implement this, and it's actually one of the most fun exercises, I think. But there you'd be asking how much you want to compress the image in order to decide how many clusters to use. [00:15:16] Okay, so I usually pick the number of clusters either manually or by looking at what you want to use k-means clustering for. Are you trying to cluster news articles, like the Google News example I think I showed in the first lecture? Then you'd say, well, how many clusters kind of make sense for news articles? Okay.
[00:15:37] All right, so... oh, sure, you're asking whether it can get stuck in local minima? Oh yes, k-means does get stuck in local minima sometimes. And so if you're worried about local minima, one thing you can do is run k-means, say, ten times or a hundred times or a thousand times, from different random initializations of the cluster centroids, and then pick whichever run resulted in the lowest value for this cost function. [00:16:12] All right, so you'll play with this more in the programming exercise. [00:16:22] Now, there's a problem that seems closely related but is actually quite different, for which we'll derive different algorithms, which is density estimation. So let me motivate this. Some time back I had some friends working on a problem, which I've simplified a little bit, like this.
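The random-restart remedy for local minima mentioned a moment ago (run k-means many times from different random initializations and keep the lowest-cost run) can be sketched directly. A minimal sketch; the helper names, restart count, and iteration cap are my own, not the course's reference code:

```python
import numpy as np

def kmeans_once(X, k, rng, iters=20):
    """One k-means run from a random initialization.
    Returns centroids, assignments, and the final distortion J."""
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        c = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
        for j in range(k):
            if np.any(c == j):
                centroids[j] = X[c == j].mean(axis=0)
    J = float(((X - centroids[c]) ** 2).sum())
    return centroids, c, J

def kmeans_restarts(X, k, n_restarts=100, seed=0):
    """Run k-means from many random initializations and keep the run with
    the lowest cost, as a guard against bad local minima."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):
        centroids, c, J = kmeans_once(X, k, rng)
        if best is None or J < best[2]:
            best = (centroids, c, J)
    return best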
line alright and every time an aircraft engine comes on the [00:16:51] time an aircraft engine comes on the assembly line you measure some features [00:16:53] assembly line you measure some features of this engine so you measure some [00:16:54] of this engine so you measure some features about the vibration and you [00:16:56] features about the vibration and you measure some features of all the heat [00:16:58] measure some features of all the heat that the aircraft engine is producing [00:17:01] that the aircraft engine is producing and let's say that you gathered a set [00:17:09] and the anomaly detection problem is if [00:17:21] and the anomaly detection problem is if you get a new aircraft engine that comes [00:17:23] you get a new aircraft engine that comes off the assembly line and if the [00:17:25] off the assembly line and if the vibration feature it takes on this value [00:17:27] vibration feature it takes on this value and the heat feature takes on this value [00:17:29] and the heat feature takes on this value is that aircraft engine an anomalous one [00:17:32] is that aircraft engine an anomalous one this is your [00:17:33] this is your right and so the application of this is [00:17:36] right and so the application of this is that as your aircraft engine comes off [00:17:39] that as your aircraft engine comes off the assembly line if you see a very [00:17:40] the assembly line if you see a very unusual signature in terms of the [00:17:42] unusual signature in terms of the vibrations and heat the aircraft engine [00:17:44] vibrations and heat the aircraft engine is generating then probably something's [00:17:46] is generating then probably something's wrong with this aircraft engine if your [00:17:48] wrong with this aircraft engine if your people have you have your team inspected [00:17:50] people have you have your team inspected further or tested further before you [00:17:52] further or tested further before you should the airplane before you ship the 
[00:17:54] should the airplane before you ship the engine tort or airplane may occur and [00:17:56] engine tort or airplane may occur and then something goes around the air and [00:17:57] then something goes around the air and there's a there's a major accident a [00:17:59] there's a there's a major accident a major disaster right [00:18:01] major disaster right and so anomaly detection is most [00:18:04] and so anomaly detection is most commonly done or one of the common ways [00:18:06] commonly done or one of the common ways to implement anomaly detection is the [00:18:10] to implement anomaly detection is the model P of X which is given all of these [00:18:13] model P of X which is given all of these blue examples given all these thoughts [00:18:16] blue examples given all these thoughts can you model what is the density from [00:18:19] can you model what is the density from which X was drawn so then if P of X is [00:18:24] which X was drawn so then if P of X is very small then you flag an anomaly [00:18:29] very small then you flag an anomaly meaning that gee I think something's [00:18:31] meaning that gee I think something's funny here and maybe someone should [00:18:34] funny here and maybe someone should inspect this aircraft engine a little [00:18:37] inspect this aircraft engine a little bit further sonar detection is used for [00:18:40] bit further sonar detection is used for tasks like this for inspection tossed [00:18:43] tasks like this for inspection tossed like this is used for many years ago as [00:18:46] like this is used for many years ago as su work of some telecoms providers with [00:18:49] su work of some telecoms providers with you know helping out telecoms company e [00:18:51] you know helping out telecoms company e on anomaly detection to figure out if [00:18:54] on anomaly detection to figure out if something's gone wrong with part of this [00:18:56] something's gone wrong with part of this cells her network right so if one day [00:18:58] 
cells her network right so if one day one of the South Tower starts throwing [00:19:00] one of the South Tower starts throwing off network patterns that seem very [00:19:02] off network patterns that seem very unusual then maybe something's wrong [00:19:03] unusual then maybe something's wrong with that cell tower like that [00:19:05] with that cell tower like that something's gone wrong it sent out the [00:19:06] something's gone wrong it sent out the technician to fix it it's also used a [00:19:09] technician to fix it it's also used a computer security of a computer save [00:19:11] computer security of a computer save computer Stanford start sending are very [00:19:14] computer Stanford start sending are very strange [00:19:14] strange you know network traffic there's very [00:19:17] you know network traffic there's very unusual relative their views on the four [00:19:19] unusual relative their views on the four browser what was this is a very [00:19:21] browser what was this is a very anomalous network traffic then maybe IT [00:19:23] anomalous network traffic then maybe IT stops you have a look to see if that [00:19:25] stops you have a look to see if that good computer has been hacked so these [00:19:27] good computer has been hacked so these are some of the applications that were [00:19:29] are some of the applications that were an all new section and what good way to [00:19:31] an all new section and what good way to do this is given the unlabeled data set [00:19:33] do this is given the unlabeled data set model P of X and then if you have very [00:19:35] model P of X and then if you have very low probability examples you flag that [00:19:37] low probability examples you flag that as a possible anomaly for further study [00:19:40] as a possible anomaly for further study now given this data sets how do you [00:19:45] now given this data sets how do you model this one distinct thing about this [00:19:47] model this one distinct thing about this green dots is 
that neither the vibration [00:19:50] green dots is that neither the vibration no the heat signature is actually out of [00:19:52] no the heat signature is actually out of range right you know like there are a [00:19:53] range right you know like there are a lot of aircraft engines with vibrations [00:19:56] lot of aircraft engines with vibrations in that range they're long of aircraft [00:19:57] in that range they're long of aircraft engines with heat in that range so [00:19:59] engines with heat in that range so neither feature by itself is actually [00:20:01] neither feature by itself is actually data unusual it's actually the [00:20:02] data unusual it's actually the combination of the two that is unusual [00:20:04] combination of the two that is unusual and so that's less what I want to do is [00:20:07] and so that's less what I want to do is uh come up with an algorithm to model [00:20:10] uh come up with an algorithm to model this and in fact welcome of an algorithm [00:20:12] this and in fact welcome of an algorithm they can model you know maybe maybe your [00:20:15] they can model you know maybe maybe your data density looks like this made more [00:20:16] data density looks like this made more of an L shape like that but how do you [00:20:18] of an L shape like that but how do you model P of X with the data coming from [00:20:22] model P of X with the data coming from an L shape and it turns out that there [00:20:24] an L shape and it turns out that there is no textbook distribution right you [00:20:27] is no textbook distribution right you know there isn't you know if you look at [00:20:28] know there isn't you know if you look at this simple and there's no exponential [00:20:30] this simple and there's no exponential family model the types of distributions [00:20:32] family model the types of distributions there is no distribution for modeling [00:20:35] there is no distribution for modeling very very complex distributions like [00:20:37] very very 
complex distributions like this so what I'm going to talk about is [00:20:40] this so what I'm going to talk about is the mixture of gaussians volatile which [00:20:42] the mixture of gaussians volatile which would look for data like this and say it [00:20:44] would look for data like this and say it looks like this data actually comes from [00:20:46] looks like this data actually comes from two Gaussian there's one Gaussian maybe [00:20:48] two Gaussian there's one Gaussian maybe that's one type of aircraft engine that [00:20:50] that's one type of aircraft engine that you know it's drawn from a Gaussian like [00:20:52] you know it's drawn from a Gaussian like the one below and a separate aircraft [00:20:54] the one below and a separate aircraft type of aircraft engine that's drawn [00:20:57] type of aircraft engine that's drawn from a Gaussian like that above and this [00:21:00] from a Gaussian like that above and this is why there's a lot of probably Mars in [00:21:02] is why there's a lot of probably Mars in just O'Shea region by very low [00:21:04] just O'Shea region by very low probability outside that O'Shay region [00:21:07] probability outside that O'Shay region right oh and these ellipses I'm drawing [00:21:09] right oh and these ellipses I'm drawing other contours of these two gaussians [00:21:11] other contours of these two gaussians right and so what I'd like to do next is [00:21:16] right and so what I'd like to do next is develop the mixture of gaussians model [00:21:19] develop the mixture of gaussians model which is useful for an audience section [00:21:22] which is useful for an audience section and and and then those this will lead us [00:21:26] and and and then those this will lead us to our second unsupervised so in order [00:21:34] to our second unsupervised so in order to make the mixture of gaussians model a [00:21:38] to make the mixture of gaussians model a bit easier to develop let me just use a [00:21:40] bit easier to develop let me 
just use a one-dimensional example where so let's [00:21:48] one-dimensional example where so let's see so let's say that we gather data set [00:21:51] see so let's say that we gather data set that looks like [00:21:52] that looks like this so it's just one roll number [00:22:05] this so it's just one roll number searches online I've plotted a few dots [00:22:07] searches online I've plotted a few dots um so looks like this day there maybe [00:22:10] um so looks like this day there maybe comes from two gaussians or it looks [00:22:12] comes from two gaussians or it looks like you know there's some data from [00:22:13] like you know there's some data from this Gaussian and there's some data from [00:22:17] this Gaussian and there's some data from that Gaussian on the right um and it's [00:22:21] that Gaussian on the right um and it's and if only we knew right which example [00:22:25] and if only we knew right which example had come from which Gaussian if if we [00:22:28] had come from which Gaussian if if we knew that these examples that come from [00:22:31] knew that these examples that come from Gaussian one we wanted to know with [00:22:33] Gaussian one we wanted to know with crosses and if only we knew what the [00:22:37] crosses and if only we knew what the actually this finally fell over if only [00:22:40] actually this finally fell over if only we knew that these examples that come [00:22:43] we knew that these examples that come from Gaussian to which I'm willing to [00:22:45] from Gaussian to which I'm willing to draw with oles then we just fake calcium [00:22:47] draw with oles then we just fake calcium 1/2 the crosses figure out into the O's [00:22:49] 1/2 the crosses figure out into the O's and then we'd be pretty much done right [00:22:51] and then we'd be pretty much done right oh and sorry and so these are the two [00:22:54] oh and sorry and so these are the two gaussians and so the overall density [00:22:56] gaussians and so the overall density would 
be something like this right [00:22:58] would be something like this right that's the probability of all the party [00:23:01] that's the probability of all the party muscle left while probably must know [00:23:02] muscle left while probably must know very low less probably mass on so the [00:23:08] very low less probably mass on so the overall density just told again would be [00:23:10] overall density just told again would be no high no high something like that [00:23:13] no high no high something like that right but the reason and then if you [00:23:19] right but the reason and then if you actually had these labels if you knew [00:23:20] actually had these labels if you knew that these examples came from gaussian [00:23:22] that these examples came from gaussian one those examples come from gaussian [00:23:24] one those examples come from gaussian two then you can actually use an [00:23:26] two then you can actually use an algorithm very similar to GD a gaussian [00:23:28] algorithm very similar to GD a gaussian difference to fit this model the problem [00:23:32] difference to fit this model the problem with this density estimation problem is [00:23:34] with this density estimation problem is you just see this data and maybe the [00:23:37] you just see this data and maybe the data came from two different gaussians [00:23:39] data came from two different gaussians but you don't know which example [00:23:40] but you don't know which example actually came from which coliseum okay [00:23:42] actually came from which coliseum okay so the e/m algorithm or the expectation [00:23:45] so the e/m algorithm or the expectation maximization algorithm will allow us to [00:23:47] maximization algorithm will allow us to fit a model [00:23:50] fit a model despite not knowing which Gaussian each [00:23:54] despite not knowing which Gaussian each example [00:24:07] so let me first write down the young [00:24:10] so let me first write down the young mixture of gaussians model and 
then [00:24:20] mixture of gaussians model and then we'll describe the EML room for this so [00:24:24] we'll describe the EML room for this so let's imagine let's suppose that there's [00:24:28] let's imagine let's suppose that there's a so the term we sometimes use this [00:24:36] a so the term we sometimes use this latent but latent just means hidden [00:24:40] observed [00:25:39] so so let's imagine that there's some [00:25:44] so so let's imagine that there's some hidden random variable Z and the term [00:25:47] hidden random variable Z and the term latent just means hidden on observe it [00:25:49] latent just means hidden on observe it means that it exists but you don't get [00:25:50] means that it exists but you don't get to see the value directly so I say later [00:25:53] to see the value directly so I say later it just means hidden on observe so let's [00:25:56] it just means hidden on observe so let's imagine that this hidden or latent [00:25:57] imagine that this hidden or latent random variable Z and Xin Z I had this [00:26:02] random variable Z and Xin Z I had this joint distribution and this this this is [00:26:04] joint distribution and this this this is very very similar to the model you saw [00:26:05] very very similar to the model you saw in Gaussian destroyers but Zi is [00:26:09] in Gaussian destroyers but Zi is multinomial with some set of parameters [00:26:11] multinomial with some set of parameters Phi for a mixture of two gaussians this [00:26:14] Phi for a mixture of two gaussians this would just be Bernoulli with two values [00:26:16] would just be Bernoulli with two values but if you're a mixture of K calcium's [00:26:17] but if you're a mixture of K calcium's then Z you know can take on values from [00:26:20] then Z you know can take on values from 1 through K and it was two gaussians it [00:26:25] 1 through K and it was two gaussians it just before nearly and then once you [00:26:28] just before nearly and then once you know that one 
example comes from [00:26:30] know that one example comes from Gaussian number J then X condition that [00:26:34] Gaussian number J then X condition that Zi is equal to J that is drawn from a [00:26:37] Zi is equal to J that is drawn from a Gaussian distribution with some mean and [00:26:39] Gaussian distribution with some mean and some coherence Sigma okay so the two [00:26:44] some coherence Sigma okay so the two unimportant ways this is different than [00:26:46] unimportant ways this is different than GTA one well I set Z to be one of K [00:26:50] GTA one well I set Z to be one of K values instead of one of two values and [00:26:52] values instead of one of two values and GDA god-centered from analysis we had Z [00:26:56] GDA god-centered from analysis we had Z know why the labels Y took on one of two [00:26:58] know why the labels Y took on one of two values and then second is I have Sigma J [00:27:02] values and then second is I have Sigma J instead of Sigma so by convention when [00:27:05] instead of Sigma so by convention when we feed mixture of gaussians models we [00:27:07] we feed mixture of gaussians models we let each gaussian have his own [00:27:08] let each gaussian have his own covariance matrix Sigma we could [00:27:10] covariance matrix Sigma we could actually force it to be the same way you [00:27:11] actually force it to be the same way you want but these are the trivial [00:27:12] want but these are the trivial differences the most significant [00:27:15] differences the most significant difference is that in Gaussian districts [00:27:20] difference is that in Gaussian districts I Y I whereas Y was observed and the [00:27:25] I Y I whereas Y was observed and the main difference between this and [00:27:27] main difference between this and Gaussian disappearing analysis is now we [00:27:29] Gaussian disappearing analysis is now we have replaced that with this latent or [00:27:31] have replaced that with this latent or hidden random variables Z are 
they [00:27:33] hidden random variables Z are they do not get to see in the training set [00:27:34] do not get to see in the training set okay so all right that was better [00:28:03] all right so if we knew the sea-ice [00:28:12] all right so if we knew the sea-ice right then we can use maximum likelihood [00:28:18] right then we can use maximum likelihood estimation right so if only we knew the [00:28:20] estimation right so if only we knew the value of the Z is which we don't but if [00:28:22] value of the Z is which we don't but if only we did then we could use maximum [00:28:24] only we did then we could use maximum likelihood estimation or mo e to [00:28:26] likelihood estimation or mo e to estimate everything you know so we were [00:28:28] estimate everything you know so we were right the log likelihood other [00:28:30] right the log likelihood other parameters equals some log P of X our Zi [00:28:39] parameters equals some log P of X our Zi you know given the parameters right and [00:28:44] you know given the parameters right and then you take the river to set the ders [00:28:46] then you take the river to set the ders equal to zero and you guys did this in [00:28:48] equal to zero and you guys did this in problem set one right and then you find [00:28:50] problem set one right and then you find that Phi J is equal to 1 over m [00:29:22] okay so if only you knew the values of [00:29:25] okay so if only you knew the values of the sea-ice then you could use maximum [00:29:28] the sea-ice then you could use maximum likelihood estimates and this is what [00:29:31] likelihood estimates and this is what you get and this is pretty much the [00:29:32] you get and this is pretty much the formulas actually these two are exactly [00:29:35] formulas actually these two are exactly the formulas we had for Gaussian Tuscon [00:29:38] the formulas we had for Gaussian Tuscon analysis except we'll replace Y with Z [00:29:42] analysis except we'll replace Y with Z and then 
there's some other formula for [00:29:44] and then there's some other formula for Sigma just written in the lecture notes [00:29:46] Sigma just written in the lecture notes but I won't that one right down here [00:29:47] but I won't that one right down here okay um [00:29:50] okay um but the reason we can't use this use [00:29:54] but the reason we can't use this use these formulas we don't actually know [00:29:55] these formulas we don't actually know whether the values of Z so what we will [00:30:00] whether the values of Z so what we will do in the e/m algorithm is two steps in [00:30:13] do in the e/m algorithm is two steps in the first step we will guess the value [00:30:18] the first step we will guess the value of the Z's and in the second step we [00:30:21] of the Z's and in the second step we will use these equations using the [00:30:24] will use these equations using the values of disease we just guessed so let [00:30:27] values of disease we just guessed so let me so sometimes in machine learning [00:30:29] me so sometimes in machine learning something to call this a bootstrap [00:30:31] something to call this a bootstrap procedure where you get something they [00:30:33] procedure where you get something they run an algorithm you're using your [00:30:35] run an algorithm you're using your guesses and then you update your guesses [00:30:37] guesses and then you update your guesses and then run the algorithm okay let me [00:30:38] and then run the algorithm okay let me let me make that concrete by writing [00:30:40] let me make that concrete by writing this down [00:30:54] so the e/m algorithm has two steps the a [00:30:59] so the e/m algorithm has two steps the a step also called the expectation step is [00:31:13] step also called the expectation step is set w IJ so W IJ is going to be the [00:31:30] set w IJ so W IJ is going to be the probability that Zi is equal to J okay [00:31:36] probability that Zi is equal to J okay given all the parameters and and 
much as [00:31:39] given all the parameters and and much as we did with generative learning [00:31:41] we did with generative learning algorithms right with generative [00:31:44] algorithms right with generative learning algorithms we'll use Bayes rule [00:31:45] learning algorithms we'll use Bayes rule to estimate the probability of Y given X [00:31:50] to estimate the probability of Y given X and so to compute this you use a similar [00:31:52] and so to compute this you use a similar Bayes rule type of calculation and so [00:31:56] Bayes rule type of calculation and so disappear [00:32:18] right where for example this term here P [00:32:27] right where for example this term here P of X i given Z I equals J this would be [00:32:30] of X i given Z I equals J this would be a Gaussian density right this comes from [00:32:32] a Gaussian density right this comes from a Gaussian density with mean mu J and [00:32:36] a Gaussian density with mean mu J and covariance Sigma J right and so this [00:32:39] covariance Sigma J right and so this term here would be a 1 over you know 2 [00:32:43] term here would be a 1 over you know 2 pi it's an N over 2 Sigma J and then [00:33:02] pi it's an N over 2 Sigma J and then this term here I guess this would be a [00:33:04] this term here I guess this would be a Phi J that's just a Bernoulli [00:33:06] Phi J that's just a Bernoulli probability remember Z is multinomial [00:33:08] probability remember Z is multinomial right Suzy this multinomial we're [00:33:14] right Suzy this multinomial we're parameters Phi so I guess the parameters [00:33:17] parameters Phi so I guess the parameters v for multinomial distribution tell you [00:33:19] v for multinomial distribution tell you what's the chance of Z B 1 2 3 4 and so [00:33:22] what's the chance of Z B 1 2 3 4 and so on up to K so the chance of Zi being for [00:33:25] on up to K so the chance of Zi being for the K is just this chance of Zi pee [00:33:28] the K is just this chance of Zi pee 
[00:33:30] And p(z = j) is just phi_j, right — you just read it off one of the parameters of your multinomial probability, for all the different values of j. And similarly for the terms of the denominator: this term here is the Gaussian density, and that second term is the multinomial probability that you have for z. And so that's how you plug in all of these numbers and use Bayes' rule — use this equation — to compute, given the positions of all these Gaussians, the chance of z_i taking on a certain value, which is what you store in w_ij.

[00:34:06] And to make this really concrete, remember the sigmoid function from logistic regression: if you scan through the examples, right, the sigmoid gives the chance of the same point being a positive or a negative example. In the same way, w_ij is just the chance of each of these examples coming from either the z = 1 or the z = 0 Gaussian, and you store all of these numbers in the variables w_ij, okay? So w_ij is just the posterior chance of this example coming from the j-th Gaussian. [00:35:09] So that's the E-step, and you compute the w_ij for every single training example. And next, the M-step is this.

[00:35:37] [Student question.] Sorry — this one? Okay. So the E-step tells us, you know — we're trying to guess the values of the z's, right? We figure out the probability of z being one, two, three, four for each training example. And then in the M-step, what we're going to do is use the formulas we have for maximum likelihood estimation, and I want you to compare these with the equations I had above. [00:36:31] So these equations are a lot like the equations above, except that instead of the indicator 1{z_i = j} we've replaced it with w_ij — which, by the way, is the expected value of this indicator function, because the expected value of an indicator function is just equal to the probability of the thing in the middle being true. And then there's a formula for Sigma_j as well, which you can get from the lecture notes, but I won't write it down here, okay?

[00:37:19] So one intuition for this mixture of Gaussians algorithm is that it's a little bit like k-means, but with soft assignments. So in k-means, in the first step, we take each point and just assign it to one of the k cluster centroids, right? And if it was even a little bit closer to the red cluster centroid than the blue cluster centroid, we would just assign it to the red cluster. [00:37:43] So even if one centroid is just a little bit closer than another, k-means makes what's called a hard assignment — meaning, you know, whichever cluster centroid is closest, we just assign the point a hundred percent to that cluster centroid. EM, you can think of as implementing a softer way of assigning points to the different cluster centroids, because instead of just picking the single closest Gaussian center and assigning the point there, it uses these probabilities and gives each point a weighting in terms of how much to assign it to Gaussian one versus Gaussian two. [00:38:15] And then the second step updates the means accordingly, right: sum over all the x_i's, weighted by the extent to which they're assigned to that cluster centroid, divided by the number of examples assigned to that cluster centroid. Okay, so that's one intuition connecting EM and k-means.
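The E-step and M-step just described can be sketched in a few lines of numpy. This is a minimal illustration for a 1-D mixture of two Gaussians, not the lecture's code; the function and variable names (`em_two_gaussians`, `normal_pdf`, the initialization scheme) are my own choices.

```python
import numpy as np

def normal_pdf(x, mu, var):
    # Density of N(mu, var) evaluated at each point of x.
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_two_gaussians(x, n_iter=100):
    # Rough initialization, like randomly initializing centroids in k-means.
    phi = np.array([0.5, 0.5])          # multinomial parameters phi_j
    mu = np.array([x.min(), x.max()])   # component means
    var = np.array([x.var(), x.var()])  # component variances
    for _ in range(n_iter):
        # E-step: w[i, j] = P(z_i = j | x_i; phi, mu, var) by Bayes' rule.
        w = np.stack([phi[j] * normal_pdf(x, mu[j], var[j]) for j in (0, 1)],
                     axis=1)
        w /= w.sum(axis=1, keepdims=True)  # each row sums to 1: soft assignment
        # M-step: the ML formulas with the indicator 1{z_i = j} replaced by w_ij.
        nj = w.sum(axis=0)                 # effective count of points per Gaussian
        phi = nj / len(x)
        mu = (w * x[:, None]).sum(axis=0) / nj
        var = (w * (x[:, None] - mu) ** 2).sum(axis=0) / nj
    return phi, mu, var
```

On well-separated data the recovered means approach the true component means; with overlapping components, convergence to a local optimum of the likelihood is all that's guaranteed.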
[00:38:40] But when you run this algorithm, it turns out that it will converge — with some caveats I'll get to later — and it will find a pretty decent estimate of the parameters, you know, say, fitting a mixture of two Gaussians model. [00:39:00] And so, if you are given a data set of, say, airplane engines, you can run this algorithm to fit the mixture of two Gaussians, and then when a new airplane engine rolls off the assembly line — so after fitting with the EM algorithm, you now have a joint density p(x, z), and so the density for x is just the sum over all the values of z of p(x, z).

[00:39:39] And so a mixture of Gaussians can fit distributions that look like this, and it can fit distributions that look like this — these are both mixtures of two Gaussians — so this gives you a very rich family of models to fit very complicated distributions. [00:39:58] And you can also fit, you know, something like this: this is a mixture of two Gaussians where, I guess, one is a thin, narrow Gaussian and one is a much wider, fatter Gaussian. So a mixture of two Gaussians can fit a lot of different things, and a mixture of more than two Gaussians can fit even richer models. And so, by doing this, you can now model p(x) for many complicated densities, including this one — the example I drew just now. This will allow you to fit a probability density function that puts almost all of the probability mass on a region that looks like this. [00:40:31] And so, when you have a new example, you can evaluate p(x), and if p(x) is large, then you can say, you know, this looks okay; and if p(x) is less than epsilon, you can flag it and say, oh, take another look at this airplane engine, okay?
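The p(x) < epsilon anomaly check can be sketched like this, given mixture parameters (say, from a fitted EM run). The names `mixture_density` and `flag_engine`, and the epsilon value, are illustrative choices of mine, not from the lecture.

```python
import numpy as np

def mixture_density(x, phi, mu, var):
    # p(x) = sum over z of p(x, z) = sum_j phi_j * N(x; mu_j, var_j),
    # i.e. the joint density with z marginalized out.
    comps = [phi[j] * np.exp(-(x - mu[j]) ** 2 / (2 * var[j]))
             / np.sqrt(2 * np.pi * var[j]) for j in range(len(phi))]
    return sum(comps)

def flag_engine(x_new, phi, mu, var, epsilon=1e-4):
    # Flag the example for a second look when its density falls below epsilon.
    return mixture_density(x_new, phi, mu, var) < epsilon
```

For example, with two unit-variance components centered at -5 and 5, a point at 0 sits in a low-density valley and gets flagged, while a point near either mean does not.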
[00:40:50] So, um, I kind of just wrote down this algorithm with a little bit of a hand-wavy explanation of how to derive it, right? I said: if only you knew the values of z, you could just use maximum likelihood estimation — so let's guess the values of z and then plug those into the formulas for maximum likelihood estimation. It turns out that hand-wavy explanation works in the particular case of mixtures of Gaussians, but there is a more formal way of deriving the EM algorithm that shows that this is a maximum likelihood estimation algorithm, and that it converges, at least to a local optimum. [00:41:29] And in particular, what we'll do is show that if you are given a model p(x, z) parameterized by theta, and your goal is to maximize p(x), right — so this is what maximum likelihood is supposed to do — then EM is exactly trying to do that, okay? So in a minute I'll present this more general derivation — the full, more rigorous derivation of the EM algorithm [00:42:06] that doesn't rely on this hand-wavy argument of guessing the z's and using maximum likelihood with those guessed values. So I'll do the rigorous derivation of EM in a minute, but before I do that, let me just pause and check if there are any questions.

[00:42:41] [Student: maybe it would help to not think of them as weights?] Let's see — I think this actually is the weighting you assign to a certain Gaussian, so that's one intuition, and hence "weights." But one way to think of this is that w_ij is how much x_i is assigned to — [00:43:28] so w_ij is the strength of how strongly you want to assign that training example x_i to that cluster, or to that particular Gaussian. And so this is a number between 0 and 1, right, and every point is assigned with a total strength equal to 1, because all these probabilities must sum up to 1.
[00:43:50] So when I take this point and assign it, you know, 0.8 to the closer Gaussian and 0.2 to the more distant one, this is our guess that, well, there's an 80% chance it came from that Gaussian and a 20% chance it came from the second Gaussian. Does this make sense?

[00:44:10] Oh, I see. So, let's see — when you're running the EM algorithm, you never know the true values of z, all right? You're given the data set, so you're only told the x's. And suppose we know these airplane engines were generated from, you know, two different Gaussians — maybe there are two separate assembly processes, one from plant number one and one from plant number two, and maybe they actually operate a little bit differently. But by the time the two supplies of aircraft engines get to you, [00:44:42] they've been mixed together, and so you can't tell anymore which aircraft engine came from plant one and which aircraft engine came from plant two. You don't even know there are two plants — you just see the stream of aircraft engines, and you're hypothesizing there are two types. And so, in every iteration of EM, you're taking each aircraft engine and guessing: you know, for this one, I think there's an 80% chance it came from process one and a 20% chance it came from process two — so that's the E-step. [00:45:12] And then in the M-step, you look at all the engines that you're kind of guessing were generated by process one, and you update your Gaussian to be a better model for all of the things that you kind of think were generated by process one. And if there's something that you're absolutely sure came from process one, then it has a weight of one, or close to one; and if there's something that you think had only a 10% chance of coming from process one, then that example is given a lower weight when you update the mean for that Gaussian.
given a lower weight and how [00:45:38] example is given a lower weight and how you update the meaning for that [00:45:44] all right so [00:46:31] well I still remember when I was an [00:46:34] well I still remember when I was an undergrad doing a summer internship at [00:46:36] undergrad doing a summer internship at AT&T Bell Labs and then someone the few [00:46:39] AT&T Bell Labs and then someone the few offices down had learned about diem for [00:46:41] offices down had learned about diem for the mixture of gaussians her first time [00:46:42] the mixture of gaussians her first time was running on his computer and he's [00:46:44] was running on his computer and he's going around to every single office [00:46:46] going around to every single office saying oh my god you gotta check this [00:46:48] saying oh my god you gotta check this out this is unbelievable look at what [00:46:49] out this is unbelievable look at what this elephant can do Tiffany makes is a [00:46:51] this elephant can do Tiffany makes is a Gaussian so it shows you those other [00:46:54] Gaussian so it shows you those other people I hang out with all right um so [00:47:06] people I hang out with all right um so in order to derive you know so slightly [00:47:09] in order to derive you know so slightly hand wavy arguments that oh let's get to [00:47:11] hand wavy arguments that oh let's get to let's guess the values of the Z's let's [00:47:13] let's guess the values of the Z's let's just have these ways and plug them into [00:47:14] just have these ways and plug them into maximum likelihood um what I like to do [00:47:17] maximum likelihood um what I like to do is give a more rigorous derivation for [00:47:20] is give a more rigorous derivation for ye M algorithm is a reasonable algorithm [00:47:22] ye M algorithm is a reasonable algorithm and Y is a massive likely estimation [00:47:25] and Y is a massive likely estimation algorithm and why we can expect it to [00:47:26] algorithm and why we can 
expect it to converge and it turns out there rather [00:47:29] converge and it turns out there rather than just proving you know that this is [00:47:31] than just proving you know that this is a sound algorithm what we'll see on [00:47:33] a sound algorithm what we'll see on Wednesday is that this view of p.m. [00:47:35] Wednesday is that this view of p.m. allows us to derive em in a in a more [00:47:39] allows us to derive em in a in a more correct way for other models as well [00:47:41] correct way for other models as well they make sense of gaussians on [00:47:42] they make sense of gaussians on Wednesday we'll talk about a model [00:47:46] Wednesday we'll talk about a model called factor analysis unless you model [00:47:48] called factor analysis unless you model gaussians an extremely high dimensional [00:47:49] gaussians an extremely high dimensional spaces where if you have a thousand [00:47:51] spaces where if you have a thousand dimensional data but only thirty [00:47:52] dimensional data but only thirty examples how do you for the girls into [00:47:54] examples how do you for the girls into that so we talked about that on [00:47:55] that so we talked about that on Wednesday and it turns out this [00:47:57] Wednesday and it turns out this derivation that yeah we're gonna go [00:47:58] derivation that yeah we're gonna go about through now is crucial for [00:48:01] about through now is crucial for applying M accurately in problems like [00:48:05] applying M accurately in problems like that so in order to lead up to that [00:48:10] that so in order to lead up to that derivation let me describe Jensen's [00:48:14] derivation let me describe Jensen's inequality so let F be a convex function [00:48:25] to do yeah we're actually going to need [00:48:27] to do yeah we're actually going to need concave functions so be all - of [00:48:29] concave functions so be all - of everything but what gets it done in a [00:48:31] everything but what gets it done in a second 
but so a convex function means [00:48:39] second but so a convex function means the second derivative is greater than 0 [00:48:42] the second derivative is greater than 0 or in other words it looks like that [00:48:43] or in other words it looks like that right so that's a convex function that X [00:48:48] right so that's a convex function that X be a random variable then F of the [00:48:59] be a random variable then F of the expected value of x is less than equal [00:49:01] expected value of x is less than equal to the expected value of x [00:49:25] maybe young here's an example right so [00:49:32] maybe young here's an example right so here's a let's see that's the function f [00:49:38] here's a let's see that's the function f of X and let's say that these are the [00:49:40] of X and let's say that these are the values 1 2 3 4 5 and suppose that X is [00:49:47] values 1 2 3 4 5 and suppose that X is equal to 1 with probability 1/2 is equal [00:49:53] equal to 1 with probability 1/2 is equal to 5 probably just an illustration then [00:50:03] here is the F of 1 here is F of 5 here [00:50:15] here is the F of 1 here is F of 5 here is f of 3 and F of 3 is f of the [00:50:19] is f of 3 and F of 3 is f of the expected value of x right because so the [00:50:22] expected value of x right because so the expected value of x and sometimes I [00:50:25] expected value of x and sometimes I write 2 so called the square brackets [00:50:27] write 2 so called the square brackets it's the average of X is equal to 3 and [00:50:30] it's the average of X is equal to 3 and so the expected value seems to be F of [00:50:34] so the expected value seems to be F of the expected value of x is equal to this [00:50:37] the expected value of x is equal to this value whereas the expected value of f of [00:50:42] value whereas the expected value of f of X is the mean of F of 1 and F of 5 right [00:50:53] X is the mean of F of 1 and F of 5 right so the expected value of f of X f of X [00:50:55] so 
the expected value of f of X f of X is a 50% chance of being F of 1 and a [00:50:57] is a 50% chance of being F of 1 and a 50% chance of being a 4/5 and so the [00:51:01] 50% chance of being a 4/5 and so the expected value of f of X is equal to [00:51:03] expected value of f of X is equal to this value in the middle let's really [00:51:05] this value in the middle let's really take these two take this value and this [00:51:08] take these two take this value and this value and take the mean so it's this [00:51:09] value and take the mean so it's this value up here and and this value [00:51:14] expensive value and so in this example [00:51:19] expensive value and so in this example the expected value of f of X is greater [00:51:22] the expected value of f of X is greater than F of the expected value of x as [00:51:25] than F of the expected value of x as predicted by Jensen's inequality I'm [00:51:28] predicted by Jensen's inequality I'm going to just draw one illustration that [00:51:30] going to just draw one illustration that may or may not help is some of my [00:51:31] may or may not help is some of my friends like it I sometimes use it but [00:51:33] friends like it I sometimes use it but it was confusing then don't worry about [00:51:34] it was confusing then don't worry about it but it turns out that if you draw a [00:51:37] it but it turns out that if you draw a line that connects these two then the [00:51:40] line that connects these two then the midpoint of this line is the height of F [00:51:43] midpoint of this line is the height of F of expected value of x right so the [00:51:46] of expected value of x right so the height of this you know so given these [00:51:48] height of this you know so given these two points this point in this point if [00:51:50] two points this point in this point if you draw this line it's called a chord [00:51:52] you draw this line it's called a chord then the height of this point is [00:51:57] then the height of this point is 
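The two-point example can be checked numerically. The lecture leaves f generic; f(t) = t² here is my own choice of a concrete convex function (f'' = 2 > 0) to make the numbers explicit.

```python
import numpy as np

# X = 1 with probability 1/2, X = 5 with probability 1/2, as in the example.
x_vals = np.array([1.0, 5.0])
p = np.array([0.5, 0.5])

def f(t):
    return t ** 2  # a concrete convex function: f'' = 2 > 0

f_of_E = f(np.dot(p, x_vals))   # f(E[X]) = f(3) = 9
E_of_f = np.dot(p, f(x_vals))   # E[f(X)] = (f(1) + f(5)) / 2 = 13

assert f_of_E <= E_of_f         # Jensen: f(E[X]) <= E[f(X)]
```

Here f(E[X]) = 9 while E[f(X)] = 13: the chord's midpoint sits strictly above the curve, matching the picture.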
[00:52:06] expected value of f(X), and this point down on the curve is f of the expected value of X. And in any convex function — really, take any convex function, also called a bowl-shaped function — if you draw any chord, its midpoint is always higher [00:52:30] than that green point on the curve, which is another way of seeing why Jensen's inequality holds true, okay? If this visualization doesn't help, don't worry about it. But actually, what a lot of my friends and I do — you know, we keep forgetting which direction Jensen's inequality goes, which is not great — so all of my friends who don't remember draw this picture, draw that chord, and then we can quickly figure out which way the inequality goes.

[00:53:21] All right, so one addendum: if, further, the second derivative is strictly greater than zero, then we say f is strictly convex. [00:53:52] So, let's see — a straight line is also a convex function, right? It satisfies the first condition; it turns out a straight line is also a convex function. But what this addendum is saying is that if f is a strictly convex function — meaning, roughly, that it's not a straight line, that its curvature is always bending upward — then the only way [00:54:15] for the left- and right-hand sides to be equal is if X is a constant, meaning it's a random variable that always takes on the same value, okay? So Jensen's inequality says — sorry, I think I reversed the order of those two in that equation, but that doesn't matter — Jensen's inequality says the left-hand side is always less than or equal to the right-hand side, and the only way they're equal is if X, you know, is a random variable that always takes on the same value.
[00:54:58] [Student question.] So it turns out — what if the function has a flat part, but X nonetheless does vary? So, let's see: one way that could happen would be if the function were like this, and then if you draw the chord, its midpoint is no higher than this point. [00:55:22] If you have a flat part here, then the function is not strictly convex, and so you still have less-than-or-equal-to, but equality can hold even when X is random. So, um — and we'll actually end up using this in a little bit. And again, for the strict case, properly stated — you know, for those of you that have taken classes in advanced probability, the technical way of saying "X is a constant" is "X is equal to E[X] [00:55:52] with probability one." [00:56:01] You know, I think for all practical human purposes you do not need to worry about this, but if you've taken a course in measure theory, the professor in measure theory will be happy if you say it that way when you say X is a constant. But okay — don't worry about it.

[00:56:21] Okay, now, um, just one more addendum to this: the form of Jensen's inequality we're going to use is actually the form for a concave function. So instead of convex, I'm going to say concave. And, you know, a concave function is just the negative of a convex function, right — if you take a convex function and take the negative of that, it becomes concave — and so the whole thing works with everything flipped around the other way. [00:57:01] Okay, so the form of Jensen's inequality we're going to use is actually the concave form, and we're actually going to apply it to the log function. So the log function, right — log x looks like this — that's a concave function, and so the inequality we'll use goes in this direction. All right.
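A quick numeric check of the flipped, concave form with the log function, reusing the same two-point random variable as the earlier convex example (the variable names are mine):

```python
import numpy as np

# X = 1 with probability 1/2, X = 5 with probability 1/2.
x_vals = np.array([1.0, 5.0])
p = np.array([0.5, 0.5])

log_of_E = np.log(np.dot(p, x_vals))  # log(E[X]) = log 3
E_of_log = np.dot(p, np.log(x_vals))  # E[log X] = (log 1 + log 5) / 2

assert E_of_log <= log_of_E           # concave form: E[log X] <= log(E[X])
```

With log the chord now lies below the curve, so the inequality direction is reversed relative to the convex case, which is exactly the form the EM derivation will use.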
so just the density estimation problem [00:57:59] so just the density estimation problem meaning density estimation means you [00:58:02] meaning density estimation means you want to estimate P of X all right so we [00:58:04] want to estimate P of X all right so we have a model of a P of X comma Z with [00:58:13] have a model of a P of X comma Z with parameters theta and so you know instead [00:58:16] parameters theta and so you know instead of writing out Mu Sigma Nu Sigma Phi [00:58:21] of writing out Mu Sigma Nu Sigma Phi like we did for the mixture of gaussians [00:58:22] like we did for the mixture of gaussians I'm just gonna capture all the [00:58:24] I'm just gonna capture all the parameters you have whatever your [00:58:25] parameters you have whatever your parameters are obviously capture them in [00:58:27] parameters are obviously capture them in one variable theta and you only observe [00:58:34] one variable theta and you only observe thanks so your training set looks like [00:58:40] thanks so your training set looks like that so the UM log likelihood of the [00:58:48] that so the UM log likelihood of the parameters theta is equal to some of [00:58:53] parameters theta is equal to some of your training examples log hearings i [00:58:57] your training examples log hearings i franchised by theta and this in turn is [00:59:03] franchised by theta and this in turn is log of sum over Z P of X I see I [00:59:13] franchise by theta right because P of X [00:59:20] franchise by theta right because P of X you know is just taking the Joint [00:59:22] you know is just taking the Joint Distribution and summary notes [00:59:24] Distribution and summary notes marginalizing out Zi [00:59:28] and so what we want is maximum [00:59:34] and so what we want is maximum likelihood estimation which is to find [00:59:36] likelihood estimation which is to find the value of theta that maximizes is [00:59:42] the value of theta that maximizes is long likelihood and what well like 
And what we'd like to do is derive — we're now going to derive — an algorithm, which will turn out to be the EM algorithm, an iterative algorithm for finding the maximum likelihood estimate of the parameters theta. [01:00:05] So let me draw a picture that you can keep in mind as we go through the math. The horizontal axis is the space of possible values of the parameters theta, and there's some function l(theta) that you're trying to maximize. [01:00:35] And so what EM does is this: you initialize theta at some value, maybe randomly initialized — so similar to k-means clustering, where we just, you know, randomly initialized the mu's for the mixture of Gaussians. What the EM algorithm does in the E-step is construct a lower bound, shown in green here, for the log-likelihood, and this lower bound — this green curve — has two properties. One is that it is a lower bound: everywhere you look, over all values of theta, the green curve lies below the blue curve, so it's a lower bound. And the second property the green curve has is that it is equal to the blue curve at the current value of theta. So what the E-step does — which you'll see later on; just keep this picture in mind as we go through the E-step of EM — is, um, it'll construct a lower bound that looks like this, right. Oh, and also, to foreshadow a part of the derivation: there was that addendum to Jensen's inequality, where we said that under certain conditions it holds with equality — if X is a constant, then f(E[X]) = E[f(X)], the two things are equal. We want things to be equal — we want the green curve to be equal to the blue curve at the old value of theta — so we'll use that addendum to Jensen's inequality when we get to that. So that's the E-step: draw the green curve. [01:02:17] And then what the M-step does is take the green curve and find its maximum, and one step of EM will then move theta from this value to this value, okay. So the E-step constructs the green curve and the M-step finds the maximum of the green curve, and that's one iteration of EM. On the second iteration of EM, now that you're at this red point, it will construct a new lower bound — a red curve, say; again, everywhere the red curve is below the blue curve, and the two values are equal at this new value of theta; that's the E-step — and the M-step will maximize this red curve, and so on. Now you're here, construct another bound, maximize it, and you can kind of tell that if you keep running EM, it is constantly trying to increase l(theta), trying to increase the log-likelihood, until it converges to a local optimum.
The EM algorithm does converge, but only to a local optimum — so if there was another, even bigger peak over there, it may never find its way over to that other, better optimum. But the EM algorithm, by repeatedly doing this, will hopefully converge to a pretty good local optimum. All right, so that's roughly how we do that. [01:04:10] So, I've already said that our goal is to find the parameters theta that maximize this. And the equation we wrote just now is l(theta) = sum over i of log of sum over z^(i) of p(x^(i), z^(i); theta) — okay, so this is just what we had written down, I guess, on the left. What I'm going to do next is multiply and divide by the same thing: l(theta) = sum over i of log of sum over z^(i) of Q_i(z^(i)) * [ p(x^(i), z^(i); theta) / Q_i(z^(i)) ], [01:05:11] where Q_i(z^(i)) is a probability distribution — i.e., sum over z^(i) of Q_i(z^(i)) equals one. So we're multiplying and dividing by some probability distribution — we'll decide later how to come up with this distribution Q_i — but, you know, I'm allowed to construct a probability distribution and multiply and divide by the same thing, right. [01:05:50] Now if you look at this — all right, let's put square brackets here — if these Q_i's are a probability distribution, meaning that sum over z^(i) of Q_i(z^(i)) sums to one, then the thing inside the log is equal to an expected value, with z^(i) drawn from the Q_i distribution. Let me use colors to make this clearer. [01:06:42] Right, so the way you compute the expected value of, you know, some function of z is you sum, over all the possible values of z^(i), the probability of z^(i) times what that function is. So this equation is just l(theta) = sum over i of log of the expectation, with respect to z^(i) drawn from the Q_i distribution, of the thing in the purple square brackets — of p(x^(i), z^(i); theta) / Q_i(z^(i)). [01:07:10] Now, using the concave form of Jensen's inequality, we have that this is greater than or equal to sum over i of the expectation, over z^(i) drawn from Q_i, of log [ p(x^(i), z^(i); theta) / Q_i(z^(i)) ]. So this is the form of Jensen's inequality where f(E[X]) >= E[f(X)], where here f is the logarithm — the log function is a concave function, it looks like that — and so, using, I guess, the form of Jensen's inequality with the signs reversed, f(E[X]) >= E[f(X)], you get that the log of an expectation is greater than or equal to the expectation of the log. [01:08:35] And then finally, let me just take this expectation and unpack it one more time: this is now sum over i of sum over z^(i) of Q_i(z^(i)) log [ p(x^(i), z^(i); theta) / Q_i(z^(i)) ] — so I just took that expected value and turned it back into a sum over the random variable's probabilities times that thing. Okay, so if you remember the picture from the middle, what we wanted to do was construct a function — construct this green curve that's a lower bound for the blue curve. And if you view this formula here as a function of theta — right, so your x's are just your data, and z is a variable you sum over — then this whole thing is a function of theta.
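The inequality just derived — the lower bound never exceeds the log-likelihood, for any choice of the Q_i's — can be spot-checked numerically. A sketch of mine, with hypothetical data and an arbitrary random choice of Q:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Hypothetical 1-D two-component mixture and data.
x = np.array([-2.1, -1.9, 1.8, 2.2])
phi, mu, sigma = np.array([0.5, 0.5]), np.array([-2.0, 2.0]), np.array([1.0, 1.0])

joint = phi * gauss_pdf(x[:, None], mu, sigma)   # p(x_i, z = j; theta), shape (n, k)
log_lik = np.sum(np.log(joint.sum(axis=1)))      # sum_i log sum_z p(x_i, z; theta)

rng = np.random.default_rng(1)
Q = rng.random(joint.shape)
Q /= Q.sum(axis=1, keepdims=True)                # each Q_i is a distribution over z

# Jensen lower bound: sum_i sum_z Q_i(z) log( p(x_i, z; theta) / Q_i(z) )
elbo = np.sum(Q * np.log(joint / Q))
print(elbo, log_lik)
assert elbo <= log_lik + 1e-12                   # the bound holds for any Q
```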
So this whole formula here is a function of the parameters theta, and what we're showing is that this formula is a lower bound for the log-likelihood l(theta). [01:10:07] [Student question] Oh, how did we get to this? Sure, sure. So let's say that z takes on values one through ten, right — say z is a ten-sided die — and I want to compute, you know, the expected value of some function g of z. Then the expected value of g(Z) is the sum, over all the possible values z, of the probability that Z takes that value, times g(z) — right, so that's the expected value of a function of a random variable. And similarly, the expected value of Z is sum over z of P(Z = z) times z — right, that's the average of the random variable.
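The ten-sided-die example can be written out directly; this tiny check is mine, with g(z) = z**2 chosen arbitrarily as the function:

```python
import numpy as np

# E[g(Z)] = sum_z P(Z = z) g(z) for a fair ten-sided die.
z = np.arange(1, 11)
p = np.full(10, 0.1)          # P(Z = z) = 1/10 for each face

e_g = np.sum(p * z**2)        # E[Z^2] = 38.5
e_z = np.sum(p * z)           # E[Z]   = 5.5, the plain average of the die
print(e_g, e_z)
```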
And in the notation we have here, the probability of z taking on different values is denoted Q_i(z), which is why we wind up with that formula. Does that make sense? Okay. [01:11:33] All right — if one of these steps doesn't make sense, then, you know... any other questions? Okay. All right. [01:12:04] Now, one of the things we want when constructing this green lower bound is for that green lower bound to be equal to the blue function at this point, right — this is actually how you guarantee that when you optimize the green function, by improving on the green function you're improving on the blue function. So we want this lower bound to be tight, right — meaning the two functions are equal, or tangent to each other. So, in other words, we want this inequality to hold with equality: we want the left-hand side and the right-hand side to be equal at the current value of theta. [01:13:03] So on a given iteration, with the current parameters equal to theta, we want — I know this is a lot of math, but, you know, we want the left- and right-hand sides to be equal to each other, because that's what it means for the lower bound to be tight, for the green curve to be exactly touching the blue curve as we construct that lower bound. [01:14:11] And so for this to be true, we need the random variable inside to be a constant: we need p(x^(i), z^(i); theta) / Q_i(z^(i)) to be equal to a constant, meaning that no matter what value of z^(i) you plug in, this should evaluate to the same value — you know, in other words, the ratio between the numerator and the denominator must always be the same. And fortunately, so far we have not yet specified how we'll choose this distribution for z^(i), right — so far the only constraint we have is that Q_i has to be a probability distribution over z^(i). We could choose whatever distribution we want for z^(i), and it turns out that
we can set Q_i(z^(i)) to be proportional to p(x^(i), z^(i); theta). And this means that for any value of z — you know, where z indicates whether the example is from Gaussian one or Gaussian two, right — the chance of z^(i) taking on one or two is proportional to this. And I don't want to prove it — this is proven in the lecture notes — but it turns out that the Q_i's need to sum to one, so one way to ensure that this is proportional to the right-hand side is to just take the right-hand side. So, let's see. [01:16:16] Right, so the Q_i's have to sum to one, and so one way to ensure the proportionality is to just take the right-hand side and normalize it so it sums to one. And after a couple of steps that, honestly, I don't want to do here, you can show that this results in setting Q_i(z^(i)) equal to the posterior probability p(z^(i) | x^(i); theta), okay. And so — sorry, I skipped a couple of steps here, which you can get from the lecture notes — but it turns out that if you want this to be constant, meaning whether you plug in z^(i) = 1 or z^(i) = 2 or whatever, it evaluates to the same constant, the only way to do that is to make sure the numerator and the denominator are proportional to each other. And because Q_i(z^(i)) is a density that must sum to one, one way to make it
proportional is to just set it to the right-hand side, normalized to sum to one, okay — and we derive this a little bit more carefully in the lecture notes. [01:17:36] So, just to summarize, this gives us the EM algorithm — let's take everything we were just doing and wrap it up into an algorithm. In the E-step, we're going to set Q_i(z^(i)) equal to p(z^(i) | x^(i); theta) — and previously these were the w_ij's, right; previously we stored these probabilities in the variables we called the w_ij's. And then in the M-step, we're going to take that lower bound that we constructed — which is this function, sum over i of sum over z^(i) of Q_i(z^(i)) log [ p(x^(i), z^(i); theta) / Q_i(z^(i)) ] — and maximize it with respect to theta. Okay, and so remember, in the E-step we constructed this thing on the right-hand side, which is a lower bound for the log-likelihood, and so for a fixed value of Q you can maximize this with respect to theta, and that updates theta — you know, maximizing the green lower bound, that's what the M-step does.
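The two steps just summarized can be sketched end to end for a 1-D mixture of two Gaussians. This is my own minimal illustration — the data, initial values, and variable names are made up, and the closed-form M-step updates are the standard mixture-of-Gaussians ones the lecture alludes to — and it also checks that each EM iteration never decreases l(theta):

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Synthetic data drawn from two Gaussians, and a deliberately poor initialization.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])
phi, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

def log_likelihood():
    return np.sum(np.log((phi * gauss_pdf(x[:, None], mu, sigma)).sum(axis=1)))

prev = -np.inf
for _ in range(50):
    # E-step: w_ij = Q_i(z_i = j) = p(z_i = j | x_i; phi, mu, sigma), by Bayes' rule.
    joint = phi * gauss_pdf(x[:, None], mu, sigma)
    w = joint / joint.sum(axis=1, keepdims=True)
    # M-step: maximize the lower bound in closed form (standard GMM updates).
    phi = w.mean(axis=0)
    mu = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)
    sigma = np.sqrt((w * (x[:, None] - mu) ** 2).sum(axis=0) / w.sum(axis=0))
    cur = log_likelihood()
    assert cur >= prev - 1e-9   # EM never decreases the log-likelihood
    prev = cur

print(phi, mu, sigma)           # should approach ~[0.4, 0.6], ~[-2, 3], ~[1, 1]
```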
And if you iterate these two steps, then you'll find that this should converge to a local optimum. Okay. Oh, and there's maybe one obvious question: why don't we just try to maximize over theta directly — why don't we try to maximize the log-likelihood directly? It turns out that if you take the mixture of Gaussians model and try to take derivatives of this and set the derivatives equal to zero, there's no known way to solve in closed form for the value of theta that maximizes the log-likelihood. But you'll find that for the mixture of Gaussians model — and for many models, including factor analysis, which we'll talk about on Wednesday — if you actually plug in the Gaussian density, if you actually plug in the mixture of Gaussians model for p, and, you know, take the derivatives, set them equal to zero, and solve, you will be able to find an analytic solution for maximizing this M-step, and it'll be exactly what we had worked out. [01:19:52] Okay, but so this derivation shows that the EM algorithm, you know, is a maximum likelihood estimation algorithm, with the optimization solved by constructing lower bounds and optimizing those bounds. Okay, all right, that's it for today — the midterm only covers up to here, right, so this stuff will be on the midterm, but what we talk about on Wednesday on factor analysis will not. Okay, so let's break for today, and I'll see you guys on Wednesday.

================================================================================
LECTURE 015
================================================================================
Lecture 15 - EM Algorithm & Factor Analysis | Stanford CS229: Machine Learning
Andrew Ng - Autumn 2018
Source: https://www.youtube.com/watch?v=tw6cmL5STuY
---
Transcript
[00:00:03] All right, hey everyone, welcome back. So what we'll see today is additional elaborations on the EM — the expectation-maximization — algorithm. And so what you'll see today is: we'll go over, you know, a quick recap of what we talked about with EM on Monday, and then
describe how you can monitor whether EM is converging. [00:00:39] And on Monday we talked about the mixture of Gaussians model and started deriving EM for that, and what I'll do is take these two equations and map them back to, specifically, the E and M steps that you saw for the mixture of Gaussians model — to see exactly how these map to, you know, updating the weights w_ij and so on, and how you should derive the M-step. And then most of what I'll spend today talking about is a model called the factor analysis model, and this is a model useful for data that can be very high-dimensional, even when you have very few training examples. So what I want to do is talk a bit about properties of Gaussian distributions, then describe the factor analysis model — some more about Gaussian distributions — and then we'll derive EM for the factor analysis model. [00:01:31] And I wanted to talk about factor analysis for two reasons, I guess. One is, you know, it's a useful algorithm in its own right, and second, the derivation of EM for factor analysis is actually one of the trickier ones, and there are key steps in how you actually derive the E and M steps that I think you learn better — or master better — by going through the factor analysis example. Okay. [00:01:56] Um, so just to recap: last Monday — or, on Monday — we talked about the EM algorithm, and we wound up figuring out this E-step and this M-step. Remember that if this is the log-likelihood that you're trying to maximize, what the E-step does is construct a lower bound — this is a function of theta, so this thing on the right-hand side is a function of the parameters theta. What we proved last time was that that function is a lower bound of the log-likelihood, right, and depending on what you choose for Q, you get different lower bounds. So for
one choice of Q you might get this little bound, for a different choice of Q you might get that lower bound, and for yet another choice you might get that lower bound. And what the E-step does is use Q to get a lower bound that is tight — that just touches the log-likelihood at the current value of theta — and what the M-step does is choose the parameters theta that maximize that bound. All right, so that's the EM algorithm that we saw. [00:02:59] Now, um, I want to step through how you would take this, you know, slightly abstract mathematical definition of EM and derive a concrete algorithm that you would implement, right — in, you know, in Python. And so let's just step through this for the mixture of Gaussians model. So for the mixture of Gaussians model, we had a model p(x^(i), z^(i)) = p(x^(i) | z^(i)) p(z^(i)), and the model was that z is multinomial with some set of parameters phi. Oh, and so, you know, the
probability that z^(i) = j is equal to phi_j, right? So phi is just a vector of numbers that sum to one, specifying what's the chance of z being each of the k possible discrete values. And then we have that x^(i), given z^(i) equals j, is Gaussian with some mean mu_j and covariance Sigma_j. And what we said last time was that, um, this is a lot like the Gaussian discriminant analysis model, and the trivial difference is that this is Sigma_j instead of Sigma — right, Gaussian discriminant analysis had the same Sigma for every class — but that's not the key difference. The key difference is that in this density estimation problem z is not observed; z is a latent random variable, which is why we have all this machinery of EM. [00:04:44] So now that you have this model, this is how you would derive the E and the M steps, right? So the E-step is, you know, you have Q_i of z^(i), right, but let me just write this as
Q_i(z^(i) = j). This is sort of the probability of z^(i) equals j — I know this notation is a little bit strange, but under the Q_i distribution, what do you want the chance of z being equal to j to be, right? And so in the E-step you set that to p(z^(i) = j | x^(i)), parameterized by all of the parameters, and we actually saw with Bayes' rule, right, how you would work this out, okay? And what we do in the E-step is solve for this number, which is what we wrote as w^(i)_j last time, okay? [00:05:54] And so, you remember, if you have a mixture of two Gaussians — maybe that's the first Gaussian and that's the second Gaussian — and you have an example x^(i) here, it looks like it's more likely to have come from the first than the second Gaussian, and so this would be reflected in w^(i)_j: that example is assigned more to the first Gaussian than to the second Gaussian. So what you implement in code is, you know, you write
code to compute this number and store it in w^(i)_j. [00:06:34] And then for the M-step, you will want to maximize, over the parameters of the model — right, phi, mu, and Sigma, these are the parameters of the mixture of Gaussians — a sum over i and a sum over the z^(i). And the way you actually derive this is you write this as a sum over i — z^(i), you know, takes on certain discrete values, so you turn the z^(i) into a j; z^(i) can be, I guess, one or two if you have a mixture of two Gaussians — so you sum over all the indices of the different clusters, of w^(i)_j times the log of a ratio. The numerator is going to be the Gaussian density times phi_j — that's the numerator — and so, you know, this term is equal to this first Gaussian term times that second term, right, because this term is p(x^(i) | z^(i); parameters) and this is just Q. And then you take this and divide it by w^(i)_j, okay? So I'm going to step you through the steps you would go through.
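The E-step just described — use Bayes' rule to compute this posterior and store it as w^(i)_j — can be sketched in a few lines of Python. This is a minimal 1-D illustration with function names of my own choosing, not code from the course (the lecture's model uses multivariate Gaussians with full covariance matrices):

```python
import math

def gaussian_pdf(x, mu, var):
    # Density of a 1-D Gaussian N(mu, var) at the point x.
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def e_step(xs, phi, mus, variances):
    # For each example x^(i), compute w[i][j] = p(z^(i) = j | x^(i); theta)
    # by Bayes' rule: p(x | z = j) p(z = j) / sum_l p(x | z = l) p(z = l).
    w = []
    for x in xs:
        joint = [gaussian_pdf(x, mus[j], variances[j]) * phi[j]
                 for j in range(len(phi))]
        total = sum(joint)  # this is p(x^(i); theta)
        w.append([p / total for p in joint])
    return w
```

Each row of `w` sums to one, since its entries are posterior probabilities over the k possible values of z^(i).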
If you're deriving, um, using that E-step and M-step we wrote up above — if you're deriving this for the mixture of Gaussians model — these are the steps of algebra. [00:08:56] So in order to perform this maximization, what you will do is, you want to maximize this formula, right, this big double summation, with respect to each of the parameters phi, mu, and Sigma. And so what you would do is, you know, take this big formula, right, and take the derivatives with respect to each of the parameters. So you take the derivative with respect to mu_j of that big formula on the left, set it to zero, right — and then it turns out, if you do this, you will derive that mu_j should be equal to the sum over i of w^(i)_j x^(i), divided by the sum over i of w^(i)_j. And this is what we said is how you update the means mu, right? The w^(i)_j's are the strength with which x^(i) — so w^(i)_j is, informally, the strength
with which x^(i) is assigned, right, to Gaussian j; and more formally, this is really p(z^(i) = j | x^(i); parameters). And so you end up with this formula — but the rigorous way to show that this is the right formula for updating mu_j is to look at this objective, take the derivative, set it equal to zero to maximize, and therefore derive that equation for mu_j, you know, by solving for the value of mu_j that maximizes this expression. And similarly, you know, you take derivatives of this thing with respect to phi and set them to zero, and take derivatives of this thing with respect to Sigma and set those to zero, and that's how you would derive the update equations in the M-step for phi and for Sigma as well, okay? [00:11:15] Um, and so, for example, when you do this, you find that the optimal value for phi_j is 1/m times the sum over i of w^(i)_j — we had this near the start of Monday's lecture as well, okay.
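In code, these M-step updates come out as simple weighted averages. Again a minimal 1-D sketch with illustrative names of my own (the course's model would use vector means and covariance matrices rather than scalar means and variances); `w` holds the E-step quantities w^(i)_j:

```python
def m_step(xs, w):
    # Re-estimate phi, the means, and the variances of a 1-D mixture of
    # Gaussians from the E-step weights w[i][j].
    m = len(xs)        # number of examples
    k = len(w[0])      # number of mixture components
    # phi_j = (1/m) * sum_i w[i][j]
    phi = [sum(w[i][j] for i in range(m)) / m for j in range(k)]
    mus, variances = [], []
    for j in range(k):
        wj = sum(w[i][j] for i in range(m))
        # mu_j = sum_i w[i][j] x^(i) / sum_i w[i][j]
        mu_j = sum(w[i][j] * xs[i] for i in range(m)) / wj
        # var_j is the corresponding weighted average squared deviation
        var_j = sum(w[i][j] * (xs[i] - mu_j) ** 2 for i in range(m)) / wj
        mus.append(mu_j)
        variances.append(var_j)
    return phi, mus, variances
```

With hard 0/1 weights this reduces to ordinary per-cluster means and variances, which is a quick sanity check on the formulas.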
[00:11:38] Um, so this is the process of how you would take the E-step and M-step I wrote up and apply them to a specific model, such as the mixture of Gaussians model, and that's how you, you know, solve for the maximization in the M-step, okay? And so what I'd like to do today is describe the application of EM to a more complex model called the factor analysis model, and so it's important — I hope you understand the mechanics of how you do this, because we're going to do this today for a different model. [00:12:10] Questions about this before I move on? [00:12:22] Oh — so in order to, you know, foreshadow a little bit what we'll see: when it comes down to the mixture of Gaussians model — excuse me, the factor analysis model, which is what we're going to spend most of today talking about — in the factor analysis model, instead of z^(i) being discrete, z^(i) will be continuous, right, and there z^(i) will be distributed Gaussian. So in the
mixture of Gaussians model we had a joint distribution for x and z where z was a discrete random variable, and in the factor analysis model we'll describe a different model, you know, for p of x and z where z is continuous, and so instead of a sum over z^(i) there will just be an integral over z^(i), d z^(i), right? So the sum becomes an integral. And it turns out that if you go through the derivation of the EM algorithm that we worked out on Monday — all of the steps with Jensen's inequality — all of those steps work exactly as before; many of you can check every single step for whether, if z^(i) were continuous, it works the same as before once you change the sum to an integral, all right? [00:14:16] So I want to mention one other view of EM that's equivalent to everything we've seen up until now, which is, um, let me define J(theta, Q) — that is, J(theta, Q) = sum over i, sum over z^(i), of Q_i(z^(i)) log [ p(x^(i), z^(i); theta) / Q_i(z^(i)) ] — okay, it's that formula that you've
seen a few times now. What we proved on Monday was, um, that l(theta) is greater than or equal to J(theta, Q), right, and this is true for any theta and any choice of Q, okay? So using Jensen's inequality, you can show that, you know, J, for any choice of theta and Q, is a lower bound for the log likelihood of theta. So it turns out that an equivalent view of EM, to everything we've seen before, is that in the E-step what you're doing is maximizing J with respect to Q, and in the M-step you maximize J with respect to theta, right? [00:15:46] So in the E-step you're picking the choice of Q that maximizes this, and it turns out that the choice of Q we have will set J equal to l, and then the M-step maximizes this with respect to theta and pushes the value of l even higher. So this algorithm is sometimes called coordinate ascent: if you have a function of two variables, and you maximize with
respect to this one, then with respect to that one, and you go back and forth and optimize with respect to one at a time — that's a procedure that's sometimes called coordinate ascent, because you're maximizing with respect to one coordinate at a time. And so EM is a coordinate ascent algorithm relative to this cost function J, right? And, you know, on every iteration J ends up being set to l, which is why, as the algorithm increases J, you know that the log likelihood is increasing on every iteration. And if you want to track whether the EM algorithm is converging, or how it's converging, you can plot, you know, the value of J or the value of l on successive iterations and see that this value is going up monotonically, and then when it plateaus and isn't improving anymore, then you might have a sense that the algorithm is converging. [00:17:12] All right. [00:17:17] Okay, so that's the basic algorithm for EM and the mixture of Gaussians.
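That diagnostic is easy to sketch in Python: run EM and record the log likelihood after every iteration; the recorded values should only go up, then plateau. A self-contained 1-D toy version — the names and data here are my own, not from the course:

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_with_history(xs, phi, mus, variances, n_iters):
    # Run EM on a 1-D mixture of Gaussians, recording l(theta) after
    # every iteration as a convergence diagnostic.
    k, m = len(phi), len(xs)
    history = []
    for _ in range(n_iters):
        # E-step: w[i][j] = p(z^(i) = j | x^(i); theta) by Bayes' rule
        w = []
        for x in xs:
            joint = [phi[j] * normal_pdf(x, mus[j], variances[j]) for j in range(k)]
            s = sum(joint)
            w.append([p / s for p in joint])
        # M-step: closed-form updates for phi_j, mu_j, var_j
        phi = [sum(w[i][j] for i in range(m)) / m for j in range(k)]
        mus = [sum(w[i][j] * xs[i] for i in range(m)) /
               sum(w[i][j] for i in range(m)) for j in range(k)]
        variances = [sum(w[i][j] * (xs[i] - mus[j]) ** 2 for i in range(m)) /
                     sum(w[i][j] for i in range(m)) for j in range(k)]
        # l(theta) = sum_i log p(x^(i); theta): should never decrease
        history.append(sum(math.log(sum(phi[j] * normal_pdf(x, mus[j], variances[j])
                                        for j in range(k))) for x in xs))
    return phi, mus, variances, history
```

Plotting `history` against the iteration number gives exactly the convergence curve described above.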
[00:17:20] What I want to do now is, um, start to talk about the factor analysis model, all right? So, um, before the factor analysis algorithm, let me actually, uh, compare and contrast the mixture of Gaussians with factor analysis and talk about that a little bit. For the mixture of Gaussians, let's say n equals 2 and m equals 100, right? Say you have a dataset with two features x1 and x2 — so n is two — and maybe you have a dataset that looks like this. You know, then a mixture of Gaussians would be a pretty good model for this dataset, right? With, say, one Gaussian there and a second Gaussian there, you can kind of capture a distribution like this with a mixture of two Gaussians, and this is one illustration of when you apply mixtures of Gaussians: in this picture m is much bigger than n, right? You have a lot more examples than you have dimensions. Where I would not use a mixture of
Gaussians — and where you'll see in a minute factor analysis will apply — is maybe if m is about similar to n, or even if m is much less than n, okay? And so, um, just for purposes of illustration, let's say m equals 30 and n equals 100, right? So let's say you have hundred-dimensional data but only thirty examples. [00:19:26] And to make this more concrete: you know, many years ago there was a Stanford PhD student that was placing temperature sensors around different Stanford buildings, and so what you do is you measure the temperature at many different places, right, around campus. But if you have a hundred sensors, you know, taking a hundred temperature readings around campus, but only thirty days of data, or maybe thirty examples, then you would have hundred-dimensional data, because each example is a vector of a hundred temperature readings, you know, at different points around this building,
say — but you may have only thirty examples, say thirty such vectors. And so the application that the Stanford PhD student at the time was working on was, he wants to model p(x), right? So this is x as a vector of a hundred temperature readings. Because if something goes wrong — for example, a bad case is if there's a fire in one of the rooms — then there'll be a very anomalous temperature reading in one place, and if you can model p(x), then if you ever observe a value of p(x) that is very small, you would say, oh, looks like there's an anomaly there, right? [00:20:49] And, well, rather than worry about fires, at Stanford the use case was actually energy conservation: if someone unexpectedly leaves a window open in the building you were studying, you know, and it's winter and it's warmer inside the building, and cool air blows in and the
temperature of one room drops, then you want to realize that something was going wrong with the windows, or with the temperature in part of the building, okay? So for an application like that, you need to model p(x) as a joint distribution over, you know, all of the different sensors, right? Actually, if you imagine, maybe just in this room, let's say we have thirty sensors in this room — then the temperatures at the thirty different points in this room will be highly correlated with each other. But how do you model this vector — a hundred-dimensional vector — with a relatively small training set? [00:21:48] So it turns out there's a problem with applying a Gaussian model, right? One thing you could do is model this as a single Gaussian and say that x is distributed N(mu, Sigma), right? And if you look at your training set of thirty examples and find the maximum likelihood estimates of the parameters, you
find that the maximum likelihood estimate of mu is just the average, and the maximum likelihood estimate of Sigma is this. But it turns out that if m is less than or equal to n, then Sigma — this covariance matrix — will be singular, and singular just means non-invertible. I'll show an illustration in a second. [00:23:15] But if you look at the formula for the Gaussian density, right — so the Gaussian density kind of looks like this, abstracting away some details — when the covariance matrix is singular, then this term, this determinant term, will be zero, so you end up with 1 over 0, and then Sigma inverse is also undefined, or blows up to infinity, depending on how you think about it, right? So, you know, the inverse of a matrix like, um, diag(1, 10), right, would be, I guess, diag(1, 1/10), and an example of a non-invertible matrix — a singular matrix — would be this, and you
can't actually calculate the inverse of that matrix, right? So it turns out that, um, if your number of training examples is less than the dimension of the data, and you use the usual formula to derive the maximum likelihood estimate of Sigma, you end up with a covariance matrix that is singular — singular just means non-invertible — which means the covariance matrix would look like this, and so in the Gaussian density, when we try to compute p(x), you get infinity over 0 — oh, sorry, not actually, zero over zero — doesn't matter, it's all bad. [00:24:45] Um, and I think, let me just illustrate what this looks like, which is: let's say m equals 2 and n equals 2, right? So you have two-dimensional data x1 and x2, so n equals two, and the number of training examples is also two. So you've seen me draw contours of Gaussian densities like this, right, like ellipses like that.
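Here's a tiny numerical check of that m = n = 2 case, with toy numbers of my own: the maximum likelihood covariance of two points in two dimensions always comes out singular, so the determinant term in the density is zero.

```python
def mle_gaussian_2d(xs):
    # Maximum likelihood Gaussian fit in 2-D: mu is the sample mean and
    # Sigma = (1/m) * sum_i (x^(i) - mu)(x^(i) - mu)^T.
    m = len(xs)
    mu = [sum(x[d] for x in xs) / m for d in (0, 1)]
    sigma = [[sum((x[a] - mu[a]) * (x[b] - mu[b]) for x in xs) / m
              for b in (0, 1)] for a in (0, 1)]
    return mu, sigma

def det_2x2(s):
    return s[0][0] * s[1][1] - s[0][1] * s[1][0]

# Two training examples (m = 2) in two dimensions (n = 2): all the mass
# sits on the line through the two points, and Sigma is singular.
mu, sigma = mle_gaussian_2d([(0.0, 0.0), (2.0, 2.0)])
# det_2x2(sigma) == 0, so the 1/|Sigma| factor in the density blows up.
```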
It turns out that if you have two examples in the two-dimensional space and you compute the maximum likelihood estimate of the parameters of the Gaussian fit to that data, then it turns out that these contours will look like that, except that instead of being very thin, as I'm drawing it, it'll be infinitely skinny, see? And you'll end up with a Gaussian density where — I can't draw lines, you know, of zero width on the whiteboard, right — but it turns out that the contours will be squished infinitely thin. So you end up with a Gaussian density all of whose mass is on the straight line over there, with infinitely thin contours, with, you know, this Gaussian centered on the plane, I guess, or on the line connecting these two points. [00:25:59] And so, first, there are practical numerical problems, right — as in, you'll get zero over zero if you try to compute p(x) for any
example. And second, this very poorly conditioned Gaussian density puts all the probability mass on this line segment, and so any example right over there, just a little bit off, has no probability mass — has a probability density of zero — because the Gaussian is squished infinitely thin, you know, on that line, okay? But, you know, this is just not a good model, right, for this data. [00:26:41] So what we're going to do is, uh, come up with a model that will work even for these applications, even for a dataset like this, right? There's actually — I think one of the very early applications, the origins of the factor analysis model, was actually in psychological testing, where you, you know, administer a psychology exam to people to measure different personality attributes, right? So you might measure — you might have a
[00:27:14] So you might have a hundred questions measuring a hundred psychological attributes, but a dataset of only thirty persons. Doing psych research, collecting survey data, is hard, so maybe you have a sample of 30 people, and each person answers a hundred questions. Each person gives you one example x whose dimension is a hundred, and you have only thirty of these. So if you want to model p(x) — to model how correlated the different psychological attributes of people are: is intelligence correlated with math ability, is that correlated with language ability, is that correlated with other things — then how do you build a model for p(x)?

[00:28:11] All right, so if the standard Gaussian model doesn't work, let's look at some alternatives. One thing you could do is constrain Sigma to be diagonal. Sigma, the covariance matrix, is an n-by-n matrix — in this case a hundred-by-hundred matrix — but let's say we constrain it to have just diagonal entries and zeros on the off-diagonals. So the diagonal entries of the square matrix take these values, and all of the off-diagonal entries are set to zero. That's one thing you could do, and it turns out to correspond to constraining your Gaussian to have axis-aligned contours. So this is a Gaussian with zero off-diagonals, this would be another one, and this would be another one — these are examples of contours of Gaussian densities with zero off-diagonals, with the axes here being x1 and x2. Whereas you cannot model something like this [a tilted ellipse] if your off-diagonals are 0.

[00:29:37] And if you do this, the maximum likelihood estimates of the parameters are pretty much what you'd expect: the maximum likelihood estimate of the mean vector mu is the same as before, and the maximum likelihood estimate of Sigma_jj is (1/m) times the sum over i of (x_j^(i) − mu_j)², the average squared deviation of feature j — not a huge surprise.

[00:30:06] And it turns out the covariance matrix now has n parameters, instead of n squared — or, by symmetry, about n²/2 — parameters: just the n diagonal entries. Now, the problem with this is that this modeling assumption assumes that all of your features are uncorrelated — any two features are completely uncorrelated. If you have temperature sensors in this room, it's just not a good assumption that the temperatures at all points of the room are completely uncorrelated, completely independent of each other.
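The diagonal-Σ option can be sketched in a few lines. This uses synthetic stand-in data (there is no real survey dataset here) with the lecture's n = 100 features and m = 30 examples, and shows the resulting covariance estimate is invertible even though m < n:

```python
import numpy as np

# Synthetic stand-in for the 30 survey responses (m = 30, n = 100).
rng = np.random.default_rng(0)
m, n = 30, 100
X = rng.normal(size=(m, n))

# MLE under the diagonal constraint:
#   mu_j      = (1/m) sum_i x_j^(i)          (same as the unconstrained case)
#   Sigma_jj  = (1/m) sum_i (x_j^(i)-mu_j)^2 (per-feature variance)
mu = X.mean(axis=0)
sigma_diag = ((X - mu) ** 2).mean(axis=0)   # n numbers instead of ~n^2/2
Sigma = np.diag(sigma_diag)                 # axis-aligned Gaussian contours
```

All n diagonal variances are positive (almost surely), so this Σ is full-rank — the singularity problem is gone, at the cost of assuming every pair of features is uncorrelated.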
[00:30:45] Or if you measure the psychological attributes of people, it's just not a great assumption that the different psychological measures are completely independent. So while this model takes care of the technical problem of the covariance matrix being singular — you can't fit the full model on a hundred-dimensional dataset with 30 examples, but you can fit this one, and you won't run into numerical or singular-covariance-matrix problems — it's just not a very good model. You're assuming nothing is correlated with anything else.

[00:31:36] Something else you can do is make an even stronger assumption. This is an even worse model, but I go through it because it'll be a building block for what we'll actually do later, which is: constrain Sigma to be lowercase sigma squared times I, the identity. So constrain Sigma to be not only diagonal, but to have the same entry in every diagonal element. Now you've gone from n parameters to just one parameter, and this means you're constraining the Gaussians to have circular contours. So this is an example of what you can model, this would be another example, and this is another example — you can model things like this where not only is every feature uncorrelated with every other feature, but every feature further has the same variance as every other feature.

[00:32:58] And the maximum likelihood estimate — not a huge surprise — is the average over the previous per-coordinate values: sigma² = (1/(mn)) times the sum over i and j of (x_j^(i) − mu_j)². So what we'd like to do is not quite use either of these options, which both make the really big assumption that the features are uncorrelated. What we'd like to do is build a model that you can fit even when you have very high-dimensional data and a relatively small number of examples.
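The σ²I option is a one-parameter special case of the diagonal model above; its MLE is just the average of the per-coordinate variance estimates. A minimal sketch on the same kind of synthetic stand-in data:

```python
import numpy as np

# Synthetic stand-in data again (m = 30 examples, n = 100 features).
rng = np.random.default_rng(1)
m, n = 30, 100
X = rng.normal(size=(m, n))

mu = X.mean(axis=0)
per_coord_var = ((X - mu) ** 2).mean(axis=0)   # the diagonal-model estimates
# MLE of the single shared parameter:
#   sigma^2 = (1/(m*n)) * sum_{i,j} (x_j^(i) - mu_j)^2
sigma2 = per_coord_var.mean()
Sigma = sigma2 * np.eye(n)                     # circular (spherical) contours
```

This is the "even worse" model from the lecture: one variance shared by all features, kept here only because it reappears later as a building block of factor analysis.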
[00:33:22] But we want a model that allows you to capture some of the correlations. So if we have 30 temperature sensors in this room, probably there are some correlations: maybe the ambient temperature in this whole building, or in this room, goes up and down as a whole, but maybe the lamps on one side heat up that side of the room a bit more. So there are correlations, but maybe you don't need the full covariance matrix either. What factor analysis will do is give us a model that you can fit even when you have, you know, a hundred dimensions and only thirty examples, that captures some of the correlations, but that doesn't run into the non-invertible covariance matrices that the naive Gaussian model does.

[00:34:22] All right, so let me describe the model — actually, let me check, any questions from anyone? [Student question] Oh sure, yes. Yes, there is one thing you can do: a common thing to do is apply a Wishart prior, and what that boils down to is adding a small diagonal value to the maximum likelihood estimate. In a technical sense it takes away the non-invertible-matrix problem, but it's actually not the best algorithm for a lot of types of data. With the Wishart, or inverse Wishart, prior, you basically take the maximum likelihood Sigma and add some constant to the diagonal. It takes care of the problem in a technical way, but it's not the best model for a lot of datasets, I see.

[00:35:20] [Student question] Oh yes — why go through option two when it's even worse than option one? Um, yes, option two is not a good option, but I need to use it as a building block for factor analysis.
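The questioner's fix — an (inverse-)Wishart prior that "adds a constant to the diagonal" — can be sketched directly on the earlier two-point example. The value of `eps` is an arbitrary illustrative choice, not anything prescribed in the lecture:

```python
import numpy as np

# Same made-up singular case as before: m = 2 points in n = 2 dims.
X = np.array([[1.0, 2.0],
              [3.0, 5.0]])
mu = X.mean(axis=0)
Sigma_mle = (X - mu).T @ (X - mu) / X.shape[0]   # singular MLE

# "Add some constant to the diagonal" (the MAP-style regularization
# the Wishart prior boils down to). eps is an arbitrary choice here.
eps = 1e-2
Sigma_reg = Sigma_mle + eps * np.eye(2)

print(np.linalg.det(Sigma_mle), np.linalg.det(Sigma_reg))
```

The regularized Σ is invertible, so the density can be evaluated — but, as the lecture says, this only fixes the technical problem; it is still a poor model of the data's actual correlation structure.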
[00:35:32] You'll see — it shows up as a small component of Psi. I actually plan these things out. [Student question] I see — yeah, actually, the machine learning world evolves all the time, which I find fascinating. If you look at the big tech companies, a lot of the large tech companies are all working on exactly the same problems: every large software/AI tech company works on machine translation, every one of them works on speech recognition, every one of them works on face recognition — and I've been part of these teams myself. And I think it's great that we have so much progress in machine translation, because so many people in so many large companies work on it; it's actually really gratifying to see so much progress on these problems that every single large tech company works on.

[00:36:24] One of the fascinating things I see is that, because all this work in the large tech companies targets very similar problems, one of the really overlooked parts of the machine learning world is small data problems. They're all working on big data — representing English and French and Chinese and Spanish sentences — but on small data, does it work? I think there's a disproportionately small amount of attention on small data problems, where instead of a hundred million images you maybe have 100 images. With some of the teams I work with these days at Landing AI, I actually spend a lot of my time thinking about small data problems, because a lot of the practical applications of machine learning — including the ones you see in your class projects — are actually small data problems. When one works with a healthcare system or a hospital, for a lot of the problems you only have 100 examples, or a thousand, or 10,000; you don't have a million patients with the same medical condition. And so — earlier this week I was using a slightly modified version of factor analysis on a manufacturing problem at Landing AI — I think a lot of these small data problems are where a lot of the exciting work in machine learning is to be done, and somehow it feels like a blind spot, like a gap, in a lot of the work done in the AI world today.

[00:37:51] [Student question] Yeah — why don't we use the same algorithms as with big data? It turns out that if you look at the computer vision world, there's a dataset everyone was working on — now we've moved past it and don't really use it much anymore — called ImageNet, which had a million images. Tons of computer vision architectures have been heavily designed for the use case of having exactly 1 million training examples. It turns out that the algorithm that works best when you have maybe 100 training examples looks different from the best learning algorithm for a million. So I think right now the machine learning world is not very good at understanding the scaling: the best algorithm for one training example, as far as we as a community have been able to invent algorithms, is different from the best algorithm for a thousand, and the best for a million is different again — and Facebook recently published a paper using 3.5 billion images, which is very large.

[00:39:04] So I think we don't actually have a good understanding of how to modify our algorithms so that one algorithm works on every single point of the spectrum from one example to a billion examples. There's a lot of work optimizing for different points of the spectrum, and there's been a lot of work optimizing for big data, which is great — we've built large systems that handle, whatever, petabytes of data a day, and that's great. But I feel that, relative to the number of application opportunities, there's a lot of work on small data still to be done, which I find very exciting. And I think of this as an example: the reason I was literally using a modified version of this model earlier this week on a manufacturing problem is that there isn't much data in those scenarios. All right, that was a bit off topic, but let's go on and describe the model.
[00:40:02] Well, hopefully — yeah, so this stuff does get used. So let's talk about the model. Similar to the mixture of Gaussians, I'm going to define a model with p(x, z) equal to p(x | z) times p(z), where z is hidden, okay? So that's the framework, same as the mixture of Gaussians. Let me now define the factor analysis model.

[00:40:50] First, z will be distributed according to a Gaussian density, z ~ N(0, I), where z is in R^d with d less than n. To make it concrete, maybe think of d = 3, n = 100, m = 30 — just a concrete example to keep in mind. And what we're going to assume is that x is equal to mu plus Lambda z — this is the capital Greek letter Lambda — plus epsilon, where epsilon is distributed Gaussian with mean 0 and covariance Psi.

[00:41:51] So the parameters of this model are mu, which is n-dimensional; Lambda, which is n by d; and Psi, which is n by n — and we're going to assume that Psi is diagonal, okay? And, let's see, an equivalent way to write that second equation is that, given the value of z, the conditional distribution of x — x given z — is Gaussian with mean mu plus Lambda z and covariance Psi. So this is p(z), and this is p(x | z). Once you've sampled z, x is computed as mu plus Lambda z — which is just some constant given z — and then you add Gaussian noise to it. So an equivalent way of defining that equation is to say that the mean of x conditioned on z is this first term, mu plus Lambda z, and the covariance of x given z is Psi, coming from that additional noise term epsilon that you add, okay?
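The two-stage generative process just defined — z ~ N(0, I_d), then x = μ + Λz + ε with ε ~ N(0, Ψ), Ψ diagonal — can be sampled directly. This sketch uses the lecture's running dimensions d = 3, n = 100, m = 30; the specific μ, Λ, Ψ values are arbitrary placeholders, not anything from the lecture:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, m = 3, 100, 30                 # latent dim, observed dim, sample size

# Placeholder parameters for illustration only.
mu = np.zeros(n)                               # n-dimensional mean
Lam = rng.normal(size=(n, d))                  # Lambda: n x d factor loadings
Psi = np.diag(rng.uniform(0.1, 1.0, size=n))   # diagonal noise covariance

# Generative process: z ~ N(0, I), x = mu + Lambda z + eps, eps ~ N(0, Psi).
Z = rng.normal(size=(m, d))
eps = rng.multivariate_normal(np.zeros(n), Psi, size=m)
X = mu + Z @ Lam.T + eps                       # m x n data matrix

# Marginally x ~ N(mu, Lambda Lambda^T + Psi): full-rank even though m < n,
# since the diagonal Psi contributes positive variance in every direction.
Sigma_implied = Lam @ Lam.T + Psi
```

Note the implied covariance ΛΛᵀ + Ψ has only nd + 2n free parameters rather than ~n²/2, which is exactly why this model remains fittable with m = 30 examples in n = 100 dimensions.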
[00:43:26] So let me go through a few examples. I think the intuition behind this model is this: suppose there are three powerful forces driving temperatures across this room. Maybe one force is just the temperature here in Palo Alto, here at Stanford; another is how bright the lights on the left side of the room are and how much they heat up that side; and another is how much the lights heat up the right side of the room. So say there are three main driving factors affecting the temperature of this room — that's when d would be equal to three. You assume there are three things in the world that drive the temperature of this room, and z is three-dimensional: the temperature in Palo Alto, kind of around this area; how bright the lights are on one side; and how bright they are on the other — and you try to capture that with three numbers.

[00:44:15] Given those three numbers — given z — the actual temperatures for the sensors we scatter around this room will be determined sensor by sensor. So plant 30 temperature sensors all over this room; each sensor will measure an actual temperature that is a linear function of those three powerful forces. If a sensor is on that side of the room, it will be affected more by how bright the lights on that side are; if the sensor is near the door, it will be more affected by the outside temperature, the temperature here in Palo Alto. So x will be a linear function — this first term, mu plus Lambda z, that I underlined — but beyond that term there's a little noise: each sensor has its own noise term, governed by this additional term epsilon.

[00:45:11] And the assumption that the matrix Psi is diagonal is saying that, after you compute the mean, the noise you observe at your sensor is independent of the noise at every other sensor. Maybe the sensor up there is just noisier, or catches a gust of wind or something, but you assume that the noise observed at different sensors is independent: the additional epsilon error term has a diagonal covariance matrix given by Psi, okay? So you can think of that as what factor analysis is trying to model.

[00:45:54] So let me go through a couple of examples of the types of data factor analysis can model. Oh, and again, bound by the constraints of the whiteboard, I'm going to have to go low-dimensional here. So let's say z is in R^1 and x is in R^2 — so in this example, I guess, d is equal to 1 and n is equal to 2 — and let's say m is 7.
[00:46:42] So what would be a typical sample generated by this model? What would be an example of the type of data this can model? [00:46:57] Well, this would be a typical sample of Z_i: Z is just drawn from a standard Gaussian, so Z is Gaussian with mean 0 and unit variance. So that's a number line, and if you draw seven points from the Gaussian, maybe you get a sample like that, okay? [00:47:18] And now let's say lambda is (2, 1), and let's just say mu is (0, 0). [00:47:33] So now let's compute lambda Z plus mu. Given a sample like that, if you compute lambda Z plus mu, this will now be in R^2; so here's x_1, here's x_2, and I'm gonna take those examples and map them to a line as follows. These examples are in R^1, so each Z is just a real number, and lambda Z plus mu is now two-dimensional, because lambda is a 2-by-1 matrix. [00:48:24] So you end up with this: this would be a typical sample of lambda Z plus mu, and it's a two-dimensional data set, but all of the examples lie perfectly on a straight line, okay? [00:48:36] Then finally, let's say that Psi, the covariance matrix, is equal to this diagonal covariance matrix, and this covariance matrix corresponds to x_2 having a bigger variance than x_1. So the density of epsilon has ellipses that look a little bit like this, taller than wide (the aspect ratio should technically be 1 over root 2, since the standard deviation would be root 2). [00:49:05] And so, in the last step, with x equals lambda Z plus mu plus epsilon, we're going to take each of these points we have and put a little Gaussian contour (I'm just drawing one contour, but yes, it is a 2D shape) on top of each of them, [00:49:30] and if you sample one point from each of these Gaussians, then maybe you get this example, this example, this example, and so on. So what I just did was look at each of those Gaussian contours and sample a point from that Gaussian, and so the red crosses here are a typical sample drawn from this model, okay? [00:49:52] And so if you have data that looks like this, that looks like the red crosses, then the Z's are latent random variables: when you get the data, you can't actually see Z. What you actually see is just the red crosses; that's your training set. And if you apply the factor analysis model, then by EM and so on hopefully you can find parameters that model this data set pretty well. But hopefully this gives you a sense of the type of data set this model could generate.
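The sampling process just described can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the lecture; in particular Psi = diag(1, 2) is an assumed value chosen so that x_2 has the bigger variance, since the exact board numbers aren't in the transcript.

```python
import numpy as np

rng = np.random.default_rng(0)

# d = 1 latent dimension, n = 2 observed dimensions, m = 7 samples.
m = 7
Lambda = np.array([[2.0], [1.0]])   # the 2-by-1 "lambda" from the example
mu = np.zeros(2)
Psi = np.diag([1.0, 2.0])           # assumed diagonal noise covariance

z = rng.standard_normal((m, 1))                  # z ~ N(0, 1): seven points on a number line
on_line = z @ Lambda.T + mu                      # lambda z + mu: lies exactly on a line in R^2
eps = rng.multivariate_normal(np.zeros(2), Psi, size=m)
x = on_line + eps                                # the "red crosses": the line plus Gaussian fuzz

print(x.shape)   # (7, 2)
```

Plotting `on_line` against `x` reproduces the board picture: seven points on a straight line, each perturbed by an axis-aligned Gaussian.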
[00:50:37] One way to think of this data is: you have two-dimensional data, but most of the data lies on a one-dimensional subspace. That's how to think about it: you have two-dimensional data, since n is 2, but most of the data lies on a roughly one-dimensional subspace, meaning it lies up there on the line, and then there's a little bit of noise off that line, okay? [00:50:58] All right, let me quickly do one more example, because these are high-dimensional spaces and I think this is useful for building intuition. [00:51:08] So let's go through the example where Z is in R^2, X is in R^3, and let's use m equals 5. So with a different set of parameters, let's look at the type of data you can generate with factor analysis. So here's z_1, here's z_2; Z is distributed as a standard Gaussian, you know, a circular Gaussian, so maybe this is what a typical sample looks like if you sample z_1 and z_2 from a standard Gaussian. [00:51:49] That would be a typical sample in z_1 and z_2. So now, all right, I'm gonna do a demo: let me take these five examples and just copy them to this piece of paper. Okay, great, so we've transferred this from the whiteboard to this piece of paper, this brown cardboard. So now you have z_1 and z_2 in a two-dimensional space. [00:52:23] What we're going to do is compute lambda Z plus mu, where lambda will be 3 by 2 and mu will be 3 by 1. [00:52:34] So what this computation will do, as you map from Z in two dimensions to lambda Z plus mu, is map from two-dimensional data to three-dimensional data. In other words, you're going to take the two-dimensional data lying on the plane of the whiteboard and map it (check out the cool animation) into the three-dimensional space of our classroom.
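The whiteboard-to-classroom map on its own can be sketched as below. Lambda and mu here are made-up values (the transcript gives none); the point is that any 3-by-2 Lambda plus a 3-vector mu sends the five 2D points onto a two-dimensional plane sitting inside R^3.

```python
import numpy as np

rng = np.random.default_rng(1)

m = 5
Lambda = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])     # assumed 3-by-2 matrix
mu = np.array([0.0, 0.0, 1.0])      # assumed 3-by-1 offset

z = rng.standard_normal((m, 2))     # five draws from a circular standard Gaussian in R^2
x_plane = z @ Lambda.T + mu         # the 2D "whiteboard" points mapped into 3D

# Before any noise is added, every point lies exactly on a 2D affine plane:
# after centering, the data matrix has rank at most 2.
centered = x_plane - x_plane.mean(axis=0)
print(x_plane.shape, np.linalg.matrix_rank(centered))
```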
[00:53:04] And then the last step is: for each of these points in this three-dimensional space, x_1, x_2, x_3, we'll have a little Gaussian bump that is axis-aligned, because the components of epsilon are uncorrelated. We take each of these five points and add the fuzziness, a little bit of Gaussian noise, to each one, and so what you end up with is a set of red crosses: a few examples lying near the plane, except that they'll have a bit of noise off this plane as well. [00:53:42] So what the factor analysis model can capture is data in 3D, in this 3D space, where most of the data set lies on this maybe roughly two-dimensional pancake, but with a little bit of fuzziness off the pancake. So this would be an example of the type of data that factor analysis can model. [00:54:07] And the intuition, really, is that factor analysis can take very high-dimensional data, say 100-dimensional data, and model the data as roughly lying on a three-dimensional or five-dimensional subspace, with a little bit of noise off that low-dimensional subspace. [00:54:55] So let's talk about... oh, right, the question: does this work as well if the data is not lying on a low-dimensional subspace? Let's see. Even in 2D, if you have this data set, you still have the freedom to choose the Gaussian noise, in which case you can actually model things that lie quite far off a subspace. But yeah, with a very high-dimensional data set it's actually very difficult to know what's going on, because you can't visualize these very high-dimensional data sets, and you also don't have enough data to fit very complex models. So I feel like, yes, if the data actually does not roughly lie on a subspace, then this model may not be the best model.
But when you have such high-dimensional data and such a small data set, you can't fit very complex models to it anyway, so this might be pretty reasonable. [00:56:11] So, it turns out that the derivation of EM for factor analysis is actually one of the trickiest EM derivations, in terms of how you calculate the E-step and how you calculate the M-step. The whole algorithm, every single step, is stepped through in great detail in the lecture notes, but what I want to do is give you the flavor of how to do the derivation, and especially draw attention to the trickiest steps, so that if you ever need to derive an algorithm like this yourself, or maybe a different Gaussian model, you know how to do it. But I won't do every step of the algebra here. [00:56:46] So in order to set ourselves up to derive EM for factor analysis, I want to describe a few properties of multivariate Gaussians. [00:56:56] Let's say that X is a vector, and I'm gonna write it as a partitioned vector: there are r components in the first part and s components in the second, so x_1 is in R^r and x_2 is in R^s. [00:57:26] If X is Gaussian with mean mu and covariance Sigma, then similarly let mu be written as this sort of partitioned vector, just broken up into two sub-vectors corresponding to the first r components and the second s components, and similarly let the covariance matrix be partitioned into these four blocks, where this is r components, this is s components, this is r components, this is s components. So all this means is: you take the covariance matrix and take the top-left r-by-r elements and call that Sigma_11, and similarly for the other sub-blocks of this covariance matrix. [00:58:22] So in order to derive factor analysis, one of the things you need to do is compute marginal and conditional distributions of Gaussians. [00:58:31] The marginal is, you know: what is p(x_1)? If you were to derive this, the way you compute the marginal is to take the joint density p(x), which you can write as p(x_1, x_2) because X can be partitioned into x_1 and x_2, and then integrate out x_2: the integral of p(x_1, x_2) dx_2 gives you p(x_1). And if you plug in the formula for the Gaussian density, you know, 1 over ((2 pi)^(n/2) |Sigma|^(1/2)) times e to the minus 1/2 (x minus mu)^T Sigma^{-1} (x minus mu), if you plug this into p(x_1, x_2) and actually do the integral, [00:59:49] then you will find that the marginal distribution of x_1 is given by: x_1 is Gaussian with mean mu_1 and covariance Sigma_11. So it's kind of a not-shocking result that the marginal distribution is given just by that, [01:00:12] and again the way to show it rigorously is to do this calculation, but it's actually not shocking, I guess, that that's what you would get, okay? [01:00:24] And then the other property you will need to use is the conditional, which is: given the value of x_2, what is the conditional distribution of x_1? The way to do that, in theory, is to take p(x_1, x_2) divided by p(x_2) and then simplify, and it turns out you can show that x_1 given x_2 is itself Gaussian, with some mean and some covariance, which I'll write as mu_{1|2} and Sigma_{1|2}. And mu_{1|2} is... well, this is one of those long formulas that I actually don't manage to remember; every time I need it I just look up what's written in the lecture notes, and I recommend you do that as well. [01:01:40] So that's how you compute marginals and conditionals of a Gaussian distribution.
[01:01:59] So, using these properties of the multivariate Gaussian density, let's go through the high-level steps of how you derive the EM algorithm. [01:02:34] Step one: let's derive the joint distribution p(x, z). In particular, it turns out that if you take Z and X and stack them up into a vector like so, then (Z, X) viewed as a vector will be Gaussian, with some mean and some covariance, because X and Z jointly have a Gaussian density. So let's try to quickly figure out what that mean and that covariance matrix are. [01:03:18] So that was the definition of these terms. The expected value of Z is equal to 0, because Z is Gaussian with mean 0 and covariance the identity, and the expected value of X is equal to the expected value of mu plus lambda Z plus epsilon; but Z has zero expected value and epsilon has zero expected value, so that just leaves you with mu. And so this mean vector mu_{zx} is going to be equal to (0, mu) stacked, [01:03:59] where the zero part is d-dimensional and the mu part is n-dimensional. [01:04:22] And it turns out that you can similarly compute the covariance matrix Sigma, where the first block is d dimensions and the second is n dimensions. If you take this partitioned vector and compute the covariance matrix, the four blocks of the covariance matrix can be written as follows, and you can derive one at a time what each of these different blocks looks like. Let me just derive one of them, Sigma_22, the lower-right block; the rest are derived similarly and are also fleshed out in the lecture notes. [01:05:43] The way you derive this block is you say: Sigma_22 is E[(X - E[X])(X - E[X])^T]. If I plug in the definition of X, X minus E[X] is lambda Z plus mu plus epsilon minus mu, [01:06:21] because the expected value of X is mu, so the mus cancel out, leaving lambda Z plus epsilon. [01:06:34] Then if you do the quadratic expansion, this becomes the expected value of (lambda Z + epsilon)(lambda Z + epsilon)^T; it's (a + b)(a + b)^T, so you get four terms as a result: the first term is lambda Z (lambda Z)^T, plus lambda Z epsilon^T, plus epsilon (lambda Z)^T, plus epsilon epsilon^T. [01:07:25] The cross terms have zero expected value, because epsilon and Z both have zero expected value and are uncorrelated, so those are 0 in expectation, and you're just left with the expected value of lambda Z Z^T lambda^T plus the expected value of epsilon epsilon^T. [01:07:56] By the linearity of expectation you can take the expectation inside the matrix multiplication, so the first term is lambda E[Z Z^T] lambda^T, plus the second term, which is just the covariance of epsilon, which is Psi. And then, because Z is drawn from a standard Gaussian with identity covariance, that expectation in the middle is just the identity, so this is lambda lambda^T plus Psi. [01:08:26] Okay, so that's how you work out this lower-right block of the covariance matrix. I know I did that a little bit quickly, but every step is written out more slowly in the lecture notes as well. [01:08:42] And it turns out that if you go through a similar process, deriving one at a time what the other blocks of this covariance matrix are, you find that the other blocks are the identity and lambda^T (and its transpose, lambda), and the lower-right one we just worked out; so the full covariance is [[I, lambda^T], [lambda, lambda lambda^T + Psi]]. So that is the covariance matrix.
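One can sanity-check the block derivation numerically: sample many (z, x) pairs from the generative model (with made-up Lambda, mu, Psi values) and compare the empirical covariances against the derived blocks Lambda^T and Lambda Lambda^T + Psi. A sketch:

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up parameters, d = 1 and n = 2.
Lambda = np.array([[2.0], [1.0]])
mu = np.zeros(2)
Psi = np.diag([1.0, 2.0])

N = 200_000
z = rng.standard_normal((N, 1))
eps = rng.multivariate_normal(np.zeros(2), Psi, size=N)
x = z @ Lambda.T + mu + eps

# Lower-right block: Cov(x) should approach Lambda Lambda^T + Psi.
emp_xx = np.cov(x, rowvar=False)
print(emp_xx, Lambda @ Lambda.T + Psi)

# Off-diagonal block: Cov(z, x) should approach Lambda^T (a 1-by-2 block here).
emp_zx = (z - z.mean(0)).T @ (x - x.mean(0)) / (N - 1)
print(emp_zx, Lambda.T)
```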
[01:09:41] So where we are is: we've figured out that the joint distribution, the joint density, of (Z, X) is Gaussian, with mean given by that vector and covariance given by that matrix. [01:10:04] And so what you could do is write down p(x_i), which will be this Gaussian density (the marginal), take derivatives of the log-likelihood with respect to the parameters, set them to 0, and solve. And you find that there is no known closed-form solution; there is actually no closed-form solution for finding the values of lambda, Psi, and mu that maximize the likelihood. [01:10:32] So in order to fit the parameters of the model, we're instead going to resort to EM. [01:11:02] So let's first derive the E-step, in which you need to compute Q_i(z_i) = p(z_i | x_i). Now, z_i here is a continuous random variable. When we were fitting a mixture-of-Gaussians distribution, z_i was discrete, and so you could have a list of numbers, represented by w_ij, that just stores in a vector the probability of each of the discrete values of z_i. But in this case z_i has a continuous density, so how do you represent q_i(z_i) in a computer? [01:11:37] It turns out that, using the formulas we have for the marginal, excuse me, for the conditional distribution of a Gaussian, if you compute this right-hand side you'll find that z_i given x_i is going to be Gaussian with some mean and some covariance, [01:12:01] where it's basically those formulas: mu of z_i given x_i is equal to (taking that conditional-mean formula and applying it, with the mean term here being 0) lambda^T (lambda lambda^T + Psi)^{-1} (x_i minus mu), and Sigma of z_i given x_i is I minus lambda^T (lambda lambda^T + Psi)^{-1} lambda. [01:12:44] Okay, so these equations are exactly those two conditional-Gaussian equations, applied to that big Gaussian density that we have, okay?
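The E-step formulas just stated can be sketched directly, with assumed toy parameter values; note that the posterior covariance is the same for every example, so only the per-example mean depends on x_i:

```python
import numpy as np

def e_step(X, Lambda, mu, Psi):
    """Posterior q_i(z_i) = p(z_i | x_i) = N(mu_post[i], Sigma_post) under factor analysis.

    mu_{z|x}    = Lambda^T (Lambda Lambda^T + Psi)^{-1} (x - mu)
    Sigma_{z|x} = I - Lambda^T (Lambda Lambda^T + Psi)^{-1} Lambda
    """
    n, d = Lambda.shape
    G = np.linalg.inv(Lambda @ Lambda.T + Psi)      # n x n, symmetric
    mu_post = (X - mu) @ G @ Lambda                 # m x d: one posterior mean per example
    Sigma_post = np.eye(d) - Lambda.T @ G @ Lambda  # d x d: shared posterior covariance
    return mu_post, Sigma_post

# Toy parameters (d = 1, n = 2) just to exercise the formulas.
Lambda = np.array([[2.0], [1.0]])
mu = np.zeros(2)
Psi = np.diag([1.0, 2.0])
X = np.array([[2.0, 1.0], [0.0, 0.0]])
m_post, S_post = e_step(X, Lambda, mu, Psi)
print(m_post.shape, S_post.shape)   # (2, 1) (1, 1)
```

These are exactly the quantities you would store as the representation of Q_i before moving on to the M-step.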
[01:13:01] You compute this vector and compute this matrix, and, you know, store these as variables, and your representation of Q_i is that Q_i is a Gaussian density, right, with this mean and this covariance. So this is what you actually compute to represent Q_i.

[01:13:32] All right, so step two was to write down the E-step, and step three is to derive the M-step. The derivation of the M-step is quite long and complicated, but I want to mention just a key algebraic trick you need to use when deriving the M-step. So, you know, we know from the E-step that Q_i(z(i)) is that Gaussian density, right, so it's 1 over (2 pi)^{d/2} times that thing, times e to the negative one-half of the quadratic form; so that's the formula for Q_i. [01:14:21] It turns out that in the M-step there will be a few places in the derivation where you need to compute something like this integral of Q_i(z(i)) times z(i), and one way to approach this would be to plug in
the density for Q_i, so you end up with this 1 over (2 pi)^{d/2} |Sigma|^{1/2}, you know, and so on, times z(i) dz(i), and then try to compute this integral. [01:15:03] It turns out there's a much simpler way to compute this; anyone know what it is?

[01:15:13] All right, cool, awesome. Right, expected value. So the other way to compute this integral is to note that this is the expected value of z(i) when z(i) is drawn from Q_i. Right, so you know the definition of the expected value of a random variable: the expected value of z is equal to the integral over z of p(z) times z, dz. That's what the expected value of a random variable is, and so this integral is the expected value of z with respect to z drawn from the Q_i distribution. But we know that Q_i is Gaussian with a certain mean and a certain variance, and so the expected value of this is just mu_{z(i)|x(i)}, that thing that you've already computed. [01:16:09] And so when students derive the M-step, you know, for your own implementations of this, one of the key things to notice is: when are you actually taking an expected value with respect to a random variable, in which case it's just a value you've computed already, and when do you need to plug in this big complicated integral, which can lead to very complicated, very intractable calculations. Okay, so whenever you see this, think about whether you need to be expanding a big complicated integral or whether it can be interpreted as an expectation.

[01:16:46] And so for the M-step, it's really, you know, the M-step is... right, so that's the M-step. And if you rewrite this term as a sum over i of the expected value of z(i) drawn from Q_i, it turns out that, um, if you go ahead and plug in the Gaussian density here... [01:18:02] Actually, I want to give one rule of thumb for whether or not you should plug in the complicated formula for a Gaussian density.
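(A quick numerical illustration of that trick; the mean and variance below are made-up toy values, not anything from the lecture. The integral of Q(z) times z is just the mean of Q, so sampling confirms it without ever expanding the Gaussian integral.)

```python
import numpy as np

# Toy Q(z) = N(m, s^2).  Then  integral Q(z) * z dz = E[z] = m,  and
# integral Q(z) * z^2 dz = E[z^2] = m^2 + s^2, both known in closed form.
m, s = 1.5, 0.7
rng = np.random.default_rng(0)
z = rng.normal(m, s, size=1_000_000)  # samples drawn from Q

print(z.mean())         # Monte Carlo value of the first integral, approx 1.5
print((z ** 2).mean())  # Monte Carlo value of the second, approx m**2 + s**2 = 2.74
```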
This is just a rule of thumb after doing this type of math a long time: see if there's a log in front. If there's a log in front of a Gaussian density, then because the Gaussian density is an exponentiation, right, the Gaussian density is, you know, one over something times e to the something, the log and the exponentiation cancel out and the equations simplify. So one trick as you're doing these derivations is just to see if there's a log in front of a Gaussian density, and when there is, go ahead and plug in the formula for the Gaussian density; the log will simplify it, and what you end up with is the log of a Gaussian density being a quadratic function of the parameters. [01:18:46] And if you take the expected value, with respect to a Gaussian density, of a quadratic function, this whole thing ends up being a quadratic function, and then you can take derivatives of that equation with respect to the parameters, with respect to mu, say, set the whole thing to zero, and then solve, and it'll be roughly the level of complexity of maximizing a quadratic function. [01:19:11] Okay, hope that makes sense. Um, the actual formulas are a little bit complicated, so I'll leave you to read through them in the lecture notes, but I think the takeaway is: don't expand this integral, and when you are deriving this, plug in the Gaussian densities here, because they'll all be simplified. Okay, and the details are in the lecture notes. So let's break for today. Best of luck with the midterm, I hope you guys do well. All right, I'll see you guys in a few days.

================================================================================ LECTURE 016 ================================================================================
Lecture 16 - Independent Component Analysis & RL | Stanford CS229: Machine Learning (Autumn 2018)
Source: https://www.youtube.com/watch?v=YQA9lLdLig8
---
Transcript

[00:00:03] Hey everyone, let's get started. So, um,
[00:00:12] let's see, the plan for today is: we'll go over the rest of ICA, independent component analysis, and in particular talk about CDFs, cumulative distribution functions, and then derive the ICA model. And in the second half of today we'll start on the final of the four major topics of the course, which is reinforcement learning; we'll talk about MDPs, or Markov decision processes.

[00:01:07] So to recap briefly: you remember the overlapping-voices demo. We said that in the ICA problem, the independent components problem, we have sources s which are in R^n if you have n speakers. So for example, if this is speaker one's audio, then at time t, s, you know, superscript parenthesis t, subscript 1, is the sound emitted by speaker 1 at time t. [00:01:47] And we're sometimes using i to index training examples, and so the training examples sweep over time; usually I use i, sometimes I use t, I guess in the case where the different examples come from different points in time in your recording. And what your microphones record is x(i) = A s(i). So just for now let's say you have two speakers and two microphones, in which case A will be a 2x2 matrix, or a harder problem you might face is five speakers and five microphones, in which case A will be a 5x5 matrix; we'll talk later about what happens when the number of speakers and the number of microphones is not the same. And the goal is to find a matrix W, which should hopefully be A inverse, so that s(i) = W x(i) recovers the original sources. And we're going to use w_1 up to w_n to represent the rows of this matrix W. Oh, yes, you're right, thank you.

[00:02:59] So last time we had, all right, just remember, this is a picture of the cocktail party problem, and last time I showed these pictures about, you know, why is ICA even possible? Given two overlapping voices, how is it even possible to separate them out? How is there enough information to know, you know, what the two overlapping voices are? And so one picture we saw was this one, where if s_1 and s_2 are uniform between minus 1 and plus 1, then the distribution of the data will look like this. If you pass this data through the mixing matrix A, then your observations, now the axes have changed to x_1 and x_2, may look like this, and your job is to find an unmixing matrix W that maps this data back to the square. [00:03:57] Okay, now, this example is possible because the sources s_1 and s_2 were distributed uniformly between minus 1 and plus 1.
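(That square-to-parallelogram picture is easy to reproduce numerically; the 2x2 mixing matrix A below is an arbitrary made-up example, not one from the lecture.)

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.uniform(-1, 1, size=(10_000, 2))  # two sources, each uniform on [-1, 1]

A = np.array([[1.0, 0.6],                 # a hypothetical mixing matrix:
              [0.4, 1.0]])                # microphones observe x = A s
X = S @ A.T                               # the square becomes a parallelogram

W = np.linalg.inv(A)                      # the ideal unmixing matrix W = A^{-1}
S_rec = X @ W.T                           # s = W x maps the data back to the square

assert np.allclose(S_rec, S)              # exact recovery with the true W
```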
It turns out human voices, you know, the recordings at each moment in time, are not distributed uniformly between minus 1 and plus 1, and it turns out that, um, if the data were Gaussian, then ICA is actually not possible. [00:04:19] Here's what I mean. So the uniform distribution is a highly non-Gaussian distribution, right; uniform between minus 1 and plus 1, you know, this is not Gaussian, and that makes ICA possible. What if s_1 and s_2 came from Gaussian densities? Right, if that were the case, then this distribution of s_1 and s_2 would be rotationally symmetric, and so there'd be a rotational ambiguity, right: any axis could be s_1 and s_2, and you can't map, you know, this type of parallelogram back to this square. [00:04:56] Right, so if I drew in this parallelogram, you can sort of read off, you know, that maybe one axis should look like that (sorry, I'm drawing with the mouse and not doing very well), and the second axis should maybe look like that, right, and by inverting that you can get the data back to the square. But if the data looks like this, then you actually don't know, because maybe this should be s_1 and that should be s_2, right. So there's this rotational ambiguity: because the Gaussian distribution is rotationally symmetric, if s_1 and s_2 are standard Gaussian, then this distribution is rotationally symmetric, and you don't have enough information to recover the directions that correspond to the original sources.

[00:05:44] Okay, so it turns out that there is some ambiguity in the output of ICA. In particular, last time we talked about two sources of ambiguity: you don't know which is speaker 1 and which is speaker 2, right, you don't know which one to number speaker 1 and which one to number speaker 2; and you might take this data and flip it horizontally, reflect it, you know, rename s_1 to negative s_1, or reflect it on the vertical axis, so we don't know positive s_2 from negative s_2. And in the case of this example, where s_1 and s_2 are uniform between minus 1 and plus 1, those are the only sources of ambiguity. But if the data were Gaussian, there's the additional rotational ambiguity, which actually makes it impossible to separate out the sources. [00:06:43] Okay, so it turns out that the Gaussian density is the only distribution that is rotationally symmetric: if s_1 and s_2 are independent and the distribution is rotationally symmetric, meaning that the distribution has sort of circular contours, then it must be a Gaussian density. And so there is a theorem, which I'm just stating informally, that ICA is possible only if your data is not Gaussian.
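(The rotational ambiguity can also be seen numerically: samples from N(0, I) look statistically identical after any rotation, so no statistic can pin down the original axes. The rotation angle below is arbitrary.)

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(200_000, 2))             # s ~ N(0, I), rotationally symmetric

theta = 0.7                                    # an arbitrary rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
Z_rot = Z @ R.T                                # rotate every sample

# N(0, I) rotated is N(0, R R^T) = N(0, I): the covariance is unchanged,
# which is exactly why the source directions cannot be recovered.
print(np.cov(Z_rot.T))                         # approximately the identity matrix
```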
But once your data is not Gaussian, then it is possible to recover the independent sources, okay; I'm just stating that informally. [00:07:27] So let's see, so what I'd like to do is develop the ICA algorithm assuming that the data is non-Gaussian. Okay, now, in order to derive the ICA model, we need to figure out what is the density of s, right, and I'm going to use p subscript s, you know, of the random variable s, to represent the density of s. An equivalent way to represent the probability, the density, of a continuous random variable is via its CDF, which stands for cumulative distribution function. [00:08:17] And the cumulative distribution function of a random variable, F(s), in probability, is defined as the chance that the random variable is less than that value: F(s) = P(S <= s). So, I guess my notation has been inconsistent, sorry, but this capital S I'm using to denote the random variable, and this is some constant, right, and it's that same constant that is that lowercase s, okay. [00:08:50] And so for example, if this is the PDF of the random variable s, maybe of a Gaussian, right, the CDF is a function that increases from 0 to 1, where the height of the CDF at a certain point is a probability. So if you take the curves at the same point: the height of the CDF at a certain point, lowercase s, is the probability that the random variable takes on a value equal to this value or lower, which means that the height of this function is equal to, you know, the probability mass, the area under the curve of your PDF, to the left of that point. Okay, so that's, you know, something statistics courses sometimes teach, I guess. But so there's a mapping between the PDF and the CDF of a continuous random variable, and the relation between the PDF and the CDF is that the density is equal to the first derivative, right, F prime: so if you take the derivative of the CDF, then you should recover the PDF. [00:10:23] Okay. But so, I think, in order to specify, you know, some random variable, we could either specify the PDF, right, the probability density function, or you could specify the CDF, which just, you know, tells me what's the chance of the random variable taking on a value less than any particular value s. And by taking the derivative of this you can always recover the PDF, and by integrating this you can always go back to the CDF, okay. And so what we're going to do in ICA is, instead of specifying a PDF for how speakers' voices sound, we're instead going to specify a CDF, and we have to choose a CDF that is not the Gaussian CDF, because we have assumed that the data is non-Gaussian. And the CDF, you know, is a function that always goes from, right, zero to one.
[00:11:45] All right, so in a little bit we'll specify some CDF for the density of the sources, for what human voices sound like, let's say, and if you differentiate this you will get the PDF, or the density, equal to that. Now, um, we're going to derive a maximum likelihood estimation algorithm in a minute, but our model is that x = As, which is equal to, I guess, W^{-1} s, and s = Wx, right. So that's the model, and in order to derive a maximum likelihood estimate for the parameters, well, this is going to be the density of x. [00:12:43] So this is the relationship between x and s: x = As = W^{-1} s, and s = Wx, right. So this is the model, and what I'd like to do is, let's say you know the density of s: what is the density of x, if x is computed as the matrix A times s? [00:13:15] So one step that's tempting to take is to just say, well, s = Wx, so the probability of x is just equal to the probability of s taking on that certain value, right? So, I mean, this is s, and so the probability of seeing a certain value of x is equal to the probability of s taking on that corresponding value, because, assuming W is an invertible matrix, there's a one-to-one mapping between x and s; so to find the probability of x, just find the paired s and compute the corresponding probability. It turns out this is incorrect. This works for probability mass functions, for discrete probability distributions that take on discrete values, but it is actually incorrect for continuous probability densities. [00:14:02] So let me, um, show an illustration and then go back to derive what is the correct way of computing the density of x. Oh, and we want the density of x because when you get the training set you only get to observe x, and so for finding the maximum likelihood estimate of the parameters you need to know the density of x, so you can, you know, choose the parameters, choose the parameters W, that maximize the likelihood; so that's why we want to compute the density of x. But, um, let's use a simple example. Let's say the density of s is p_s(s) = 1{0 <= s <= 1}, okay, so this is s distributed uniformly from 0 to 1, and let's say x = 2s; so, in our notation, A = 2 and W = 1/2, and this is an n = 1, one-dimensional example. [00:15:02] So this is the density of s, right, the uniform distribution from 0 to 1, and if x = 2s, then it seems like x should be, x is distributed uniformly from 0 to 2, right? Because if s is uniform from 0 to 1 and you multiply by 2, x is distributed uniformly from 0 to 2, and so the density of x is equal to this, and it's now half as
tall, because probability density functions need to integrate to one, right? So this is the uniform-from-zero-to-two probability density function, and so the correct formula is p(x) = 1/2 times the indicator that x is between 0 and 2, okay? [00:16:27] And more generally, the correct formula for this has an extra factor: the determinant of the matrix W. In the case of a real number, the determinant of a real number is just its absolute value, which is why we have the density of x equal to 1/2 — you know, the absolute value of the determinant of W — times the indicator that Wx, which here is x/2, is within 0 to 1, okay? Right, so I guess this is the indicator that 0 ≤ x/2 ≤ 1, okay? [00:17:28] So this is an illustration showing why this is the right formula, with the absolute value of the determinant of W in there, as the way to compute the density of x.
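As a quick sanity check, the 1-D example above can be sketched in a few lines of Python (my own illustration; the function names are not from the lecture): s ~ Uniform[0, 1], x = 2s, so A = 2, W = 1/2, and the change-of-variables formula gives p_x(x) = |W| · p_s(Wx).

```python
def p_s(s):
    """Density of the source: uniform on [0, 1]."""
    return 1.0 if 0.0 <= s <= 1.0 else 0.0

def p_x(x, W=0.5):
    """Density of x = A s implied by p_x(x) = |W| * p_s(W x)."""
    return abs(W) * p_s(W * x)

print(p_x(1.0))   # 0.5 -- x = 1 lies inside [0, 2]
print(p_x(3.0))   # 0.0 -- x = 3 lies outside [0, 2]
```

Without the |W| factor, the implied density would integrate to 2 rather than 1, which is exactly the normalization issue the determinant term fixes.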
If you're not familiar with determinants: the determinant is a function you can call in NumPy to compute, but the intuition is that the determinant measures how much a linear map stretches out a local volume, and so you need to divide by the determinant of A, or multiply by the determinant of W, in order to make sure these distributions normalize to one, right? So that's where that comes from. [00:18:09] So we're nearly done; just one more decision and then we can derive the maximum likelihood estimate of the parameters. The last thing we need to do is choose the density of what, you know, speakers' voices sound like, and as I said just now, what we're going to do is choose a non-Gaussian distribution, right? And so, well, F(s) is equal to the chance of this person's voice — the random variable s — being less than a certain value, and we need a smooth function that goes between, you know, 0
and 1, right? We need a smooth function that has that shape. And so, well, what functions do we know that have that shape? Let's take the sigmoid function, and it turns out this will work, okay? There are many choices that actually work fine. It turns out that if you choose the sigmoid function to be the CDF, then look at the PDF this induces if you take the derivative — so take p(s) equal to the derivative of the CDF. [00:19:27] It turns out that, compared with taking the Gaussian's CDF, the PDF that this choice induces is something with fatter tails, by which I mean, look at how it goes to zero: the Gaussian density goes to zero very quickly, right? It's like e to the negative s squared — the Gaussian has a square in the exponent of the density — and it turns out that this density, obtained by computing the derivative of the sigmoid, goes to zero more slowly.
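To see the fatter tails concretely, here is a small comparison (my own illustration, not from the lecture) of the density induced by the sigmoid CDF, g'(s) = g(s)(1 − g(s)), against the standard Gaussian density:

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def logistic_pdf(s):
    # the PDF induced by using the sigmoid as the CDF: g'(s) = g(s) (1 - g(s))
    g = sigmoid(s)
    return g * (1.0 - g)

def gaussian_pdf(s):
    # standard normal density: the square in the exponent kills the tails fast
    return math.exp(-s * s / 2.0) / math.sqrt(2.0 * math.pi)

for s in [0.0, 2.0, 5.0, 8.0]:
    print(s, logistic_pdf(s), gaussian_pdf(s))
# near 0 the Gaussian is taller, but a few standard deviations out the
# logistic density is orders of magnitude larger (fatter tails)
```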
And this captures human voices and many natural phenomena better than the Gaussian density, because there is a larger number of extreme outliers that are more than one or two standard deviations away. But there are actually multiple distributions that work: you could have used a double-sided exponential distribution — so take an exponential distribution, make it symmetric on both sides, and that's the PDF — and that will also work quite well for ICA. But I think in the early history of ICA, you know, researchers — I think one of them might have been Terry Sejnowski — thought that you just needed a function with these properties, and you picked the sigmoid and plugged it in, and it works just fine. It's been a good enough default that it's still widely used, right? But people have also used this double-sided exponential, sometimes also called the Laplacian
distribution; this works fine as well as a choice of p(s). [00:21:17] So the next step: the density of s is equal to the product — rather, from i equals 1 through your n sources — of the probability of each of the speakers emitting that sound, right? Because the n speakers are speaking independently, right. [00:22:12] Wait, say that again? Oh yes, you're right, sorry about that — yes, this should have been up here, right: you go from the CDF to the PDF by taking derivatives. Oh, cool. [00:22:47] So s is the vector of all, you know, two speakers' or all five speakers' voices at one moment in time. So the density of s — s is in R^n — is the product of the individual speakers' probabilities, and this is the key assumption of ICA: that, you know, your two speakers or your five speakers are having independent conversations, and so at every moment in time they choose independently of each other what sound to emit. And so, using the formulas you
[00:23:20] teammate and so using the formulas you worked out just now the density of X is [00:23:24] worked out just now the density of X is equal to well as we did the density of W [00:23:36] equal to well as we did the density of W x times the determinant of W so and this [00:23:44] x times the determinant of W so and this is equal to [00:23:59] Oh in this notation WI transpose X this [00:24:04] Oh in this notation WI transpose X this is um right because WI is the I've row [00:24:08] is um right because WI is the I've row of the matrix W and so you know I guess [00:24:13] of the matrix W and so you know I guess s SJ is equal to W J transpose X right [00:24:19] s SJ is equal to W J transpose X right so you take a corresponding row and [00:24:20] so you take a corresponding row and multiply it by X to get the [00:24:22] multiply it by X to get the corresponding source actually sorry I [00:24:25] corresponding source actually sorry I think that's right yeah let me use J [00:24:27] think that's right yeah let me use J there okay and so um this writes out so [00:24:42] there okay and so um this writes out so this shows what is the density of X [00:24:46] this shows what is the density of X expressed as a function of P of s which [00:24:51] expressed as a function of P of s which have assumed which affects as a CDF of [00:24:53] have assumed which affects as a CDF of the sigmoid as a as the derivative of [00:24:56] the sigmoid as a as the derivative of the sigmoid and as a function of the [00:24:58] the sigmoid and as a function of the parameter W right so this is a model [00:25:02] parameter W right so this is a model that given a setting of the parameters W [00:25:05] that given a setting of the parameters W which square matrix allows us to write [00:25:09] which square matrix allows us to write down what's the density of banks [00:25:20] so the final step is we could use [00:25:25] so the final step is we could use maximum likelihood estimation to [00:25:28] maximum 
[00:25:20] So the final step is we can use maximum likelihood estimation to estimate the parameters W. So the log likelihood of W is equal to the sum over the training examples of the log of this density, and you can use stochastic gradient ascent: take the derivative of the log likelihood with respect to W — it turns out this is derived in the lecture notes; I'll just write it out here. [00:26:31] I hope I got that right — yeah, okay, right. [00:26:41] And it turns out that if you use this formula — don't worry about the form of the derivative; the full derivation is given in the lecture notes — but if you use the derivative of the log likelihood with respect to the parameter matrix W, and use stochastic gradient ascent to maximize the log likelihood, and run this for a while, then you can get ICA to find a pretty good matrix W for unmixing the sources, okay? So just to recap the whole algorithm, right: you would have a training set of x^(1) up through x^(m), where each of your training examples is
the microphone recordings at one moment in time, and so time goes from 1 through m. What you do is initialize the matrix W, say randomly, and use gradient ascent with this formula for the derivative in order to maximize the log likelihood of the data. And after gradient ascent converges, you then have a matrix W, and you can then recover the sources as s = Wx, and now that you have the sources, you can take, say, s_1^(1) through s_1^(m) and play that through your, you know, laptop speaker in order to hear what source 1 sounds like. So that's how you would take, you know, overlapping voices and try to unmix them. [00:28:25] [Student question, partly inaudible, about whether the sigmoid is a wise choice and about rotational ambiguity.] How to visualize that? Try plotting it in NumPy and matplotlib. I guess if you plot the contours of the density — so it turns out that if this is s_1 and this is s_2, what you do not want is a density whose contours look like that [rotationally symmetric].
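The whole recipe just recapped can be sketched in a few lines of NumPy. This is my own toy illustration, not the course code: the sources, mixing matrix, step size, and number of passes are all made-up choices, and the per-example update is the stochastic gradient ascent rule W := W + α[(1 − 2g(Wx))x^T + (W^T)^{-1}] from the lecture notes.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# two independent fat-tailed (Laplacian) sources, m moments in time
m = 5000
S = rng.laplace(size=(m, 2))
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])              # "unknown" mixing matrix
X = S @ A.T                             # observed recordings, x^(i) = A s^(i)

W = np.eye(2)                           # initialize W (here: identity)
alpha = 0.01                            # step size (toy choice)
for _ in range(10):                     # a few passes over the data
    for i in rng.permutation(m):
        x = X[i]
        g = sigmoid(W @ x)
        W += alpha * (np.outer(1.0 - 2.0 * g, x) + np.linalg.inv(W.T))

S_hat = X @ W.T                         # recovered sources, s = W x
print(np.round(W @ A, 2))               # roughly a scaled permutation of I
```

If it has worked, each recovered source should track one true source up to the usual ICA ambiguities of scaling and permutation.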
I haven't done this for a while, but I believe if you take this distribution, the contours will look like that — it's been a while since I've thought about this, but I think it'll look like that — so this is not rotationally symmetric. Do you know the Laplacian? Yeah, okay — yes, the Laplacian looks like that, I think, and the sigmoid-derived density looks a bit like that too, yeah. To double-check this, even — right, post on Piazza if one of you plots it, because as you can see I haven't done it in a while. [00:29:28] [Student asks about interpreting it differently.] Oh — actually, yes, the plot should be like this, I think. Oh, sorry — G is the sigmoid function, yes. [00:30:15] Sure — what's the closest nonlinear extension of this? We don't have a great answer to that right now, frankly. So a bunch of people, including, you know, my former students and me, have done research to try to extend this to nonlinear versions, and there's some stuff that kind of works, but I don't think there's like a tried-and-true
algorithm that I'm ready to say is the right way to do it. [00:30:50] Yeah — actually, maybe I should say a little bit more, if other people find it interesting; yeah, let me try. [00:31:36] So from several years ago — and kind of ongoing — there's been research, some done by my collaborators and me, and some by others, trying to build nonlinear versions of ICA. And so some of you might have seen the slightly infamous Google cat result, right? This was done in the Google Brain project — one of the first projects we did, a few years ago now — where we trained a neural network on, I think it was, many, many hours of YouTube videos, and eventually it learned to detect cats, because apparently there are a lot of cats in YouTube videos. And it turns out that the algorithm we used was sparse coding, which is actually very closely related to ICA, and so this rough
algorithm was attempting to build a nonlinear version of ICA, where you train one layer of sparse coding, let's say, to extract low-level features, and then recursively apply this on top, to learn not just edge detectors but object-part detectors, and then eventually, you know, the somewhat infamous Google cat. But I think this is actually still ongoing research. Some of the most interesting research has been on hierarchical versions of sparse coding — sparse coding is a different algorithm that turns out to be very closely related to ICA, and you can show that they're optimizing for very similar things, so I'd say sparse coding is very similar to ICA — and there are hierarchical versions of this that try to turn it into a multi-layer neural network, and it kind of works,
in the sense that you can show it learns these features. But what happened was that supervised learning really took off, and the whole world shifted its attention to supervised learning and building deep supervised-learning neural networks, and so the hierarchical sparse coding idea — running ICA over and over to learn nonlinear versions — gets rather less attention from researchers than it really deserves. So maybe someone in this class will go back and do more research on that; I still think it's a promising area. [00:33:45] All right, so let me wrap up with some ICA examples. So there's actually [work by] a former TA from the class, Katie Chang, and it turns out that ICA is routinely used to clean up EEG data today. So what's an EEG? Right — you place many electrodes on your scalp to measure little electrical recordings on the surface of your scalp. So
you know, what does the human brain do? Right — the neurons in your brain, right now, fire and generate little pulses of electricity, and if you place an electrode on your scalp, you can get a very weak measurement of the voltage of the electrical activity at, you know, a certain point on your scalp. [00:34:38] So the analogy to — oh, excuse me, something's wrong [with the slides] — all right, so the analogy to the cocktail party problem, the overlapping speakers' voices, is that, you know, your brain does a lot of things at the same time, right? Your brain helps regulate your heartbeat — part of your brain does that — and a part of your brain, you know, makes your eyes blink every now and then, another part of your brain is responsible for making sure that you breathe, and then part of your brain is responsible for thinking about machine learning and stuff like that, right? So your brain
actually handles many tasks at the same time. And as your brain — sorry, I'm still not sure what's wrong with this — okay, and as your brain carries out these different tasks in parallel, different parts of your brain generate different electrical impulses. So think of it as — imagine that you have, you know, a cocktail party in your head, right? So many overlapping voices — these are now voices in your head — but one part of your brain is saying, all right, heart, go and beat, heart, go and beat harder, and another part of your brain is saying, breathe in and breathe out, breathe in and breathe out. Now if only I knew, you know, what's wrong with this PowerPoint — right. And what each electrode on the surface of your scalp does is measure an overlapping combination of all of these voices, because as the different parts of the brain send out these
electrical impulses, they add up, and so any one point on the surface of your scalp reflects a sum — or a mixture, really a sum — of these different voices, of these different things your brain is doing. And so, just zooming into the EEG plot: each line is the voltage measured at a single electrode, right, on, say, your scalp, and these signals are quite correlated. You see that when there's a massive voice in your brain shouting, you know, like, right, "beat your heart" or "blink your eyes", that signal can get through to all of the different electrodes, which is why you can see these artifacts reflected in all of these electrodes. All right — [00:36:51] it turns out a pretty good way to clean up this data is to take all of these time series, pretty much exactly as we learned about with the ICA algorithm, and separate them out into the independent components. And so it turns out in this example there are
two components, one corresponding to driving the heartbeat, and one that's actually the eye-blink component. And so one way to clean up this data — sorry, I should really wonder what's wrong with this; all right, let me try something. [00:37:44] [Pointing at the components:] this says "heartbeat", this says "eye blink". All right — and if you run ICA, and then a person can say, oh, that's the heartbeat, that's the eye blink, and you remove — subtract out — those components, then you can end up with a much more cleaned-up EEG signal, which you can then use for downstream processing. Sorry — over there, yes? [Student question.] There's a lot of research on using EEG readings to try to guess, at a high level, what you're thinking, right? It turns out that you can train a, you know, supervised learning algorithm to try to decide: are you thinking of a noun or a verb, or are you thinking of something edible or something inedible?
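The clean-up step described here — unmix, zero out the components a person has labeled as artifacts, and map back to the electrode space — can be sketched like this. Everything in this snippet (the data, the unmixing matrix, and which components are artifacts) is a placeholder of mine, not real EEG:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4                                          # number of electrodes / components
X = rng.normal(size=(1000, n))                 # stand-in for EEG, one row per time step
W = np.eye(n) + 0.1 * rng.normal(size=(n, n))  # stand-in for an ICA-learned unmixing matrix
artifact_components = [0, 2]                   # say, heartbeat and eye-blink components

S = X @ W.T                                    # independent components, s = W x
S[:, artifact_components] = 0.0                # subtract out the artifact sources
X_clean = S @ np.linalg.inv(W).T               # back to electrode space, x = W^{-1} s

print(X_clean.shape)                           # (1000, 4)
```

The key point is that zeroing a component removes that source's contribution from every electrode at once, which is why this works better than filtering each channel separately.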
There's been very interesting research trying to use EEG to figure out — just at a very coarse level; no, not quite reading every thought you are thinking, but — can we categorize very coarse-level thoughts, like are you thinking of a person or are you thinking of an object? You can actually do that to some extent using EEG readings, and cleaning up the data to remove the eye-blink and heartbeat artifacts is a very useful pre-processing step to get cleaner data to feed into the learning algorithm, to try to categorize, you know, some coarse category of what you're thinking, okay? [00:39:04] And then, one more result — it turns out that — I mentioned that Google cat thing just now — it turns out that if you train ICA — the font is messed up — if you train ICA on natural images, ICA will say that the natural independent components of natural images are these edges. That is, you know, when you
edges and as in that you know when you see a little image patch in the world we [00:39:30] see a little image patch in the world we see you know look somewhere in there one [00:39:32] see you know look somewhere in there one looked just a tiny little piece of the [00:39:34] looked just a tiny little piece of the image right like 10 pixels by 10 pixels [00:39:36] image right like 10 pixels by 10 pixels and if you take that data and model as [00:39:39] and if you take that data and model as ICA I say we'll say that the world is [00:39:42] ICA I say we'll say that the world is made up of edges or made up of patches [00:39:44] made up of edges or made up of patches like these and that the way you end up [00:39:47] like these and that the way you end up with images in the world is by each of [00:39:49] with images in the world is by each of these patches [00:39:50] these patches you know independently saying is there [00:39:51] you know independently saying is there reservations or horizontal insurers [00:39:53] reservations or horizontal insurers is there this type of light on the left [00:39:57] is there this type of light on the left dark on the right is that this type of [00:39:58] dark on the right is that this type of lighter on top doctor the bottom and so [00:40:01] lighter on top doctor the bottom and so on and it's by adding all of these [00:40:03] on and it's by adding all of these voices there you get a typical image [00:40:04] voices there you get a typical image passionate world so there are there [00:40:06] passionate world so there are there interesting theories in neuroscience [00:40:07] interesting theories in neuroscience about whether this is how you know the [00:40:09] about whether this is how you know the human brain learns to see as well so so [00:40:12] human brain learns to see as well so so very very same work on them I see and [00:40:14] very very same work on them I see and sparse coding to try to use these [00:40:16] sparse coding to try to 
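The generative story told here — each patch is a sum of independently activated edge components — can be sketched directly. The two 4x4 edge bases below are hand-made for illustration, not learned; running an actual ICA implementation (e.g. FastICA) on many such patches is what would recover filters close to these bases.

```python
import numpy as np

# Generative sketch of "images = independent edge components added up".
rng = np.random.default_rng(1)

edge_v = np.array([[-1, -1, 1, 1]] * 4, dtype=float)  # dark left / light right
edge_h = edge_v.T.copy()                              # dark top / light bottom

def sample_patch():
    # Independent, heavy-tailed activations for each edge component,
    # mirroring ICA's non-Gaussian independence assumption.
    s = rng.laplace(size=2)
    return s[0] * edge_v + s[1] * edge_h

patches = np.stack([sample_patch() for _ in range(1000)])
# Running ICA on patches.reshape(1000, 16) would recover filters
# close to edge_v and edge_h — the "edges" mentioned in the lecture.
```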
These mechanisms have been used to try to explain how, you know, the human brain learns to perceive images, for example. Okay, so that's it for the algorithms of ICA. [00:40:36] Just a few more comments. I think on Monday someone asked: do the number of speakers and the number of microphones need to be equal? It turns out that if the number of microphones is larger than the number of speakers, that's actually fine. If you run ICA, or a slightly modified version of it, you find that some of the speakers are just silent speakers. So if you have ten microphones and five speakers and you run this algorithm on the ten microphone recordings, you may find that five of the sources are just silent — or there are ways to just not model those extra sources at all, if you think that they're just sources of silence. So this slightly modified version works quite well when the number of microphones is larger than the number of speakers. [00:41:38] If the number of microphones is smaller than the number of speakers, then that's still very much a cutting-edge research problem. So for example, if you have two speakers and one microphone, it turns out that if you have one male and one female speaker — so one relatively high pitch and one much lower pitch — then you can sometimes have algorithms that separate out the two voices with one microphone, but it doesn't work that reliably; it's a little bit finicky. There have been research papers published showing that you can make a reasonable attempt at separating out two voices with just one microphone when the pitches are quite different.
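The easy direction — more microphones than speakers — can be illustrated with linear algebra alone. With 3 microphones and 2 speakers the mixing matrix A is 3x2 (tall), and no information is lost: its pseudoinverse recovers the sources exactly. The matrix below is an assumed toy mixing; real ICA estimates the unmixing from the recordings without knowing A.

```python
import numpy as np

# 3 microphones, 2 speakers: the tall mixing matrix A (3x2) is not
# invertible, but it has full column rank, so a left inverse exists.
rng = np.random.default_rng(2)
n = 200
S = rng.laplace(size=(2, n))     # 2 independent, non-Gaussian sources
A = np.array([[1.0, 0.3],        # assumed 3x2 mixing matrix
              [0.2, 1.0],
              [0.7, 0.6]])
X = A @ S                        # 3 microphone recordings

S_hat = np.linalg.pinv(A) @ X    # pseudoinverse recovers both sources
```

This is why running ICA with "extra" microphones is fine: the extra channels are redundant, and a slightly modified model just reports the surplus sources as silent.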
With, say, one male voice and one female voice that can sometimes work, but separating out two male voices or two female voices is still very hard, and there's ongoing research in those settings. So that's ICA, and you get to play with it more in your homework problem as well. Okay, any last questions about ICA? [00:42:42] [Student question] Oh wait, sorry, where would it be — yeah, so I think if you actually go through the math it just breaks down, because there you can have two independent sources but W is now no longer a square matrix, right? It'll be — what is it — so we write x = As, right, and if x is a real number and s is two-dimensional, then A would be 1 by 2, s would be 2 by 1, and x is 1 by 1. Then, you know, A inverse kind of doesn't exist, right? So you'd need to come up with some other way to formulate the model.
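The degenerate case just described — one microphone, two speakers — is a two-line dimension check: with x scalar and s in R^2, A is 1x2, and a non-square matrix has no inverse, which is exactly why plain ICA breaks down here.

```python
import numpy as np

# One microphone, two speakers: x = A s with A a 1x2 matrix.
A = np.array([[1.0, 0.7]])       # two sources summed into one channel
s = np.array([[0.5], [-1.2]])
x = A @ s                        # a single 1x1 observation

try:
    np.linalg.inv(A)             # inv is only defined for square matrices
    invertible = True
except np.linalg.LinAlgError:
    invertible = False
# invertible is False: you cannot unmix two sources from one channel
# without extra knowledge (e.g. the pitch difference mentioned above).
```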
Where you have one microphone, the problem is just: how do you separate out two overlapping voices? It takes much higher-level knowledge to separate out the two voices. [00:44:42] [Student question] Oh, I see, right, let's see — so if you don't know how many speakers there are: you have all these microphones, or with EEG the number of electrodes you have is fixed, so that's just your data set. And it turns out that if you run ICA with a large number of assumed speakers, you find some of the speakers are silent. There are also some versions of ICA that — let's see — it turns out that if you think there is a relatively small number of speakers, then you don't need to explicitly model a large number of speakers. Instead, what you would model — so again, suppose you set this up as a maximum likelihood estimation problem — let's say that x is in R^10, so ten recordings, but you suspect that there are only five speakers. Then in this case the matrix A would be — what is it — 10 by 5, right, to mix the five sources into ten recordings. And you could formulate the maximum likelihood estimation problem assuming the existence of only five speakers, without modeling a large number of speakers and then finding later that some are silent. So if you parameterize the model like this, using A instead of W, then you can formulate the maximum likelihood estimation problem where you just assume there are five speakers, and x is generated by the five sources mixing through a linear map, plus noise. [00:46:35] [Student question] Oh, I see, sure, right — how do you know how many speakers you have? So I think it's one of those things a little bit like k-means, I guess, where you try it and see what works: if you find that the first few speakers capture most of the variance and the additional speakers are quite silent — quite small — then you could just cut off there. I don't want to go too much into the different numbers of speakers and microphones for ICA. Let me just take a couple of questions — only one question, yeah. [00:47:11] [Student question, inaudible] Um, I'm sure you can — it's not usually done in this version of the algorithm, but I would not be surprised if there are some versions where you do. I've not seen that done myself, actually. All right, cool. [00:47:48] Good, um, so — [00:48:39] all right, so that wraps up our chapter on unsupervised learning. So you learned about k-means clustering, the
EM algorithm for mixtures of Gaussians — really the mixture of Gaussians model — the factor analysis model, and also PCA, and then today the ICA, independent components analysis, algorithm. And all of these were algorithms that could take as input an unlabeled training set — just the x's and no labels — and find various interesting structures in the data, such as clusters or subspaces, or, in the case of ICA, the voices of the individual speakers. And you'll implement ICA and play with it yourself in the homework problem, where you get to separate out, I think, five overlapping voices. So of the four major topics we cover in this course — we've done supervised learning, kind of advice for machine learning, and unsupervised learning — the fourth and final major topic we'll cover in this class will be reinforcement learning. [00:49:55] So to motivate reinforcement learning, let's say you want to have a computer learn to fly a helicopter, right? I think I showed some of the videos of that in the first lecture, so I'll just skip that here. But it turns out that if you are, at every point in time, given the position
of the helicopter — call that the state of the helicopter — and you also take an action on how to move the control sticks, you know, to make the helicopter fly in a certain trajectory, it turns out that it's very difficult to know what's the one right answer for how to move the control sticks of a helicopter. So you don't have a mapping from x to y — because you can't quite specify the one true way to fly a helicopter — and it's hard to use supervised learning. [00:50:42] And what reinforcement learning does is — it is an algorithm that doesn't ask you to tell it the right answer at every step. It doesn't ask you to tell it exactly what's the one true way to move the controls of a helicopter at any moment in time. Instead, your responsibility as a designer, as a machine learning engineer, is to specify a reward function that just tells the helicopter when it's flying well and when it's flying poorly. So your job as a designer is to write a cost function, or a reward function, that gives the helicopter a high reward whenever it's doing well — flying accurately, flying the trajectory you want it to — and gives the helicopter a large negative reward whenever it crashes or does something bad. [00:51:24] And I think — you know, think of this like training a dog, right? You say "good dog" and you say "bad dog", and the dog figures out when to do more of the "good dog" things. Your job is not to tell the dog — well, you can't actually talk to the dog and tell it what to do, I guess that doesn't work — but you can tell it "good dog" and "bad dog", and hopefully, from these positive and negative signals, it learns how to do more of the good things. [00:51:44] Another example: let's say you want to write a program to play chess — or, I guess most famously, and arguably somewhat slightly
over-hyped — Go: AlphaGo, right? So it's very difficult to know, given a certain chess board position — or checkers or Go board position — what is the one true move, what's the one best move. So it's very difficult to formulate, you know, playing chess as a supervised learning problem. Instead, the mechanisms used to play chess are much more like reinforcement learning, where you let your program play chess or Go or whatever, and whenever it wins you go "oh, good computer", and when it loses you go "oh, bad computer". So that's a reward function, and the learning algorithm's job is to figure out by itself how to get more of the positive rewards, right? And actually, a common reward for learning to play chess or checkers or Go is a reward of +1 for a win, -1 for a loss, and 0 for a tie. If you're writing a chess-playing program, R(s) = +1 for a win, -1 for a loss, 0 for a tie would be a common choice of reward, where R is the reward function and s is the state.
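The game reward just written on the board — +1 for a win, -1 for a loss, 0 for a tie — is trivial to write down as a function of the terminal state. The string labels here are a hypothetical stand-in for however your chess program represents game outcomes.

```python
# The common game reward from the lecture: R(s) = +1 win, -1 loss, 0 tie.
# "win"/"loss"/"tie" are illustrative labels for the terminal states.
def R(s):
    if s == "win":
        return +1
    if s == "loss":
        return -1
    return 0  # tie, and (in this scheme) any non-terminal state
```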
Okay — and I'll go into the notation in a little bit. And so, as you can imagine, giving only this type of information to the chess-playing program places much more burden on the program to figure out what to do. In fact, one of the challenges of reinforcement learning — so that's called the reward, and that's called the state, and the state means, um, the status of the chess board — where are the pieces on the chess board — or the status of the helicopter — where exactly is the helicopter, is it right side up or upside down, and where is it, right? [00:53:49] And it turns out one of the challenges — one of the things that makes reinforcement learning hard — is the credit assignment problem. And that means: if your program is playing a game of chess and, let's say, it loses on move 50 — you know, so it plays a game, and then on move 50, right, it's checkmated and loses to its opponent — so it gets a reward of negative one. But how can the program actually figure out what it did well and what it did poorly? Right, if you lose a game on move 50, it might be that the program made a really bad move — made a blunder — on move 20, and then, you know, it just hobbled along for another 30 moves before its fate was sealed, right? So in the game of chess, if you make a bad mistake early on, there can still be many, many moves before the final outcome of winning or losing is reached. [00:54:38] Or another example: it turns out that if you are trying to build a self-driving car, if ever a car crashes, chances are the thing the car was doing right before it crashed was brake. But it's not braking that causes the crash; it's probably something else it did many, many seconds ago that then led to the bad outcome. So there's a bad outcome — how does the algorithm know, of all the things it did before, which it did well, which it should do more of, and which it did poorly, which it should do less of? And conversely, if there's a good outcome — you know, like it wins a game of chess — well, how do you know what you did well, right? So that's called the credit assignment problem: when your algorithm gets some reward, how do you actually figure out what you did well and what you did poorly, so you know what to do more of and what to do less of, right? So as we develop reinforcement learning algorithms, we'll see that the algorithms we use have to at least indirectly try to solve the credit assignment problem. Okay. [00:55:46] So, um, reinforcement learning problems — like playing chess, flying helicopters, or, you know, building these robots — are modeled using the MDP, or Markov decision process. [00:56:18] And this is the
notation — the formalism — for modeling how the world works, and reinforcement learning algorithms will then solve problems posed in this formalism. So what is an MDP? An MDP is a five-tuple (S, A, {P_sa}, γ, R), and let me explain what each of these is. So S is a set of states — for example, in chess this would be the set of all possible chess positions, or in flying a helicopter this would be the set of all possible positions and orientations and velocities of your helicopter. A is the set of actions — for the helicopter this would be all the positions you could move your control sticks to, or in chess it'd be all the moves you could make, you know, in a game of chess. [00:57:39] P subscript sa — P_sa — is the state transition probabilities, and we'll see later that these state transition probabilities tell you: if you take a certain action a in a certain state s, what's the chance of you ending up at a particular different state s'? [00:58:16] Gamma, γ, is the discount factor, a number between 0 and 1 — don't worry about this for now, we'll come back to it in a minute. And R is that all-important reward function. [00:58:41] So in order to develop a reinforcement learning algorithm, I'm going to use as a running example a simplified MDP that we can draw on the whiteboard, right? Helicopters and chess and Go and so on are really complicated MDPs, so just to illustrate the algorithms I'm going to use a simpler MDP, and this is an example drawn from the textbook by Russell and Norvig. We'll use a simple MDP in which you have a robot navigating a simple maze, and there's an obstacle — so this is a grid world, you see, a robot, you know, navigating this very simple maze, and this is a pillar, or a wall, so you can't walk into that wall. And let me just use indexing on the states as follows.
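The five-tuple (S, A, {P_sa}, γ, R) just defined maps naturally onto a small container type. This is a sketch of one plausible way to hold the pieces in code, not anything official from the course; the field names follow the board notation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set, Tuple

State = Tuple[int, int]   # e.g. a grid position like (3, 1)
Action = str              # e.g. "N", "S", "E", "W"

@dataclass
class MDP:
    states: Set[State]                                  # S: set of states
    actions: Set[Action]                                # A: set of actions
    P: Dict[Tuple[State, Action], Dict[State, float]]   # P_sa(s'): transition probs
    gamma: float                                        # discount factor, in [0, 1]
    R: Callable[[State], float]                         # reward function R(s)

# A one-state toy instance, just to show the shape:
toy = MDP(states={(1, 1)}, actions={"N"}, P={}, gamma=0.99, R=lambda s: 0.0)
```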
as follows. So for this MDP, let's go through the five-tuple and talk about what each of the five things is. This MDP has eleven states, corresponding to the eleven possible positions the robot could be in, right, each of these blank squares, so eleven possible states. And the actions are north, south, east and west, right, you can command your robot to move in any of these directions. And I don't know if you've worked with robots before, but when you command a robot to head straight, it doesn't always go exactly straight: sometimes the wheels slip and it veers off at a slight angle. So in this simplified example we're going to model it as: if you command the robot to go north from a certain state, there's a 0.8 probability it successfully goes where you told it to, a 0.1 probability it accidentally veers off to the left, and a 0.1 probability it veers off to the right.
[01:00:49] Okay, if you've worked on real robots, for a lot of robots it is actually important to model the noisy dynamics, the wheels slipping or your orientation being slightly off. Now, a real robot would have a much bigger state space than these eleven states, right, so this is simplified; this is not a realistic model of how robots actually slip. But because we're using such a small state space, just for illustration purposes, we'll use this. And so, for example, the state transition probabilities so specified say that if you're in state (3, 1), the state 3 comma 1, and you command the robot to go north, then the chance of getting to the state (3, 2) is 0.8, the chance of getting to the state (4, 1) is 0.1, the chance of getting to (2, 1) is 0.1, and the chance of getting to other states, like (3, 3), is equal to 0.
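As a concrete sketch of this noisy transition model (the grid coordinates, the blocked cell, and the helper names below are my own choices for illustration, not from the lecture), the dynamics might look like this:

```python
# Noisy gridworld dynamics for the 11-state MDP: a commanded direction
# succeeds with probability 0.8 and veers to each perpendicular
# direction with probability 0.1. (State layout/names are illustrative.)

MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
PERP = {"N": ("W", "E"), "S": ("E", "W"), "E": ("N", "S"), "W": ("S", "N")}

# 4x3 grid with one blocked cell at (2, 2) -> eleven reachable states
STATES = {(x, y) for x in range(1, 5) for y in range(1, 4)} - {(2, 2)}

def step_to(state, direction):
    """Where the robot ends up if it actually moves in `direction`;
    hitting a wall or the blocked cell bounces it back in place."""
    dx, dy = MOVES[direction]
    nxt = (state[0] + dx, state[1] + dy)
    return nxt if nxt in STATES else state

def transition_probs(state, action):
    """P(s' | s, a): 0.8 intended move, 0.1 for each perpendicular slip."""
    probs = {}
    left, right = PERP[action]
    for direction, p in [(action, 0.8), (left, 0.1), (right, 0.1)]:
        s2 = step_to(state, direction)
        probs[s2] = probs.get(s2, 0.0) + p
    return probs
```

For example, `transition_probs((3, 1), "N")` gives 0.8 for (3, 2) and 0.1 each for (4, 1) and (2, 1), matching the numbers in the lecture.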
[01:02:07] Okay, so the state transition probabilities would capture that: if you're here and you command it to go north, there's a 0.8 chance of getting here, a 0.1 chance of getting here, a 0.1 chance of getting here, and, you know, a 0.0 chance of hopping two steps. Oh, and in this simple MDP example we'll just assume that if the robot hits a wall, it just bounces off the wall and stays where it is. So if you tell it to go west and it slips off, it just bounces off the wall and stays exactly where it was.
[01:02:43] Now let's specify the reward function. We'll come back to the discount factor later, but let's say you want the robot to navigate to this cell in the upper right-hand corner. And so to incentivize the robot to get to this square, you know, that's the prize, so in this case let's put a +1 reward there. And let's say you really don't want the robot to go to this cell, so we could put a -1 reward there, right. So the way you specify the task for a robot to do is in designing the reward function.
[01:03:40] So in our example, just carrying over the +1 and -1 from the diagram, we have that the reward at the cell (4, 3) is +1 and the reward at the cell (4, 2) is -1. And then, you know, if you want the robot to get to the +1 reward cell as quickly as possible, then, again, there are many ways of designing reward functions, but one common choice would be to put a very small negative penalty, such as setting the reward to -0.02 for all other states. And the effect of a small negative reward like this is to charge the robot, right, for every step it is just loitering around, so you charge it a little bit for using electricity and wandering around, because this incentivizes the robot to hurry up and get to the +1 reward.
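A reward function matching this design (cell coordinates as in the lecture's example; the function name is mine) could be sketched as:

```python
def reward(state):
    """Reward design from the gridworld example: +1 goal at (4, 3),
    -1 bad cell at (4, 2), and a small living penalty of -0.02
    everywhere else to discourage loitering."""
    if state == (4, 3):
        return 1.0
    if state == (4, 2):
        return -1.0
    return -0.02
```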
Right, so you give a small penalty for loitering and wasting electricity.
[01:05:11] So this is how an MDP works: your robot wakes up at some state s0 at time zero, because you turned on the robot, and the robot says, oh, I'm at this state. And based on what state it's in, it gets to choose some action a0, so it decides, do I want to go north, south, east or west, and chooses some action. Based on the action, the consequence of that choice is that it gets to some state s1 at the next time step, which is distributed according to the state transition probabilities governed by the previous state and the action it chose; so depending on what action it chose, there are different chances of moving north, south, east or west. Now that it's in s1, it then has to choose a new action a1, and as a consequence of the action a1 it gets to some new state s2, which is governed by the state transition probabilities P_{s1 a1}, and so on. Okay, and then the robot just keeps on running.
[01:06:32] And so the robot will go through a sequence of states s0, s1, s2 and so on, depending on the actions it chooses, and the total payoff is written as follows, with one more detail, which is that term gamma. So think of gamma as a number like 0.99; gamma is usually chosen to be just slightly less than one. So the total payoff is the sum of rewards, or more technically the sum of discounted rewards, R(s0) + gamma R(s1) + gamma^2 R(s2) + ..., and what this does is add up all the rewards the robot receives over time, but the further a reward is into the future, the smaller the gamma^t that that reward is multiplied by. Okay, so whatever reward you get at time zero, you get all of that; the reward at time one is multiplied by 0.99, the reward at time two by 0.99 squared, then cubed, and so on.
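The discounted total payoff just described can be sketched in a couple of lines (the function name is mine):

```python
def total_payoff(rewards, gamma=0.99):
    """Discounted return: R(s0) + gamma*R(s1) + gamma^2*R(s2) + ...
    `rewards` is the sequence of rewards received along a trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))
```

For instance, a robot that pays the -0.02 living penalty twice and then reaches the +1 cell earns `-0.02 - 0.02*0.99 + 1.0*0.99**2`, slightly less than 1 because the goal reward was discounted.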
[01:07:52] And so what the discount factor does is it has the effect of giving a smaller weight to rewards in the distant future, and this means it encourages the robot to get the positive rewards faster and to postpone the negative rewards, right. And so in financial applications the discount factor has a natural interpretation as the time value of money: if you have a dollar today, you know, you're better off having a dollar today than having a dollar a year from now, right, because you can put the dollar in the bank and earn interest for a year on your dollar, and so dollars today are worth strictly more than dollars in the future. And conversely, having to pay a dollar a year from now is also better than having to pay a dollar today, right, because you could, you know, save your money and earn interest and then issue the payment to someone else a year from now rather than now, and then you're actually slightly wealthier. And so the gamma in financial applications has an interpretation as the time value of money, or as the interest rate, I guess.
[01:09:02] But more generally, even for non-financial applications — there are some financial applications of reinforcement learning, but lots of non-financial ones as well — this mechanism of using a discount factor has the effect of encouraging the system to get to the positive rewards as quickly as possible, and conversely to try to push the negative rewards as far into the future as possible, right. Oh, and I think, to be pragmatic, there are two reasons why people use gamma. The story I just told, time value of money, positive rewards sooner, negative ones postponed, that's the story you tend to hear people say in terms of why
we have a discount factor. The other reason for the discount factor is actually a much more pragmatic one, which is that a lot of the reinforcement learning algorithms you'll see converge much faster, or work much better, if you're willing to have a discount factor. Right, so it turns out that if gamma is equal to 1, if gamma is not strictly less than 1, it's much harder: there are many reinforcement learning algorithms that may not converge, or whose proofs of convergence, you know, may not go through. So this is a pragmatic thing that makes the job much easier for your algorithms. Now I see some of you shaking your heads in disapproval, all right. [Student question.]
[01:10:31] Yeah, yes, that's a good point. So one of the things is, if there's no gamma, the reward sum, you know, could increase without bound, whereas by having gamma discount things, the total payoff stays a finite value rather than unbounded; that's one of the facts that goes into some of the proofs, some of the reasons why a lot of these algorithms converge. Yeah.
Okay, so the goal of reinforcement learning is to choose actions over time so as to maximize the expected total payoff.
[01:11:32] And in particular, what most reinforcement learning algorithms will come up with is a policy that maps from states to actions, right. So the output of most reinforcement learning algorithms will be a policy, or a controller; in the RL world we tend to use the term policy, but policy just means controller: it maps from states to actions. So it turns out that, for the MDP that we have, right, it turns out that this is the optimal policy.
[01:12:36] So, for example, I want you to take this cell here: this policy is saying that pi applied to the state (3, 1) is equal to west.
[01:12:59] So I separately worked out what the optimal policy is. To say we "execute this policy" means that whenever you're in a state s, you take the action given by pi of s; that's what it means to execute a certain policy. And it turns out that this policy, which I worked out separately, right, offline, you know, on my laptop, is the optimal policy for this MDP, and that if you execute this policy, meaning whenever you're in a certain state you take the action indicated by the arrow, this is the policy that maximizes the expected total payoff.
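Executing a policy can be sketched as a small simulation loop. The policy below is a made-up placeholder (always go north), not the lecture's optimal policy, and terminal-state handling is omitted; the dynamics repeat the 0.8/0.1/0.1 model from earlier:

```python
import random

MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
PERP = {"N": ("W", "E"), "S": ("E", "W"), "E": ("N", "S"), "W": ("S", "N")}
STATES = {(x, y) for x in range(1, 5) for y in range(1, 4)} - {(2, 2)}

def sample_next(state, action, rng):
    """Sample s' from the noisy dynamics: 0.8 intended, 0.1 each slip."""
    u = rng.random()
    direction = action if u < 0.8 else PERP[action][0 if u < 0.9 else 1]
    dx, dy = MOVES[direction]
    nxt = (state[0] + dx, state[1] + dy)
    return nxt if nxt in STATES else state  # bounce off walls

def execute_policy(pi, s0, steps, seed=0):
    """Executing policy pi: from every state s, take action pi[s].
    Returns the visited trajectory s0, s1, ..., s_steps."""
    rng, s, traj = random.Random(seed), s0, [s0]
    for _ in range(steps):
        s = sample_next(s, pi[s], rng)
        traj.append(s)
    return traj

# Hypothetical placeholder policy (NOT the lecture's optimal arrows):
pi = {s: "N" for s in STATES}
```

A real RL algorithm's job, as described next, is to produce the `pi` dictionary itself.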
[01:14:03] And the problem in reinforcement learning is: given the definition of an MDP, or given a problem, suppose the problem is an MDP, figure out what the set of states is, what the set of actions is, what the state transition probabilities are, specify a discount factor, specify a reward function, and then get a reinforcement learning algorithm to find the policy pi that maximizes the expected payoff. And then when you want your robot to act, or when you want your chess-playing program to act, whenever you're in some state s, take the action given by pi of s, and hopefully this will result in a robot that, you know, efficiently navigates to the +1 state.
[01:14:45] So it turns out that MDPs are quite good at making fine distinctions. One example: it's actually not totally obvious here whether you're better off going north or going west, right. And it turns out there is a trade-off: if you go west here, then, you know, you're going to take a longer route to get to the +1, so you take longer, the +1 is discounted more heavily, and you're taking those small penalties along the way. But on the flip side, if you were to try to go north, you could try to get there
faster, but then there's a 0.1 chance that you accidentally slip off into the -1 state. So what is the optimal action, right? It's actually quite hard to just look at it with your eyes and make a decision, but it turns out that if you solve for the optimal set of actions in this MDP, the answer in this example is to take the longer and safer route. [Student question about whether policies can contain cycles.]
[01:15:46] So if the optimal set of actions is to cycle around, then it should find that; I mean, for example, if there were only penalties everywhere and the best thing were just to go run in a circle, you know, then the algorithm would choose to do that. But in this case you want to get to the +1 as quickly as possible. [Another question.]
[01:16:29] Wait, so, all right, sure, sorry. So chess and checkers and Go and so on, there's a little more complication when you take a move, so actually, to refine the description
of chess: what happens in playing chess is that the state is your board position, right, it's your move, so you see a board, and that's the state. So you make a move, and then the opponent makes a move, and then that's the new state; so the state transitions when you and your opponent have both taken turns and it's back to you, right. And because you don't know exactly what your opponent will do, there's a probability distribution over, if I make a move, what the other person is going to do. Yeah. [Student question: the probabilities, 0.8, 0.1, 0.1 — where do those come from?]
[01:17:18] So we'll talk about that later. In some applications you learn this: if you build a robot, you might not know, is it 0.8, 0.1, 0.1, or, you know, 0.7, 0.15, 0.15. So it's quite common to use data to learn those state transition probabilities as well.
We'll see a specific example of that in a bit. Okay, so, all right, just to summarize where we are: this is how you formulate a problem as an MDP, and then the job of a reinforcement learning algorithm is to go from that MDP to telling you what a good policy is, okay.
[01:17:54] So let's break there. Have a good Thanksgiving, everyone; won't see you for a week and a half, enjoy yourselves, and we'll reconvene after Thanksgiving.
================================================================================ LECTURE 017 ================================================================================ Lecture 17 - MDPs & Value/Policy Iteration | Stanford CS229: Machine Learning Andrew Ng (Autumn 2018) Source: https://www.youtube.com/watch?v=d5gaWTo6kDM --- Transcript
[00:00:03] Welcome back, everyone, hope you had a good Thanksgiving. By the way, in reinforcement learning, which has a lot to do with robotics, right, one of the classic problems a lot of people use reinforcement learning to solve is robotics, and I
think back in May the InSight Mars lander had launched from here in California, and it's about to make an attempt at landing on the planet Mars in the next two and a half hours or so, so I'm excited about that. I think it's actually one of the grandest applications of robotics, because, you know, with the roughly 20-minute light-speed delay from Earth to Mars, once it starts this landing there's nothing anyone on Earth can do, and so I think it's actually one of the most exciting applications of autonomous robotics. You launch this thing, it's now about 20 light-minutes away from planet Earth, so you actually can't control it in real time, and you just have to hope like crazy that your software works well enough to land on this planet, you know. And we'll find out a little bit after noon whether the landing was successful or not. As you know, I, I think, I
just get excited about stuff like this, and I hope you guys do too; being from California, I mean, I take some pride that it launched from my home state of California and is now nearing its landing on Mars. All right.
[00:01:34] So what I want to do today is continue our discussion of reinforcement learning: do a quick recap of the MDP, or Markov decision process, framework, and then we'll start to talk about algorithms for solving MDPs. In particular, we need to define something called the value function, which tells you how good it is to be in different states of the MDP; then we'll define the value function and talk about an algorithm called value iteration for computing the value function, and this will help us figure out how to actually find a good controller, or finally a good policy, for the MDP. And we'll wrap up with learning state transition probabilities and
how to put [00:02:16] transition probabilities and how to put alson together [00:02:17] alson together into an actual reinforcement learning [00:02:19] into an actual reinforcement learning algorithm that you can implement to [00:02:23] algorithm that you can implement to recap our motivating example run the [00:02:26] recap our motivating example run the example from the last time from before [00:02:28] example from the last time from before Thanksgiving was this 11 state MVP and [00:02:31] Thanksgiving was this 11 state MVP and we said that an MDP comprises a five [00:02:35] we said that an MDP comprises a five tuple list of five things with States so [00:02:38] tuple list of five things with States so that example had 11 States actions and [00:02:43] that example had 11 States actions and in this example the actions were the [00:02:45] in this example the actions were the compass direction north south east and [00:02:47] compass direction north south east and west we can try to go in each of the [00:02:48] west we can try to go in each of the four compass directions the state [00:02:50] four compass directions the state transition probabilities and example if [00:02:53] transition probabilities and example if the robot attempts to go north [00:02:55] the robot attempts to go north it has 80% chance of heading north and [00:02:58] it has 80% chance of heading north and 0.1% chance of viewing off to the left [00:03:01] 0.1% chance of viewing off to the left and the point one chance of veering off [00:03:02] and the point one chance of veering off to the right gamma is a number slightly [00:03:07] to the right gamma is a number slightly less than one usually say less than one [00:03:10] less than one usually say less than one there's a discount factor think of the [00:03:12] there's a discount factor think of the 0.99 and R is the reward function that [00:03:16] 0.99 and R is the reward function that helps us specify where we want the robot [00:03:20] helps us 
specify where we want the robot to end up. [00:03:24] And so what we said last time was that the way an MDP works is: you start off in some state s_0, you choose an action a_0, and as a result the MDP transitions to a new state s_1, which is drawn according to P_{s_0 a_0}; then you choose a new action a_1, and as a result the MDP transitions to a new state drawn from P_{s_1 a_1}; and so on. The total payoff is the sum of discounted rewards, and the goal, formally, is to come up with a policy pi, which is a mapping from the states to the actions that tells you how to choose actions from whatever state you're in, such that the policy maximizes the expected value of the total payoff. [00:04:28] And so I think last time I claimed that this is the optimal policy for this MDP, and what this means, for example, is that if you look at this state, this policy is telling you that pi of
(3, 1) equals west. I guess you can write west, or left, or a left arrow, right? From the state (3, 1), the best action to take is to go left, to go west. And so if you're executing this policy, what that means is that on every step, the action you choose is pi of the state that you're in. [00:05:17] Okay, so what I'd like to do now is define the value function. So how did I come up with this policy? What I'd like you to learn is: given an MDP, given this five-tuple, how do you compute the optimal policy? And one of the challenges with finding the optimal policy is that there's an exponentially large number of possible policies. If you have eleven states and four actions per state, the number of possible policies is four to the power of eleven, and that's not that bad only because eleven is a small
MDP. The number of possible policies for an MDP is combinatorially large: it's the number of actions to the power of the number of states. So how do you find the best policy? What you'll learn today is how to compute the optimal policy. [00:06:14] Now, in order to develop an algorithm for computing the optimal policy, we'll need to define three things. Just as a roadmap, what I'm about to do is define V^pi, V*, and pi*, and based on these definitions we'll derive that pi* is the optimal policy. [00:06:44] So let's go through these definitions. First, V^pi: for a policy pi, V^pi is a function mapping from the states to the reals, such that V^pi(s) is the expected total payoff for starting in state s and executing pi. And so sometimes you write this as: V^pi(s) is the expected total payoff given that you execute the
policy pi and the initial state s_0 is equal to s. So that's the definition of V^pi; this is called the value function for the policy pi. [00:08:15] And so what the value function for a policy pi, denoted V^pi(s), tells you is: for any state you might start in, it's a function mapping from states to the reals that says, what's the expected total payoff if you start off your robot in that state and execute the policy pi? And executing the policy pi means taking actions according to the policy pi. [00:08:38] So here's a specific example. Let's consider the following policy pi. [00:08:59] This is not a great policy: from some of these states it looks like it's heading to the minus-one reward. Oh sorry, I should say the reward is plus one if we get here, and technically this is called an absorbing state, meaning
that if you ever get to the plus one or the minus one, then the world ends, and there are no more rewards or penalties after that. [00:09:18] So this is actually not a very good policy. A policy is any function mapping from the states to the actions, and this is one such policy: it tells you, in this state, to go north, which is actually a pretty bad thing to do, right, since it takes you toward the minus-one reward. So this is not a great policy, but it is a policy. And here is V^pi for this policy. [00:10:10] Don't worry too much about the specific numbers, but if you look at this policy, you see that from this set of states it's pretty efficient at getting you to the really bad reward, and from this set of states it's pretty efficient at getting you to the good reward, with some mixing because of the noise, the robot veering off to the side. And so you know
these numbers are all negative and those numbers are at least somewhat positive. So V^pi is just: if you start from, say, this state, the state (1, 1), on expectation your sum of discounted rewards will be negative 0.8. [00:10:54] So that's what V^pi is. [00:10:59] Now, the following equation governs the value function. It's called a Bellman equation, and it says that your expected payoff at a given state is the reward that you receive plus a discount factor times the future rewards. So let me actually explain the intuition behind this. Let's say you start off at some state s_0; and again, let's say s is equal to s_0. So V^pi(s) is equal to, well, just for your robot's waking up in that state (I'm going to add to this in a second), just for the fact that your robot woke up in this state s, you
get the immediate reward: you get the reward R(s_0) right away. This is called the immediate reward, because just for the good fortune or bad fortune of starting off in this state, the robot gets a reward right away. [00:12:50] And then it will take some action and get to some new state s_1, and we'll receive gamma times the reward of s_1, and then some future reward at the next step, and so on. And just to flesh out the definition, the value function V^pi is really this sum, given that you execute the policy pi and s_0 equals s, that is, you start off in the state s. [00:13:27] Now what I'm going to do is rewrite this part of the equation a little bit: I'm going to take the rest of this and factor out one factor of gamma. So let me put
parentheses around this, and just take out gamma there. Okay, so I'm just taking this piece, this was gamma squared times R(s_2), but inside the parentheses here I'm just taking out one factor of gamma that multiplies the rest of the equation. Does that make sense? So gamma R(s_1) plus gamma squared R(s_2) plus ... equals gamma times (R(s_1) plus gamma R(s_2) plus ...). That's what I did down there: just factored out one factor of gamma. [00:14:17] And so this says the value of state s is the immediate reward plus gamma times the expected future rewards; the expected value of this term is really V^pi(s_1), so the second term here is the expected future rewards. [00:14:59] So Bellman's equation says that the value of a state, the expected total payoff you get if your robot wakes up in the state s, is the immediate reward plus gamma times the expected future rewards. Okay?
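The factor-out-gamma step can be checked numerically. Here's a minimal sketch with a made-up finite reward sequence (the sum in the MDP runs forever and these `rewards` values are hypothetical, not from the lecture's gridworld); the identity being checked is R(s_0) + gamma R(s_1) + gamma^2 R(s_2) + ... = R(s_0) + gamma (R(s_1) + gamma R(s_2) + ...).

```python
# Check the factor-out-gamma identity on a hypothetical finite reward
# sequence R(s_0), ..., R(s_3). The lecture's sum is infinite, but the
# algebra is the same for any truncation.
GAMMA = 0.99
rewards = [1.0, -0.5, 2.0, 0.25]  # made-up values, not the gridworld's

# Total discounted payoff: sum_t gamma^t * R(s_t)
total = sum(GAMMA ** t * r for t, r in enumerate(rewards))

# Factor one gamma out of every term after the first:
# R(s_0) + gamma * (R(s_1) + gamma * R(s_2) + ...)
tail = sum(GAMMA ** t * r for t, r in enumerate(rewards[1:]))
assert abs(total - (rewards[0] + GAMMA * tail)) < 1e-12
```

The `tail` sum is exactly the quantity whose expectation the Bellman equation replaces with V^pi(s_1).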
Right, and this thing above the curly braces is really asking: if your robot wakes up at the state s_1 and executes pi, what is the expected total payoff? If your robot wakes up in state s_1, then it will take an action, get to s_2, take an action, get to s_3, and so on, and this is the sum of discounted rewards when it starts off in the state s_1. [00:15:52] So based on this, you can write out Bellman's equations. [00:16:22] The mapping from the equation on top to the equation at the bottom is that s maps to s_0 and s prime maps to s_1. And so we have that V^pi(s) equals, so the value of state s is, R(s) plus gamma times V^pi(s'), where this is s_0 and this is s_1. And in MDP notation, if you want to write a long sequence of states, we tend to use s_0, s_1, s_2, s_3,
and s_4 and so on; but if you want to talk about just the current state and the state you get to after one time step, we tend to use s and s prime. So that's the mapping between these two pieces of notation: s prime is the state you get to after one step. [00:17:43] Well, what is s_1, or s prime, drawn from? The state s prime, or s_1, is the state you get to after one time step, so what is the distribution that s prime is drawn from? s prime is drawn from P of what? [00:18:02] Okay, P_{sa}, because in state s you will take the action a equals pi(s). We're executing the policy pi, so that means that when you're in the state s, you're going to take the action a given by pi(s); pi(s) tells you, please take this action a when you're in state s. And so s prime is drawn from P_{sa} where a is equal to pi(s), because that's the action you took, which is why s
prime, the state you get to after one time step, is drawn from the distribution P_{s, pi(s)}. [00:19:08] So putting all that together, let me just write out again Bellman's equations, which is: V^pi(s) equals R(s) plus the discount factor times the expected value of V^pi(s'). And this term here is just the sum over s prime of P_{s pi(s)}(s') times V^pi(s'); that underlined term, I guess, is this underlined term here. [00:19:43] Now, notice that this gives you a linear system of equations for actually solving for the value function. So let's say I give you a policy; it could be a good policy, could be a bad policy, and you want to solve for V^pi(s). If you think of the V^pi(s) as the unknowns you're trying to solve for, given pi, these equations, [00:20:27] these Bellman's equations, define a linear system of equations in terms of the V^pi(s) as the
values to be solved for. [00:20:40] So maybe here's a specific example: let's take the state (3, 1). [00:20:52] What Bellman's equation tells us is that V^pi of the state (3, 1) is equal to the immediate reward you get at the state (3, 1), plus the discount factor times the sum over s prime of P_{s pi(s)}(s') times V^pi(s'). And let's say that pi of (3, 1) is north, so we try to go north. If you try to go north from that state, then you have a 0.8 chance of getting to (3, 2), plus a 0.1 chance of veering left, plus a 0.1 chance of veering right. [00:21:58] So that's what Bellman's equation says about these values. [00:22:04] And if your goal is to solve for the value function, then these things I'm circling in purple are the unknown variables. And if you have eleven states, like in our MDP, then this gives you a system of eleven linear equations in eleven unknowns, and so using a linear algebra solver you can
solve explicitly for the values of these eleven unknowns. [00:22:37] So let's say I give you a policy pi, any policy pi. The way you can solve for the value function is to create an eleven-dimensional vector with V^pi(1, 1), V^pi(1, 2), and so on, down to the last state, V^pi(4, 3) or whatever, since you have eleven states. So if you want to solve for those eleven numbers I wrote up just in terms of defining V^pi, what you can do is this: given a policy pi, you construct an eleven-dimensional vector of the unknown values that you want to solve for, and Bellman's equations for each of the eleven states, with each of the eleven states plugged in on the left-hand side, give you one equation for how one of the values is determined as a linear
function of a few of the other values in this vector. And so what this does is set up a linear system of equations with eleven equations and eleven unknowns, and using a linear algebra solver you will be able to solve this linear system of equations. Does that make sense? [00:24:09] All right, and so this works; the nice thing is that if you have eleven states, it takes almost no time for a computer to solve this system of eleven equations. So that's how you would actually get those values if you were called on to solve for V^pi. [00:24:30] Actually, let me just say: raise your hand if what I just explained made sense. Cool, awesome. [00:24:45] All right, good. So moving on in our roadmap, we've defined V^pi; let's now define V*. [00:25:13] So V* is the optimal value function, and we'll define it as: V*(s) equals the max over all policies pi of V^pi(s). One of
the slightly confusing things about reinforcement learning terminology is that there are two types of value function: there's the value function for a given policy pi, and there's the optimal value function V*. So both of these are called value functions, but one is the value function for a specific policy, which could be a great policy, could be a terrible policy, could be the optimal policy; the other is V*, which is the optimal value function. So V* is defined as: look across all of the possible policies you could have, all four to the eleven, or whatever combinatorially large number of possible policies there is for the MDP, and take the max; that is, of all the possible policies anyone could implement, take the value of the best possible policy for that
state. So that's V*; that's the optimal value function. [00:26:36] And it turns out that there is a different version of Bellman's equations for this: again, there are Bellman's equations for V^pi, the value of a policy, and then there's a different version of Bellman's equations for the optimal value function. So just as there are two versions of value functions, there are two versions of Bellman's equations. But let me just write this out; hopefully it will make sense. [00:27:18] Actually, let's think this through. Let's say you start off your robot in a state s. What is the best possible expected sum of discounted rewards, what's the best possible payoff it can get? Well, just for the privilege of waking up in state s, the robot will receive an immediate reward R(s), and then it has to take some action, and after taking some action it will get to some other state s prime. And from that other state
s prime, it will receive future expected rewards V*(s'), and we have to discount that by gamma. [00:28:06] Well, the state s prime was arrived at by taking some action a from the initial state, and whatever the action is, if you take an action a in the state s, then your expected total payoff will be the immediate reward plus gamma times the expected value of the future payoff. But what is the action a that we should plug in here? [00:28:48] Well, the optimal action to take in the MDP is whatever action maximizes your expected total payoff, maximizes your expected sum of rewards, which is why the action you want to plug in is just whatever action a maximizes that term. [00:29:05] So this is Bellman's equations for the optimal value function, which says that the best possible expected total payoff you could receive starting from state s is the immediate reward R(s) plus the max
expected total payoff you could receive starting from state s [00:29:17] you could receive starting from state s is the immediate reward R of s plus max [00:29:21] is the immediate reward R of s plus max over all possible actions of whatever [00:29:23] over all possible actions of whatever action allows you to maximize you know [00:29:25] action allows you to maximize you know your expected total payoff expect a [00:29:28] your expected total payoff expect a future payoff [00:29:29] future payoff okay so this is the expected future [00:29:32] okay so this is the expected future payoff expected future reward now based [00:29:51] payoff expected future reward now based on the argument we just went through [00:29:54] on the argument we just went through this allows us to figure out how to [00:29:58] this allows us to figure out how to compute PI star of s as well right which [00:30:04] compute PI star of s as well right which is let's say let's say we have a way of [00:30:08] is let's say let's say we have a way of computing V star of s right we don't yet [00:30:10] computing V star of s right we don't yet but let's say I tell you what does V [00:30:12] but let's say I tell you what does V Sarvis and then I'll see you you know [00:30:15] Sarvis and then I'll see you you know what is the action you should take in a [00:30:17] what is the action you should take in a given stage so remember PI spy star of [00:30:20] given stage so remember PI spy star of PI star is going to auto policy and so [00:30:29] PI star is going to auto policy and so what should PI star vests be right which [00:30:31] what should PI star vests be right which is let's say let's say we're we're [00:30:33] is let's say let's say we're we're computing V Star and I now ask you hey [00:30:37] computing V Star and I now ask you hey my robots in state s what is the best [00:30:39] my robots in state s what is the best action I should take from the state s [00:30:42] action I should take from the state s 
right then how do I decide what action [00:30:46] right then how do I decide what action to take in the state yes [00:30:49] to take in the state yes well what would think is the best action [00:30:51] well what would think is the best action to take from this state and the answer [00:30:55] to take from this state and the answer is almost given in the equation of oh [00:30:57] is almost given in the equation of oh yeah [00:31:01] yeah cool awesome right so the best [00:31:05] yeah cool awesome right so the best action to take and state us and best [00:31:07] action to take and state us and best means maximizing expected total payoff [00:31:10] means maximizing expected total payoff but the option that maximizes your [00:31:12] but the option that maximizes your expenses total payoff is you know well [00:31:13] expenses total payoff is you know well whatever action we were choosing a up [00:31:16] whatever action we were choosing a up here and so it's just long max over a [00:31:27] and because gamma is just a constant [00:31:30] and because gamma is just a constant that doesn't affect the arcmap usually [00:31:32] that doesn't affect the arcmap usually we just we just eliminate that this is [00:31:34] we just we just eliminate that this is just a positive number right so this [00:31:42] just a positive number right so this gives us the strategy we will use for [00:31:46] gives us the strategy we will use for finding the also policy for an MVP which [00:31:50] finding the also policy for an MVP which is we're going to find a way to compute [00:31:54] is we're going to find a way to compute V Star of S which we don't have a way of [00:31:56] V Star of S which we don't have a way of doing yet rightly star was defined as a [00:31:58] doing yet rightly star was defined as a max over combinatorially or [00:32:00] max over combinatorially or exponentially large number policies so [00:32:02] exponentially large number policies so we don't have way of computing piece not 
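As a minimal sketch of the argmax step just described — reading the optimal action off a known V star — here's some hypothetical Python. The state and action names, the `P[(s, a)]` dictionary convention, and the V values below are all made up for illustration; only the formula pi star of s = argmax over a of sum over s prime of P(s, a, s') times V star of s prime comes from the lecture (gamma is dropped since, as a positive constant, it doesn't affect the argmax).

```python
# Hypothetical sketch: extracting the optimal policy pi* from V*.
# pi*(s) = argmax_a sum_{s'} P(s, a, s') * V*(s')
# P[(s, a)] is an invented convention: a dict mapping next state -> prob.

def extract_policy(states, actions, P, V):
    pi = {}
    for s in states:
        pi[s] = max(
            actions,
            key=lambda a: sum(p * V[s2] for s2, p in P[(s, a)].items()),
        )
    return pi

# Tiny made-up example: action "stay" keeps you in s0 (value 0.1),
# action "go" moves you to s1 (value 1.0), so "go" is optimal in s0.
states = ["s0", "s1"]
actions = ["stay", "go"]
P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"): {"s1": 1.0},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"): {"s1": 1.0},
}
V = {"s0": 0.1, "s1": 1.0}
pi = extract_policy(states, actions, P, V)
# pi["s0"] is "go", since 1.0 * V(s1) beats 1.0 * V(s0)
```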
[00:32:03] But if we can find a way to compute V star — sorry, I just scratched that out myself — then this equation gives you a way, for every state s, to pretty efficiently compute this argmax, and therefore figure out the optimal action for every state. [00:32:54] All right, so, just to practice with this confusing notation, let's see if you understand this equation. I'm just claiming this, I'm not proving it: for every state s, V star of s equals V pi star of s, which is greater than or equal to V pi of s, for every policy pi and every state s. Okay, so I hope this equation makes sense — this is what I'm claiming; I didn't prove it. What I'm claiming is that the optimal value for state s — on the left, the optimal value function — equals the value function for pi star, that is, the value function for a specific policy pi, [00:33:51] where the policy pi happens to be pi star. So the optimal value for state s is equal to the value function for pi star applied to the state s, and this is greater than or equal to V pi of s for any other policy pi. [00:34:17] So the strategy you can use for finding the optimal policy is: one, find V star; two, use the argmax equation to find pi star. Okay, and step two we know how to do from the argmax equation, so what we're going to do is develop an algorithm for actually computing V star, because if you can compute V star, then this equation allows you to pretty quickly find the optimal action for every state. [00:35:32] So value iteration is an algorithm you can use to find V star. Let me just write out the algorithm. [00:36:17] In the value iteration algorithm, you initialize the estimated value of every state to zero, and then you update these estimated values using Bellman's equations — the optimal value function, V star, version of Bellman's equations. [00:36:48] And to be concrete about how you implement this: if you were implementing this in Python, what you would do is create an 11-dimensional vector to store all the values V of s. So you create, you know, an 11-dimensional vector that represents V of (1,1), V of (1,2), down to V of (4,3), corresponding to the 11 states. Oh, I'm sorry — wait, did I say 11? Aren't there 10 states in the MDP? Wait — I've been saying 11 all along. Sorry — oh yes, you're right, 11, sorry, yes, okay. [00:37:45] So for the 11-state MDP you'd create an 11-dimensional vector and initialize all of these values to 0, and [00:37:54] then you will repeatedly update the estimated value of every state according to Bellman's equations. And there are actually two ways to interpret this, similar to gradient descent: we've written out, you know, a gradient descent rule for updating the vector of parameters theta, and what you do there is update all of the components of theta simultaneously — that's called a synchronous update in gradient descent. So one way you would apply this update, in what's called a synchronous update, [00:38:47] is that you compute the right-hand side for all 11 states, and then you simultaneously overwrite all 11 values at the same time; then you compute all 11 right-hand sides again, and then again simultaneously update all 11 values. Okay.
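To make the synchronous update concrete, here's a hypothetical Python sketch. Everything specific — the two-state MDP, the `P[(s, a)]` dictionary convention, the reward values — is invented for illustration; it is not the 11-state gridworld from lecture. The key point is that all right-hand sides are computed before any value is overwritten.

```python
# Hypothetical sketch of (synchronous) value iteration:
#   V(s) := R(s) + gamma * max_a sum_{s'} P(s, a, s') * V(s')
# The tiny MDP below is made up for illustration.

def value_iteration(states, actions, P, R, gamma, n_iters=100):
    V = {s: 0.0 for s in states}          # initialize every V(s) to 0
    for _ in range(n_iters):
        # Compute the right-hand side for all states first...
        new_V = {
            s: R[s] + gamma * max(
                sum(p * V[s2] for s2, p in P[(s, a)].items())
                for a in actions
            )
            for s in states
        }
        V = new_V                          # ...then overwrite simultaneously
    return V

# Made-up two-state chain: s0 (reward 0) leads to s1 (reward 1),
# which loops back to itself.
states = ["s0", "s1"]
actions = ["a"]
P = {("s0", "a"): {"s1": 1.0}, ("s1", "a"): {"s1": 1.0}}
R = {"s0": 0.0, "s1": 1.0}
V = value_iteration(states, actions, P, R, gamma=0.9)
# V(s1) converges toward 1 / (1 - 0.9) = 10, and V(s0) toward 0.9 * 10 = 9
```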
The alternative would be an asynchronous update. [00:39:13] In an asynchronous update, what you do is compute V of (1,1) — and the value of V of (1,1) depends on some of the other values on the right-hand side, right — but in an asynchronous update you compute V of (1,1) and overwrite that value first, and then you use that equation to compute V of (1,2), and then you update that, and you update these one at a time. And the difference between synchronous and asynchronous is, you know, if you're using an asynchronous update, by the time you're computing V of (4,3), which depends on some of the earlier values, you'd be using the new, refreshed values of some of the earlier states on your list. Okay. It turns out that value iteration works fine with either synchronous or asynchronous updates, but because the synchronous version vectorizes better — you can use more efficient [00:40:07] matrix operations — most people use the synchronous update. It turns out the algorithm will work whether you use a synchronous or asynchronous update, but unless otherwise stated, you should usually assume that when I talk about value iteration I'm referring to the synchronous update, where you compute all 11 values and then update all 11 values at the same time. Okay, was there a question just now? [00:40:53] Yeah — yes. So, how do you represent the absorbing state, the sink state? Say we go to plus one or minus one and the world ends — in this framework, one way to code that up would be to say that the transition probability from that state to any other state is zero. That's one way; that would work. Another way, which is done less often — maybe mathematically cleaner, but not how people tend to do it — would be to take your
MDP and then create an extra state, and that extra state always goes back to itself with no further reward. So both of these would give you the same result, though it's maybe more convenient to just set, you know, P of s comma a comma s prime equal to 0 for all other states — it's not quite how I defined things earlier, but that will give you the right answer as well. All right, cool. [00:41:48] So, just as a point of notation: if you're using synchronous updates, you can think of this as taking the old value-function estimate and using it to compute the new estimate. So, assuming the synchronous update, you have some previous 11-dimensional vector with your estimate of the value from the previous iteration, and after doing one iteration of this you have a new set of estimates. So one step of this algorithm is sometimes called the Bellman backup operator, and you'd write the update as [00:42:35] V := B of V, right, where V is now an 11-dimensional vector: you take V, the original vector, compute the Bellman backup operator — which is just that equation there — and update V according to B. And so one thing that you'll see in the problem set is showing that this will make V converge to V star. [00:43:26] So it turns out that you can prove — and you'll see more details in the problem set — that by repeatedly enforcing Bellman's equations, this algorithm will cause your vector of eleven values, V, to converge to the optimal value function V star. Okay, and for more details see the homework and the lecture notes. And it turns out this algorithm actually converges quite quickly. So to give you a flavor: if the discount factor is 0.99, it turns out that you can show that the error reduces by a factor of 0.99 on every iteration, and so V actually converges quite quickly — geometrically, exponentially quickly — to the optimal value function V star. So if the discount factor is 0.99, then within a few hundred iterations V would be very close to V star; and if the discount factor is 0.9, then with just, you know, ten or a few dozen iterations V would be very close to V star. So this algorithm converges quite quickly to V star. [00:45:15] So, just to put everything together: if you run value iteration on that MDP, [00:45:37] you end up with this — so this is V star. So these are eleven numbers telling you the optimal expected payoff for starting off in each of the eleven possible states.
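The backup-operator view and the claimed convergence rate can be sketched together on a made-up one-state MDP. Everything concrete here is invented for illustration (the `P[(s, a)]` / `R[s]` conventions, the self-loop example); this illustrates the error-shrinks-by-gamma behavior on a toy case, it does not prove the general contraction result from the problem set.

```python
# Sketch: one step of value iteration as the Bellman backup operator B,
#   (B V)(s) = R(s) + gamma * max_a sum_{s'} P(s, a, s') * V(s'),
# so value iteration is just V := B(V), repeated. On this made-up
# one-state MDP with a self-loop, the fixed point is V* = R / (1 - gamma),
# and the error |V - V*| shrinks by exactly gamma per iteration.

def bellman_backup(V, states, actions, P, R, gamma):
    return {
        s: R[s] + gamma * max(
            sum(p * V[s2] for s2, p in P[(s, a)].items()) for a in actions
        )
        for s in states
    }

gamma = 0.99
states, actions = ["s"], ["a"]
P = {("s", "a"): {"s": 1.0}}           # self-loop with probability 1
R = {"s": 1.0}
V_star = R["s"] / (1.0 - gamma)        # fixed point: 100.0

V, errors = {"s": 0.0}, []
for _ in range(6):
    V = bellman_backup(V, states, actions, P, R, gamma)
    errors.append(abs(V["s"] - V_star))

ratios = [errors[i + 1] / errors[i] for i in range(len(errors) - 1)]
# each ratio equals gamma = 0.99 (up to floating point)
```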
[00:46:17] And I had previously said — I think last week, or the week before Thanksgiving — that this is the optimal policy. So, you know, let's just use as a case study how you compute the optimal action for that state, given this V star. [00:46:42] All right, well, what you do is you just use this equation. And so if you were to go west — if you were to compute, I guess, this term, sum over s prime of P of s comma "west" (or "left," I guess) comma s prime, times V star of s prime — [00:47:29] right, so if you're in this state and you attempt to go left, then there's a 0.8 chance you end up there, with a V star of 0.75; there's a 0.1 chance — if you try to go left, there's a 0.1 chance — you veer off to the north and get 0.69; and then there's a 0.1 chance that you actually go south, bounce off the wall, and end up with 0.71. And so your expected future rewards — your expected future payoff, given this [00:48:03] equation — is that if you attempt to go west, you end up with 0.740 as your expected future reward. Whereas if you were to go north, you do a similar computation — [00:48:18] you know, 0.8 times 0.69, plus 0.1 times 0.75, plus 0.1 times 0.49, weighting the average appropriately — and you find that it's equal to 0.676, which is your expected future reward for going north. So if you go west — er, left — it's 0.740, which is quite a bit higher than if you go north, which is why we can conclude, based on this little calculation, that the optimal policy is to go left at that state. And then, really, technically, you check north, south, east, and west and make sure that going west gives the highest reward — and that's how you can conclude that going west is actually the better action at this state. Okay, so that's value iteration, and based on this, if you're given an MDP, you can implement this and solve for V star.
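The two expectations just computed can be double-checked directly. The V star values 0.75, 0.69, 0.71, and 0.49 are the neighboring-cell values read off the board in lecture, and the 0.8 / 0.1 / 0.1 split is the gridworld's noisy dynamics (intended direction succeeds with probability 0.8, veer to either side with probability 0.1 each):

```python
# Checking the lecture's worked example: expected future payoff
#   sum_{s'} P(s, a, s') * V*(s')
# for attempting to go west vs. attempting to go north.

expected_west = 0.8 * 0.75 + 0.1 * 0.69 + 0.1 * 0.71
expected_north = 0.8 * 0.69 + 0.1 * 0.75 + 0.1 * 0.49

# expected_west  = 0.740
# expected_north = 0.676  ->  west is the better of the two actions
```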
And then you can compute pi star. [00:49:20] A few more things I'll go over, but before I move on, let me check — are there any questions? Oh, sure, yep. [00:49:34] Is the state space always finite? So in what we're discussing so far, yes, but what we'll see on Wednesday is how to generalize this framework. Well, I'll allude to this a little bit later, but it turns out if you have a continuous-state MDP, one of the things that's often done, I guess, is to discretize it into a finite number of states; but then there are also some other versions of, you know, value iteration that apply directly to continuous states as well. [00:50:19] So what I described is an algorithm called value iteration. The other common, sort of textbook algorithm for solving MDPs is called policy iteration, and let me just write out what the algorithm is. So here's the algorithm, which is, um — you know, initialize pi randomly. [00:51:42] Okay, so let's see what this algorithm does. So let's talk about pros and cons of value iteration versus policy iteration a little bit. In policy iteration, instead of solving for the optimal value function — in value iteration the focus of attention was V star, right, where you do a lot of work to try to find the value function, and then once you've solved for V star you figure out the best policy — in policy iteration, the focus of attention is on the policy pi rather than the value function. So initialize pi randomly — that means for each of the 11 states, pick a random action as the random initialization — and then we're going to repeatedly carry out these two steps. The first step is: solve for the value function for the policy pi. [00:52:36] Right — remember, for V pi this was a linear system of equations, with eleven variables, eleven unknowns: it was a linear system of eleven equations with eleven unknowns. And so, [00:52:50] using a sort of linear algebra solver, a linear equation solver, given a fixed policy pi, you can — at the cost of inverting a matrix, roughly, right — solve for all of these eleven values. And so in policy iteration you would, you know, use a linear solver to solve for the value function for this policy pi that we just randomly initialized, and then set V to be the value function for that policy. Okay, and so this is done quite efficiently with a linear solver. And then the second step of policy iteration is: pretend that V is the optimal value function, and update pi of s using the Bellman equations for the optimal value function — where you update it as you saw, right, the same way you updated pi earlier. And then you iterate: given a new policy, you then solve that linear system of equations for your new policy pi [00:54:00] to get a new V pi, and you keep on iterating these two steps until it converges. Okay — yeah, yep, yes, that's right. [00:54:22] So in value iteration, you're waiting until the end to compute pi of s — you solve for V star first and then compute pi of s — whereas in policy iteration, we're coming up with a new policy on every single iteration. Okay. So, pros and cons — and it turns out that this algorithm will also converge to the optimal policy. Pros and cons of policy iteration versus value iteration: policy iteration requires solving this linear system of equations in order to get V pi, and with eleven states that's really easy — you solve a linear system of eleven equations to get V pi.
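A minimal sketch of those two repeated steps — policy evaluation by a linear solve, then greedy improvement — on a made-up two-state MDP. The state/action names, transition matrices, and rewards are invented; only the two-step structure and the use of a linear equation solver for V pi come from the lecture.

```python
# Hypothetical sketch of policy iteration on a made-up two-state MDP.
# Step 1 (policy evaluation): V^pi solves the linear system
#   (I - gamma * P_pi) V = R, handled here by a linear solver.
# Step 2 (improvement): act greedily, pretending V is optimal.
import numpy as np

gamma = 0.9
R = np.array([0.0, 1.0])                  # R(s0), R(s1)
P = {                                     # P[a][s, s'] = transition prob.
    "stay": np.array([[1.0, 0.0], [0.0, 1.0]]),
    "go":   np.array([[0.0, 1.0], [0.0, 1.0]]),
}
actions = ["stay", "go"]

pi = ["stay", "stay"]                     # (random-ish) initial policy
for _ in range(10):
    # Step 1: solve the 2-equation linear system for V^pi.
    P_pi = np.array([P[pi[s]][s] for s in range(2)])
    V = np.linalg.solve(np.eye(2) - gamma * P_pi, R)
    # Step 2: greedy update of pi; R(s) and the positive constant gamma
    # don't affect the argmax over actions, so they're omitted.
    pi = [max(actions, key=lambda a: float(P[a][s] @ V)) for s in range(2)]

# pi settles at ["go", "stay"]: from s0 you should move to the rewarding
# state s1; in s1 both actions are equivalent (ties go to "stay").
```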
[00:55:22] set of states like eleven states are really anything you know like a few [00:55:23] really anything you know like a few hundred States policy raishin we're [00:55:27] hundred States policy raishin we're quite quickly but if you have a [00:55:30] quite quickly but if you have a relatively large set of states you know [00:55:32] relatively large set of states you know like ten thousand stays or a million [00:55:35] like ten thousand stays or a million states then this step would be much [00:55:38] states then this step would be much slower at least if you do it right by [00:55:40] slower at least if you do it right by solving the system of equations and then [00:55:42] solving the system of equations and then I would favor a value iteration over [00:55:44] I would favor a value iteration over policy iterations so for larger problems [00:55:46] policy iterations so for larger problems usually value iteration will usually I [00:55:51] usually value iteration will usually I would use value iteration because [00:55:53] would use value iteration because solving this linear system of equations [00:55:55] solving this linear system of equations you know this is pretty expensive if [00:55:58] you know this is pretty expensive if it's a good million by there's a million [00:56:00] it's a good million by there's a million equations a million unknowns that's [00:56:02] equations a million unknowns that's quite expensive [00:56:03] quite expensive but if in Lebanon stays in Lebanon knows [00:56:04] but if in Lebanon stays in Lebanon knows there's very small system equations and [00:56:07] there's very small system equations and then one one other pros and cons one of [00:56:09] then one one other pros and cons one of the difference that's maybe maybe more [00:56:13] the difference that's maybe maybe more academic than practical but it turns out [00:56:16] academic than practical but it turns out that if you use value iteration V will [00:56:19] that if you use value 
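The two steps just described, evaluating the current policy exactly with one linear solve and then improving it greedily, can be sketched in a few lines of NumPy. This is my own illustrative sketch, not code from the lecture; the (A, S, S) transition array, the state-based reward vector, and all names here are assumptions.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.99, max_iter=1000):
    """Policy iteration sketch for a small discrete MDP.

    Assumed layout (mine, not from the lecture):
      P: shape (A, S, S), P[a, s, s2] = Pr(next state s2 | state s, action a)
      R: shape (S,), reward received in each state
    """
    A, S, _ = P.shape
    pi = np.zeros(S, dtype=int)            # arbitrary initial policy
    for _ in range(max_iter):
        # Step 1: policy evaluation, a single linear solve:
        #   V = R + gamma * P_pi V   =>   (I - gamma * P_pi) V = R
        P_pi = P[pi, np.arange(S), :]      # (S, S): each row follows pi
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R)
        # Step 2: greedy improvement using Bellman's equation
        new_pi = np.argmax(P @ V, axis=0)  # (A, S) values -> best action per state
        if np.array_equal(new_pi, pi):     # policy stopped changing:
            break                          # it is now exactly optimal
        pi = new_pi
    return pi, V
```

Note the termination test: the loop stops after a finite number of iterations, once the greedy policy stops changing.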
[00:56:09] And then one other pro and con, one other difference that's maybe more academic than practical: it turns out that if you use value iteration, V will converge to, what, V star, but it won't ever get to exactly V star, right? Just as if you apply gradient descent to linear regression: gradient descent gets closer and closer and closer to the global optimum, but it never, you know, reaches exactly the global optimum; it just gets really, really close, really fast. Gradient descent, it actually turns out, asymptotically converges geometrically, really quickly, right, but it never quite gets definitively to the one optimal value; whereas, as you saw, the normal equations just jump straight to the optimal value, with no slow converging. And so value iteration converges toward V star, but it doesn't ever end up at exactly the value V star. [00:57:04] This difference may be a bit academic, because in practice it doesn't matter, right? But in policy iteration, if you iterate this algorithm, then after a finite number of iterations the algorithm will stop changing, meaning that after a certain number of iterations pi of s just doesn't change anymore, right? So you find pi of s, update the value function, and then after another iteration, when you take these argmaxes, you end up with exactly the same policy. And so this gets the optimal value and the optimal policy exactly: it doesn't just converge toward the optimal value, it actually reaches the optimal value when it converges, okay? [00:57:51] So I think in practice I actually see value iteration used much more, because solving this linear system of equations gets expensive, you know, if you have a large state space. So it's usually value iteration; I see value iteration used much more. But if you have a small problem, you know, I think you could also use policy iteration, which might converge a little bit faster.
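For contrast, here is the same kind of sketch for value iteration: repeated Bellman backups that approach V star geometrically, so in practice you stop once successive sweeps agree to within a tolerance rather than waiting for exact convergence. As before, the (A, S, S) transition layout and the names are my own assumptions, not the lecture's code.

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-8):
    """Value iteration sketch: repeated Bellman backups.

    Assumed layout (mine): P has shape (A, S, S), R has shape (S,).
    V approaches V* geometrically but never reaches it exactly, so
    we stop once successive sweeps agree to within a tolerance.
    """
    V = np.zeros(P.shape[1])
    while True:
        V_new = R + gamma * np.max(P @ V, axis=0)  # Bellman backup
        done = np.max(np.abs(V_new - V)) < tol     # geometric convergence
        V = V_new
        if done:
            return np.argmax(P @ V, axis=0), V     # extract greedy policy last
```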
[00:58:23] So the last thing is kind of putting it together, right: what if you don't know the model? So it turns out that when you apply this to a practical problem, you know, in robotics, right, one common scenario you run into is that you do not know P of s prime given s, a, if you don't know the state transition probabilities, right? So when we built the MDP we said, well, let's say the robot, when you command it to go off in some direction, has a 0.8 chance of going that way and a 0.1 chance of veering off. This is a very simplified robot, but if you build an actual robot, or build a, you know, helicopter or whatever, or play chess against an opponent, the state transition probabilities are often not known in advance. And so in many MDP implementations you need to estimate this from data.
[00:59:40] And so the workflow of many, many reinforcement learning projects will be that you have some policy and have the robot run around; you know, just have the robot run around the maze, and count up all the times you had it take the action north: how often did it actually go north, and how often did it veer off left or right, right? And you use those statistics as state transition probabilities. So let me just write this out. After, you know, taking maybe a random policy, after executing some policy in the MDP for a while, you would then estimate this from data, and the obvious formula would be to estimate P of (s, a, s prime) as the number of times you took action a in state s and got to s prime, divided by the number of times you took action a in state s. [01:00:51] That's right, and that estimate of P of (s, a, s prime) is actually the maximum likelihood estimate: you look at the number of times you took action a in state s, and take the fraction of those times that you got to the state s prime, right? Or 1 over the number of states: a common, you know, heuristic is, if you've never taken this action in this state before, if the number of times you tried action a in state s is zero, so you've never tried this action in this state and you have no idea what it's going to do, then just assume that the state transition probability is uniform, 1 over 11, right: it randomly takes you anywhere. So these would be rather common heuristics that people use when implementing reinforcement learning algorithms.

[01:01:46] And it turns out that you can use Laplace smoothing for this if you wish, but you don't have to. So Laplace smoothing, right, would add one to the numerator and eleven to the denominator, and that avoids the problem of zero over zero as well. But it turns out that, unlike the naive Bayes algorithm, these MDP solvers are not that sensitive to zero values. So if one of your estimated probabilities is zero, you know, unlike naive Bayes, where having a zero probability was very problematic for the classifications made by naive Bayes, it turns out that MDP solvers, including value iteration and policy iteration, do not give sort of nonsensical or horrible results just because a few of the probabilities are exactly zero. And so in practice, you know, you can use Laplace smoothing if you wish, but because the reinforcement learning algorithms don't perform that badly when a few of these estimates are exactly zero, in practice Laplace smoothing is not commonly used; what I just wrote is more common.

[01:03:27] So, to put it together: if I give you a robot and ask you to implement an MDP solver to find a good policy for this robot, what you would do is the following. Repeat: take actions with respect to some policy pi to get experience in the MDP, so go ahead and let your robot loose and have it execute some policy for a while; then update your estimates of P of (s, a, s prime) based on the observations of where the robot goes when it takes different actions in different states; solve Bellman's equation using value iteration to get V; and then update pi. So this is the value iteration way of putting it together; if you want to plug in policy iteration instead, that's also okay. But so if you actually get a robot, you know, right, a robot where you do not know the state transition probabilities in advance, then this is what you would do: you iterate a few times, I guess, right, repeatedly finding a policy given your current estimate of the state transition probabilities, getting some experience, updating your estimates of P of (s, a, s prime), finding a new policy, and kind of repeating this process until it hopefully converges to a good policy.
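The count-based estimate and the estimate-then-solve loop above can be sketched as follows. Again this is an illustrative sketch, not the lecture's code: the (A, S, S) count array, the uniform fallback for untried pairs, and the hypothetical run_policy experience source are all my own assumptions.

```python
import numpy as np

def estimate_transitions(counts, n_states):
    """Maximum likelihood estimate of the state transition probabilities.

    counts[a, s, s2] = number of times taking action a in state s led to
    state s2. For (s, a) pairs that were never tried, fall back to the
    uniform 1/|S| heuristic from lecture.
    """
    totals = counts.sum(axis=2, keepdims=True)       # times a was tried in s
    return np.where(totals > 0,
                    counts / np.maximum(totals, 1),  # fraction that reached s2
                    1.0 / n_states)                  # never tried: uniform

# The overall estimate-then-solve loop, with run_policy standing in for
# whatever collects (s, a, s2) experience from the robot (hypothetical):
#
#   counts = np.zeros((A, S, S))
#   pi = some initial policy
#   repeat:
#       for (s, a, s2) in run_policy(pi):
#           counts[a, s, s2] += 1
#       P_hat = estimate_transitions(counts, S)
#       pi, V = solve the MDP (value or policy iteration) on P_hat
```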
[01:06:03] Now, just to add more color, more richness to this: we usually think of the reward function as being given, right, as part of the problem specification, but sometimes you see that the reward function may be unknown as well. And so, for example, if you're building a stock trading application and the reward is the return on a certain day, it may not be a deterministic function of the state; it may be a little bit random. Or if your robot is, you know, running around, then depending on where it goes it may hit different bumps in the road, and you want to give it a penalty every time it hits a bump; if you build a self-driving car, right, every time it hits a bump, hits a pothole, you give it a negative reward. So sometimes the rewards are a random function of the environment, and so sometimes you can also estimate the expected value of the reward: in applications where the reward is a random function of the state, this same process allows you to also estimate the expected value of the reward from every state, and then run the same procedure, okay?
[01:07:30] Yeah? Yep, cool, great question. So let me, let me talk about exploration, right. So it turns out that, um, this algorithm will work okay for some problems, but, again to add richness to this, there's one other issue that this is not solving, which is the exploration problem. In reinforcement learning you sometimes hear the term exploration versus exploitation. [01:08:01] Let me use a different MDP example, which is, um: your robot, you know, starts off here, and there is a plus-one reward here, right, and maybe a plus-ten reward over here. If, just by chance, the first time you run the robot it happens to find its way to the plus one, then if you run this algorithm it may figure out that going to the plus one is a good thing, right; remember we're giving a discount factor, plus a fuel surcharge of minus 0.02 on every step. So if, just by chance, your robot happens to find this way to the plus one the first few times you run this algorithm, then this algorithm is itself locally greedy, right? [01:08:56] It may figure out that this is a great way to get to the plus-one reward, and then the world ends and it stops getting these minus 0.02 surcharges for fuel. And so this particular algorithm may converge to a bad, you know, kind of local optimum, where it's always heading to the plus one, and as it heads to the plus one, sometimes veering off randomly, right, it accumulates a little bit more experience in the right half of the state space, and ends up with a pretty good estimate of what happens in the right half of the state space, and it may never find the harder-to-find plus-ten pot of gold over on the lower left, okay? [01:09:38] So this problem is sometimes called, actually, no, it is called the exploration versus exploitation problem, which is: when you're acting in an MDP, you know, how aggressively, how greedily, should you be at just taking actions to maximize your rewards? And so the algorithm as described is relatively greedy, meaning that it's taking your best estimate of the state transition probabilities and rewards and just taking whatever actions follow; this is really saying, you know, pick the policy that maximizes your current estimate of the expected rewards, and it's just acting greedily, meaning on every step it's just executing the policy that it thinks allows it to maximize the expected payoff. [01:10:28] And what this algorithm does not do at all is explore, which is the process of taking actions that may appear less optimal at the outset, such as when the robot hasn't seen this plus-ten reward and doesn't know how to get there.
[01:10:44] Maybe it should, you know, just try going left a couple of times, just for the heck of it, right, to see what happens. Because even if going left seems less optimal from the perspective of the robot's current state of knowledge, maybe if it tries some new things it has never tried before, maybe it'll find a new pot of gold, okay? So this is called the exploration versus exploitation trade-off. [01:11:07] Oh, and this is actually not just an academic problem; it turns out that some of the large online web advertising platforms have the same problem as well. Again, I've mixed feelings about the advertising business; it's very lucrative and it causes other problems as well. But it turns out that for some of the large online platforms, you know, when an advertiser starts running a new ad, posting a new ad on one of the large online ad platforms, the ad platform does not know who is most likely to click on this ad. [01:11:39] And so pure exploitation... boy, "exploitation" is such a horrible word; no, here it's the technical term, not the social term; I mean it in the pure, you know, reinforcement learning sense of an exploitation policy, not, not the other, even more horrible, sense of exploitation. A pure exploitation policy would be to always just show, show users the ads that you know they are most likely to click on, to drive short-term revenue: just show people the ads they're most likely to click on, to maximize short-term revenue. Whereas an exploration policy, for, you know, one of these large online ad platforms, is to show people some ads that may not be what we think they're most likely to click on at this moment in time; but by showing you that ad, or by showing the pool of users an ad that they might be less likely to click on, maybe we'll learn more about your interests, and that increases the effectiveness of these ad platforms at finding more relevant ads. [01:12:35] For example, I don't know, I guess, uh, they probably do know about my appetite for Mars landers by now, but if the large online ad platforms didn't know that I'm actually pretty interested in Mars landers, and one showed me an ad for a Mars lander, which, I don't think such a thing exists for sale, right, and I clicked on it, it would learn that showing me ads for Mars landers is a great thing, right, or ads for some other thing I had no known interest in. So this is actually a real problem, and some of the large online ad platforms actually do explicitly consider exploration versus exploitation, and make sure that sometimes they show ads that may not be the most likely for you to click on, but that, you know, allow them to gather information
[01:13:16] know, allows you to gather information to then be better situated to figure out where the future rewards are, to be better positioned to learn how to maximize them. And it's not just you, but other users like you. Sorry, okay. But so, in order to make sure the reinforcement learning algorithm explores as well as exploits, a common modification to this would be: instead of always taking actions with respect to pi, you may have a 0.9 chance of acting with respect to pi and a 0.1 chance of taking an action randomly. Okay? [01:14:04] And so this particular exploration policy is called epsilon-greedy, where on every time step you toss a biased coin: let's say with 90% chance you execute whatever you think is the current best policy, and with 10% chance you just take a random action. And this type of exploration policy increases the odds that, you know, every now and then, maybe just by chance, it'll find its way to the +10, [01:14:37] learning the state transition probabilities, and then eventually end up exploring the state space more thoroughly. Okay, this is called epsilon-greedy exploration, and it's a little bit of a misnomer, I think. The way we think of epsilon: epsilon, say 0.1, is the chance of taking a random action instead of the greedy action. This algorithm has always been a little bit strangely named, because 0.1 is actually the chance of your acting randomly. So "epsilon-greedy" sounds like you're being greedy epsilon of the time, but you're actually taking actions randomly epsilon of the time; so epsilon-greedy should maybe actually be "one-minus-epsilon greedy". This name has always been a little bit off, but that's how people use the term: epsilon-greedy exploration means that epsilon of the time, [01:15:30] where epsilon is a hyperparameter of the algorithm, you act randomly instead of doing what you think is the best policy. Okay. [01:15:37] And it turns out that if you implement this algorithm with epsilon-greedy exploration, then the algorithm will converge to the optimal policy for any discrete-state MDP. Sometimes it can take a long time, because, you know, if it takes a long time to randomly find the +10, it could take a long time before it randomly stumbles upon the +10. But this algorithm, with an exploration policy, will converge to the optimal policy for any MDP. [01:16:14] Oh yeah, yes? [Student: Do you always keep epsilon constant, or do you decay epsilon?] So yes, there are many heuristics for how to explore. One reasonable thing to do would be to start with a large value of epsilon and slowly shrink it.
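The epsilon-greedy rule described above can be sketched in a few lines. This is a minimal sketch, assuming action values are stored in a Q-table indexed by [state, action]; in the lecture the greedy choice is simply "whatever you think is the current best policy":

```python
import numpy as np

def epsilon_greedy_action(Q, state, epsilon=0.1, rng=None):
    """With probability epsilon take a uniformly random action (explore);
    otherwise take the action that looks best under Q (exploit)."""
    rng = rng or np.random.default_rng()
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore
    return int(np.argmax(Q[state]))           # exploit
```

The decay heuristic from the answer above (start with a large epsilon, slowly shrink it) just means passing a smaller epsilon on later time steps.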
[01:16:40] Another common heuristic would be, um, a different type of exploration called Boltzmann exploration, which you can look up if you want. The idea is: if you think that the value of going one way is, you know, 10, and the value of going the other way is 1, then there's such a huge difference that you should bias your actions toward the bigger reward. And then you could have the probability of an action be e to the value, basically, times a scaling factor. [01:17:10] So that's called Boltzmann exploration, where instead of having a 10% chance of taking an action completely at random, you have a very strong bias toward heading to the higher values, but also some probability of going to the lower values, where the exact probability depends on the different values. So between the two, I think epsilon-greedy, I feel like I see this used the most often for these types of MDPs, and then Boltzmann exploration, which is why I just mentioned it as well. Let's just take two more questions and then wrap up. [01:17:45] [Student: Could you give a reward of one for reaching a state it has never seen before?] Yes, there's a fascinating line of research called intrinsic reinforcement learning. If you Google for "intrinsic motivation" you'll find some research papers on it, and then there was some recent follow-on work, I think by DeepMind or some of those groups. But intrinsic motivation is the term to Google, where you reward the reinforcement learning algorithm for finding new things about the world. Oh, I see, right. [01:18:27] [Student: How many actions should you take before updating pi?] Um, try to do it as frequently as possible. If you're doing this with a real robot, what I've seen is that this sometimes means going to a
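The Boltzmann exploration rule just described ("probability proportional to e to the value, with a scaling factor") can be sketched as follows. The temperature parameter and the per-action value estimates here are illustrative assumptions, not values from the lecture:

```python
import numpy as np

def boltzmann_action(values, temperature=1.0, rng=None):
    """Sample an action with probability proportional to exp(value / temperature):
    strongly biased toward high-value actions, but every action keeps some mass."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(values, dtype=float) / temperature
    logits -= logits.max()                 # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(rng.choice(len(probs), p=probs))
```

With values like [10, 1] and a moderate temperature, almost all samples pick the first action, but the second still has nonzero probability, unlike a purely greedy rule.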
[01:18:42] physical robot. And so, you know, with one of my helicopters, you'd go out to the field for a day, collect all the data, and then go back to the lab in the evening and rerun the algorithms. But if there's no barrier to running this all the time, then it doesn't hurt performance; just run it as frequently as you can. All right, that's it for the basics of MDPs. On Wednesday we'll continue with generalizing all of this to continuous state. Okay, let's break; see you Wednesday.

================================================================================ LECTURE 018 ================================================================================

Lecture 18 - Continuous State MDP & Model Simulation | Stanford CS229: Machine Learning (Autumn 2018)
Source: https://www.youtube.com/watch?v=QFu5nuc-S0s

--- Transcript

[00:00:03] All right, hey everyone, welcome back. Um, so let's continue our discussion today of reinforcement learning and MDPs, and specifically what I hope you learn from today is how to apply reinforcement learning even to continuous-state or
infinite-state MDPs. So I'll talk about discretization, model-based RL, talk about models and simulation, and fitted value iteration, which is the main algorithm I want to lead up to for today. [00:00:38] Just a recap, because we're going to build on what we learned in the last two lectures; I want to make sure that you have the notation fresh in your mind. [00:00:47] An MDP was states, actions, transition probabilities, discount factor, reward; that was an example. V^pi was the value function for a policy pi, which is the expected payoff if you execute that policy starting from a state s, and V* was the optimal value function. And last time we figured out that if you know what V* is, then pi*, the optimal policy, or the optimal action for a given state, can be computed as the argmax of that. And one thing, though, that we'll come back to later is that an equivalent way of writing that formula is that this is the expectation
with respect to s' drawn from P_sa, of V*(s'); that is, pi*(s) = argmax_a E_{s' ~ P_sa}[V*(s')]. [00:01:39] So we have been working with discrete-state MDPs, with the eleven-state MDP, so this is a sum over all the states s'. But when we go to continuous-state MDPs, the generalization of this, what this becomes, is the expected value, with respect to s' drawn from the state transition probabilities indexed by s and a (the current state and current action), of the value that you attain in the future, V*(s'). [00:02:13] And we saw the value iteration algorithm; we also talked about value iteration and policy iteration, but today we'll build on value iteration. The value iteration algorithm uses Bellman's equation, which says: take the left-hand side, set it to the right-hand side. And for V*, if V were equal to V*, the left-hand side is equal to the right-hand side. That was, um, oh, I'm sorry, it's missing a max there, right. So if V were equal to V*, then the left-hand side and the right-hand side would be equal to each other. But what value iteration does is: it's an algorithm that initializes V(s) := 0 and repeatedly carries out this update until V converges to V*, and after that you can then compute pi*, or, for every state, find the optimal action. [00:03:05] Okay, so, because we're going to build on this notation and this set of ideas today, I just want to make sure all this makes sense. Any questions? [00:03:22] Okay, all right. So everything we've done so far was built on the MDP having a finite set of states; the eleven-state MDP was a discrete set of states. Um, last time, on Monday I think, someone asked how you handle continuous states. So we'll work on that today. But let's say you want to
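The value iteration update just recapped, V(s) := R(s) + gamma * max_a sum_{s'} P_sa(s') V(s'), can be sketched for a small discrete MDP as below. This is a minimal sketch with a made-up two-state MDP, not the lecture's eleven-state grid world:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, n_iters=500):
    """P[a][s, s'] = transition probability P_sa(s'); R[s] = reward at state s.
    Repeats the Bellman backup V(s) := R(s) + gamma * max_a sum_s' P_sa(s') V(s')."""
    V = np.zeros(len(R))
    for _ in range(n_iters):
        V = R + gamma * np.max([P[a] @ V for a in range(len(P))], axis=0)
    return V

def optimal_action(P, V, s):
    """pi*(s) = argmax_a E_{s' ~ P_sa}[V(s')], the formula recapped above."""
    return int(np.argmax([P[a][s] @ V for a in range(len(P))]))

# Tiny example: action 0 stays put, action 1 jumps to state 1 (reward 1).
P = [np.eye(2), np.array([[0.0, 1.0], [0.0, 1.0]])]
R = np.array([0.0, 1.0])
V = value_iteration(P, R, gamma=0.5)   # converges to V* = [1.0, 2.0]
```

Here `optimal_action(P, V, 0)` returns action 1: from state 0 it is best to jump toward the rewarding state.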
build a car, right. Let's say you want to build a car, maybe a self-driving car. [00:03:52] The state space of a car is, let's see. Um, instead of taking my artistic side view of the car, if you take a top-down view of a car, all right, so this is from the satellite imagery, you know, top down, here's the car with its wheels drawn this way. How do you model the state of a car? [00:04:16] Well, um, the way to model the state of a car that's driving around the planet Earth is that you need to know the position, and so that can be represented as (x, y), two numbers to represent, you know, roughly your latitude and longitude or something, right. You probably want to know the orientation of the car, theta, maybe measured relative to north: what's the orientation of the car. And then it turns out, if you're driving at very low speeds this is fine, but if you're driving at anything other than very [00:04:53] low speeds, then we'll often include in the state space also the velocities and angular velocity. So x-dot is the velocity in the x direction, so x-dot is dx/dt, right, the velocity in the x direction; y-dot is the velocity in the y direction; and theta-dot is the angular velocity, the rate at which your car is turning. [00:05:14] Okay, and it's sort of, um, up to you how you want to model the car. Is it important to model the current angle of the steering wheel? Is it important to model how worn down your front-left tire is, as opposed to how worn down your rear-right tire is? So depending on the application you are building, it's up to you to decide what is the state space you want to use to model this car. And I guess if you're building a car to race on a racetrack, maybe it is important to model what the temperature of the engine is, and how worn down each of
your four tires is, separately. But for a lot of normal driving, this would be, you know, a sufficient level of detail to model the state space. So this is a six-dimensional state space representation. [00:06:08] Oh, and for those of you that work in robotics, the position-and-orientation part would be called the kinematic model of the car, and it becomes a dynamics model of the car if you model the velocities as well. Um, let's see, how about a helicopter? [00:06:27] All right, how do you model the state of a helicopter? A helicopter flies around in 3D rather than driving around in 2D, and so a common way to model the state of a helicopter would be to model it as having a position (x, y, z), and then also a 3D orientation. The orientation of a helicopter is usually modeled with three numbers, which we sometimes call the roll, pitch, and yaw. Right, so, you know, if you're ever in an airplane: roll is, are you rolling to the left [00:06:56] or right; pitch is, are you pitching up or down; and yaw is, are you facing north, south, east, or west. So this is one way to turn the three-dimensional orientation of an object like an airplane or helicopter into three numbers. [00:07:10] So the details aren't important; if you actually work on a helicopter you can figure this out, but for today's purposes just write, I guess, the roll phi, pitch theta, and yaw psi to represent the orientation: a three-dimensional object flying around is conventionally represented with three numbers such as roll, pitch, and yaw. And then also x-dot, y-dot, z-dot and phi-dot, theta-dot, psi-dot, the linear velocity and the angular velocity. Okay. [00:07:47] Maybe just one last example. So it turns out, in reinforcement learning, maybe in the early, early history of reinforcement learning, one of the problems that a lot of people just happened to work on, and that you therefore see in a lot of
reinforcement learning textbooks, is something called the inverted pendulum problem. [00:08:06] What that is, is a little toy: there's a little cart that's on wheels, it's on a track, and you have a little pole that is attached to this cart, and there's a free swivel there. [00:08:27] And so this pole just flops over; this pole just swings freely, and there's no motor, there's no motor at this little hinge there. And so the inverted pendulum problem is, let's see if I've drawn this right: if you have a free pole, and this is your cart moving left and right, can you, with that swivel, kind of balance the pole? [00:09:02] And so one of the common textbook examples of reinforcement learning is, can you choose actions over time to move this left and right so as to keep the pole oriented upright. And so for a problem like this, if you [00:09:17] have a linear rail, just one-dimensional, you know, like a real railway track that this cart is on, the state space would be: x, which is the position of the cart; theta, which is the orientation of the pole; as well as x-dot and theta-dot. [00:09:38] So this would be a four-dimensional state space for the inverted pendulum, if it's, like, running left and right on a railway track, a one-dimensional railway track, right. Um, [00:09:52] so for all of these problems, if you want to build, you know, a self-driving car and have it do something, or build an autonomous helicopter and have it either hover or fly a trajectory, or keep the pole upright in the inverted pendulum, these are examples of robotics problems where you would model the state space as a continuous state space. So what I want to do today is focus on problems where the state space is R^n, an n-dimensional set of real numbers.
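For concreteness, the three state spaces just described can be written out as plain vectors. A sketch; the component ordering follows the lecture, but the example numbers are made up:

```python
import numpy as np

# Inverted pendulum: 4-D state (x, theta, x_dot, theta_dot)
pendulum_state = np.array([0.0, 0.05, 0.0, -0.1])

# Car: 6-D state (x, y, theta, x_dot, y_dot, theta_dot)
car_state = np.array([12.0, 3.5, 1.6, 4.0, 0.1, 0.0])

# Helicopter: 12-D state
# (x, y, z, phi, theta, psi, x_dot, y_dot, z_dot, phi_dot, theta_dot, psi_dot)
helicopter_state = np.zeros(12)
```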
In these examples, I guess, n would be four, or six, or twelve, right. [00:10:28] Oh, and again, for the mathematicians in this class: technically angles are not real numbers, because they wrap around; we go to 360 and then wrap around to zero. But I think for the purposes of today that's not important, so we'll just treat this as R^n. [00:10:57] So, um, the most straightforward way to work with a continuous state space is discretization, where, you know, you might have in this example a two-dimensional state space, maybe x and theta for the inverted pendulum, and then you just lay down a set of grid values, right, and discretize it back to a discrete-state problem. And so, you know, you can give the states a set of names, one, two, three, four, whatever, and anywhere within that little square you just pretend that your MDP, that your robot, is in state number one. So this takes a continuous-state problem and turns it back into a discrete-state problem. Um, [00:11:48] this is such a simple, straightforward way to do it that it's actually reasonable for small problems, and if you have a relatively small, low-dimensional state-space MDP, like the inverted pendulum problem, where you're four-dimensional, it's actually perfectly fine to discretize the state space and solve it this way. [00:12:05] Let me describe some disadvantages of discretization first, and then I'll say a little bit about when you should just use discretization, because even though it's not the best algorithm, it works fine for smaller problems; but for bigger problems we'll have to go to more sophisticated algorithms like fitted value iteration. Okay. But, um, so what are the problems with discretization?
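The grid discretization just described can be sketched as a function mapping a continuous state to a numbered cell. A minimal sketch, assuming a uniform grid over a bounding box; the bounds and resolution are illustrative:

```python
import numpy as np

def discretize(state, lows, highs, bins_per_dim):
    """Map a continuous state inside the box [lows, highs] to one discrete
    state number, using bins_per_dim equal-width buckets per dimension."""
    state, lows, highs = np.asarray(state), np.asarray(lows), np.asarray(highs)
    # which bucket along each dimension (clipped so boundary states stay in-grid)
    idx = ((state - lows) / (highs - lows) * bins_per_dim).astype(int)
    idx = np.clip(idx, 0, bins_per_dim - 1)
    # flatten the per-dimension bucket indices into a single state name
    return int(np.ravel_multi_index(idx, (bins_per_dim,) * len(idx)))
```

For a 2-D (x, theta) state with 10 buckets per dimension this yields 100 named states; note that the count is bins_per_dim**n, which is exactly the curse of dimensionality the lecture turns to.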
star and PI star right which is you know remember the [00:12:59] right which is you know remember the very first problem we talked about of [00:13:02] very first problem we talked about of predicting housing prices [00:13:05] predicting housing prices imagine if X was the size of a house and [00:13:10] imagine if X was the size of a house and vertical axis was the price of a house [00:13:13] vertical axis was the price of a house and you had a data set that look like [00:13:15] and you had a data set that look like this discritization is that the [00:13:21] this discritization is that the discretization equivalent of trying to [00:13:23] discretization equivalent of trying to for the function of this data would be [00:13:25] for the function of this data would be to look at the input feature and you [00:13:29] to look at the input feature and you know let's discretize it into five [00:13:32] know let's discretize it into five values and for each of these little [00:13:34] values and for each of these little buckets in each of these five intervals [00:13:36] buckets in each of these five intervals let's fit a constant function right [00:13:40] let's fit a constant function right something like that so this staircase [00:13:44] something like that so this staircase would be how you know descritization [00:13:47] would be how you know descritization will represent the price of a house as a [00:13:49] will represent the price of a house as a function of the size and the analogy is [00:13:55] function of the size and the analogy is that what we're doing in reinforcement [00:13:57] that what we're doing in reinforcement learning is you want to approximate the [00:13:59] learning is you want to approximate the value function and if you were to [00:14:01] value function and if you were to discretize it then on the x axis is [00:14:04] discretize it then on the x axis is maybe the state and now I'm down to one [00:14:07] maybe the state and now I'm down to one dimensional 
state right because that's [00:14:08] dimensional state right because that's what I can plot and you're saying that [00:14:10] what I can plot and you're saying that well let's approximate the value [00:14:12] well let's approximate the value function you know as a as a staircase [00:14:16] function you know as a as a staircase function as a function of the set of [00:14:18] function as a function of the set of states right and you know and this is [00:14:20] states right and you know and this is not terrible if you have a lot of data [00:14:21] not terrible if you have a lot of data and very few input features you can get [00:14:23] and very few input features you can get away with this this will work okay but [00:14:25] away with this this will work okay but this doesn't it doesn't seem to allow [00:14:29] this doesn't it doesn't seem to allow you to fit a smooth function right so [00:14:31] you to fit a smooth function right so that's one downside so it's not a very [00:14:34] that's one downside so it's not a very good representation and the second [00:14:37] good representation and the second downside is the [00:14:46] right someone fancifully named curse of [00:14:49] right someone fancifully named curse of dimensionality which is Richard bellman [00:14:53] dimensionality which is Richard bellman had given this name as a cool sounding [00:14:56] had given this name as a cool sounding name but what it means is that if the [00:14:59] name but what it means is that if the state spaces in RN and disparate eyes [00:15:05] you know each dimension into K values [00:15:14] you know each dimension into K values then you get paid to the end discrete [00:15:19] then you get paid to the end discrete states so if this critize position and [00:15:26] states so if this critize position and orientation into ten values which is [00:15:28] orientation into ten values which is quite small then you end up with you [00:15:31] quite small then you end up with you know ten to ten 
[00:14:37] And the second downside is the fancifully named curse of dimensionality. Richard Bellman gave it this name as a cool-sounding name, but what it means is that if the state space is R^n and you discretize each dimension into k values, then you get k^n discrete states. So if you discretize position and orientation into ten values each, which is quite small, then you end up with, you know, 10^10 states, which grows exponentially in the dimension n of the state space. [00:15:37] So discretization works fine if you have relatively low-dimensional problems: two dimensions, no problem; four dimensions, maybe okay. But for very high-dimensional state spaces, this is not a good representation. And it turns out, to take a slight aside from continuous state spaces, that the curse of dimensionality also applies to very large discrete-state MDPs. [00:16:03] So for example, one of the places people have applied reinforcement learning is in factory optimization. If you have a factory with a hundred machines, and every machine in the factory is doing something slightly different, and each machine can be in k different states, then the total number of states of your factory is k to the power of 100, right? So the curse of dimensionality also applies to very large discrete state spaces, such as a factory with a hundred machines, where your total state space becomes k^100. And it turns out that for this type of discrete state space, fitted value iteration can be a much better algorithm as well; we'll get to fitted value iteration in a little bit, okay? [00:17:00] So, some practical guidelines. Now, despite all this criticism of discretization, if you have a small state space it's a simple method to apply, and if your problem is very small, go ahead and discretize; it can be one of the quick things to try to just get something working. So let me share with you some guidelines. This is how I do it, I guess. If you have a two-dimensional or three-dimensional state space, no problem, just discretize; usually, for a lot of problems, it's just fine.
[00:17:47] If you have maybe a four- to six-dimensional state space, I would think about it, but it will still often work. So for the inverted pendulum, which is a four-dimensional state space, it works just fine. I've had some friends work on trying to ride a bicycle, which you can model with a six-dimensional state space, and discretization kind of works; it works if you put some work into it. One of the tricks you want to use as you approach the four- to six-dimensional range is to choose your discretization more carefully. [00:18:23] So for example, if the state s2 is really important, so you think the actions you need to take, or the value, or the performance is really sensitive to state s2 and less sensitive to state s1, then in this range people end up designing an unequal discretization, where you might discretize s2 much more finely than s1, right? And the reason you do that is that the number of discrete states is blowing up exponentially, something to the power of the number of dimensions, and these tricks allow you to reduce a little bit the number of discrete states you end up with. [00:18:57] I think if you have a seven- or eight-dimensional problem, that's pushing it; that's when I would start to be nervous and be increasingly inclined not to use discretization. I personally rarely use discretization for problems that are 8-dimensional, and when your problem is even higher-dimensional than this, like 9, 10 and higher, then I would very seriously consider an algorithm that does not discretize. It's very, very rare to use discretization for problems as high as even 7 or 8 dimensions.
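The unequal-discretization trick can be sketched as follows. This is a hypothetical example (the ranges and bucket counts are invented): s2 is assumed to matter more, so it gets 16 bins while s1 gets only 4, for 64 cells total instead of the 256 that a uniformly fine 16-by-16 grid would cost.

```python
import numpy as np

# Assume (for illustration) both state variables live in [-1, 1], and the
# value is much more sensitive to s2 than to s1: discretize s2 finely.
N1, N2 = 4, 16
s1_edges = np.linspace(-1.0, 1.0, N1 + 1)
s2_edges = np.linspace(-1.0, 1.0, N2 + 1)

def state_to_cell(s1, s2):
    """Map a continuous state (s1, s2) to a flat discrete cell index."""
    i = np.clip(np.digitize(s1, s1_edges) - 1, 0, N1 - 1)
    j = np.clip(np.digitize(s2, s2_edges) - 1, 0, N2 - 1)
    return int(i * N2 + j)   # one of N1 * N2 = 64 cells

cell = state_to_cell(0.3, -0.7)
```

With tabular value iteration, the value table then has 64 entries instead of 256, and the savings compound across every extra dimension you can afford to keep coarse.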
[00:19:31] I've seen it done on rare occasions, but these things get worse exponentially with the number of dimensions. So maybe that's a set of guidelines for when to use discretization and when to seriously consider doing something else. All right. So in the alternative approach that you'll see today, what you'll be able to do is approximate V* directly, [00:20:08] without resorting to discretization. And there's an analogy that we'll make later, alluding to this plot again: an analogy between linear regression, where you're trying to approximate y as a function of x, and value iteration, where you're trying to learn an approximation of V as a function of s. Which is that in linear regression you say, let's approximate y as a linear function of x; or, if you don't want to use the raw features x, what you can do is use, you know, theta transpose... oh, I'm sorry, theta transpose phi of x, right, where phi of x is the features of x. [00:21:19] So this is what linear regression does, where if x is your house, then maybe phi of x is equal to, you know, x1, x2, x1 squared, x1 times x2, and so on, right? So that's how you can use linear regression to approximate the price of a house, either as a function of the raw features or as a function of some slightly more sophisticated, more complex features of the house. [00:21:48] And what you'll see in fitted value iteration is a model where we will approximate V*(s) as a linear function of features of the state. Okay, so that's the algorithm we'll build up to, and, yeah, we're going to try to use linear regression, with a lot of modifications, to approximate the value function.
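As a concrete, made-up instance of the theta-transpose-phi-of-x idea: the feature map below is the [x1, x2, x1 squared, x1 times x2] example from the board plus an intercept term, and the data are synthetic so the fit can be checked.

```python
import numpy as np

def phi(x):
    """Hand-designed features of the raw input x = (x1, x2):
    intercept, x1, x2, x1^2, x1*x2 (the feature map from the board)."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 ** 2, x1 * x2])

rng = np.random.default_rng(0)
X_raw = rng.uniform(-1, 1, size=(50, 2))
true_theta = np.array([3.0, 1.0, -2.0, 0.5, 0.25])

Phi = np.stack([phi(x) for x in X_raw])   # 50 x 5 design matrix
y = Phi @ true_theta                      # noise-free targets

# ordinary least squares: fit theta in  y ~ theta^T phi(x)
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
```

Fitted value iteration will reuse exactly this mechanic, except the inputs are states s, the features are phi(s), and the regression targets come from the Bellman backup estimates of V(s).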
[00:22:25] And again, in reinforcement learning and value iteration, your goal is to find a good approximation to the value function, because once you have that, you can then use, you know, the equation we had earlier to compute the optimal action for every state, right? So we'll just focus on computing the value function. [00:22:43] Now, in order to derive the fitted value iteration algorithm: it turns out that fitted value iteration works best with a model, or a simulator, of the MDP. So let me describe what that means and how you get a model, and then we'll talk about how you can actually implement the fitted value iteration algorithm and have it work on these types of problems, okay? [00:23:18] All right. So what a model, or a simulator, of your robot is, is just a function that takes as input a state, takes as input an action, and outputs the next state s' drawn from the state transition probabilities, okay? [00:23:58] And the way the model is built: the state is just a real-valued vector, okay? Oh, and I think for simplicity, for now let's assume that the action space is discrete. It turns out that for a lot of MDPs the state space can be very high-dimensional and the action space is much lower-dimensional than the state space. So for example, for a car, you know, s is six-dimensional, but the space of actions is just two-dimensional, right: the steering and braking. It turns out for a helicopter, you know, the state space is twelve-dimensional, and (I guess I wouldn't expect most of you to know how a helicopter flies) it turns out that you have four-dimensional actions for the helicopter: the way you fly one of these is with two control sticks, so your left hand and your right hand each have two dimensions of control. [00:25:06] And for the inverted pendulum, the state space is 4-D and the action space is just 1-D, right, you move left or right. So you actually see in a lot of reinforcement learning problems that it's quite common for the state space to be much higher-dimensional than the action space. And so let's say for now that we do not want to discretize the state space, because it's very high-dimensional, but just for the sake of simplicity let's say we discretize the action space for now, right, which is usually much easier to do. But I think as we develop fitted value iteration, you might get hints of when maybe you don't need to discretize the action space either. But let's just say we have a discrete action space. [00:26:10] So, all right, how do you get a model? One way to build a model is to use a physics simulator.
[00:26:31] So, you know, in the case of an inverted pendulum, the action is the acceleration you apply, either positive or negative, to the cart along the x-axis, right? And the state space is four-dimensional, right. And it turns out that if you sort of flip open a physics textbook and use Newtonian mechanics, if you know the mass of the cart (actually, I think that says the mass of the cart and the mass of the pole) and the length of the pole, it turns out you can derive equations for, say, theta double dot, the angular acceleration, right? Don't worry about the details; think of this as a physics derivation rather than something you need to learn, where, you know, l is the length of the pole, m is the mass of the pole, M is the mass of the cart, a is the force exerted, [00:27:33] and so on. And a conventional physics textbook will kind of let you derive these equations. Or, rather than trying to derive these yourself using Newtonian mechanics, or enlisting the help of a physicist friend, there are also a lot of open-source physics simulator software packages: you can download an open-source simulator, plug in the dimensions and masses and so on of your system, and it'll spit out a simulator that tells you how the state evolves from one time step to another, right? [00:28:06] And so in this example, the simulator will say that s' is equal to s plus delta t times s dot, where delta t could be, let's say, 0.1 seconds, right? So if you want to simulate this at 10 Hz, that's 10 updates per second, so that the time difference between the current state and the next state is one tenth of a second.
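That update is just a forward Euler step. Here is a generic sketch; the state layout and the derivative values are placeholders, since the real s-dot would come from the cart-pole equations or a physics engine.

```python
import numpy as np

DT = 0.1  # simulate at 10 Hz: one tenth of a second per step

def simulator_step(s, s_dot):
    """One forward Euler update, s' = s + dt * s_dot.  s_dot is whatever
    derivative the physics (textbook equations or a physics engine)
    reports for the current state and action."""
    return s + DT * s_dot

s = np.array([0.0, 0.0, 0.1, 0.0])       # e.g. [x, x_dot, theta, theta_dot]
s_dot = np.array([0.0, 0.5, 0.0, -0.2])  # placeholder derivative
s_next = simulator_step(s, s_dot)
```

Looping this step while feeding each s_next back in, with s_dot recomputed from the current state and action, is the whole simulator.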
[00:28:40] Then you write a simulator like this, okay? But really, the most common way to do this is not to actually derive the physics update equations; the most common way to do this is to just download one of the open-source physics engines, right? So this will work okay for problems like the inverted pendulum. I once used a physics engine to build a simulator for a four-legged robot, and managed to use reinforcement learning together with it to get the robot to walk around, right? So it works. [00:29:21] The second way to get a model is to learn it from data, right, and I'd say people end up using this much more often. So here's what I mean. Let's say you want to build a controller for an autonomous helicopter, right? So this is a case study, and what I'm describing is real; like, this will actually work. So let's say you have a helicopter and you want to build an autonomous controller for it. What you can do is start your helicopter off in some state s0, right; so with GPS, accelerometers, and a magnetic compass, you can just measure the position and orientation of the helicopter. And then have a human pilot fly the helicopter around. So the human pilot, you know, using the control sticks, will command the helicopter with some action a0, and then a tenth of a second later the helicopter will get to some slightly different position and orientation, s1. And then the human pilot will just keep on moving the control sticks, and so you record down what action they're taking, a1, and based on that the helicopter will get to some new state s2; and then they'll take some action a2 and get to some state s3, and so on, up to some final time, which let me just write as capital T, right? [00:30:59] So in other words, what you do is take the helicopter out to the field and hire a human pilot to fly this thing for a while, and record the position of the helicopter ten times a second, and also record all the actions the human pilot was taking on the control sticks, okay? And then do this not just one time, but do this M times. So let me use a superscript (1) (you get the idea) to denote the first trajectory; so you do this a second time, and so on, and maybe do this M times. [00:31:43] So that's just a lot of math for saying: fly the helicopter around, you know, M times, and record everything that happened. And now your goal is to apply supervised learning, right, to estimate s_{t+1} as a function of s_t and a_t. So the job of the model, the job of the simulator, is to take as input the current state and the current action and tell you where the helicopter is going to go, you know, like, 0.1 seconds later.
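The data collection just described amounts to logging (state, action) pairs at 10 Hz over M flights. A toy sketch follows; the "pilot" and the dynamics here are random stand-ins, purely to show the shape of the data.

```python
import numpy as np

rng = np.random.default_rng(0)

def record_trajectories(M, T, n_s=4, n_a=2):
    """Record M trajectories: at each of T steps, log the state s_t and
    the action a_t, yielding the (s, a) sequences from the lecture."""
    A_true = 0.9 * np.eye(n_s)                 # stand-in dynamics
    B_true = 0.1 * rng.normal(size=(n_s, n_a))
    trajectories = []
    for _ in range(M):
        s = rng.normal(size=n_s)               # initial state s_0
        states, actions = [s], []
        for _ in range(T):
            a = rng.normal(size=n_a)           # stand-in for pilot input
            s = A_true @ s + B_true @ a        # next state
            actions.append(a)
            states.append(s)
        trajectories.append((np.array(states), np.array(actions)))
    return trajectories

trajs = record_trajectories(M=5, T=50)  # 5 flights, 50 steps each
```

Each trajectory i then contributes T training pairs, mapping (s_t, a_t) to s_{t+1}, for the supervised learning step.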
[00:32:44] And so, given all this data, what you can do is apply a supervised learning algorithm to predict, well, what is the next state s' as a function of the current state and action, right? And the mapping between the notations is that when I drew the box for the simulator above, I was using s' to denote s_{t+1} and s to denote s_t, right? [00:33:10] And so if you use the linear regression version of this idea, you would say: let's approximate s_{t+1} as a linear function A s_t of the previous state, plus another linear function B a_t of the previous action. And it turns out this actually works okay: for helicopters flying at slow speeds, this is actually not a terrible model. If your helicopter is moving slowly and not flying upside down, if your helicopter is flying in a relatively level way and kind of at slow speeds, this model is not too bad. If you fly your helicopter in highly dynamic situations, flying very fast, making very fast aggressive turns, this is not a great model, but it's okay at slow speeds. [00:34:06] And so I guess A here will be an n-by-n matrix, because the state space is n-dimensional, so A is a square matrix, and B will usually be a tall skinny matrix, where the dimension of B is the dimension of the state space by the dimension of the action space, right? And so in order to fit the parameters A and B, you would minimize, with respect to A and B, the sum over the trajectories i and the time steps t of the squared norm of s_{t+1}^{(i)} minus (A s_t^{(i)} + B a_t^{(i)}). [00:35:09] So you want to approximate s_{t+1} as a function of s_t and a_t, and so, you know, it's pretty natural to fit the parameters of this linear model in a way that minimizes the squared difference between the left-hand side and the right-hand side. Wait, did I screw up? Okay, oh, sure. [00:35:37] What's the difference between flying the helicopter M times versus flying the helicopter once for a very long time?
[00:35:44] ...in this example it makes no difference — it's fine either way. For the purposes of this class it doesn't matter. For practical purposes: if you fly the helicopter m times, it turns out the fuel burns down slowly, and so the way the helicopter flies changes slowly, and you'd want to average over how much fuel you have, over wind conditions — that's what's actually done. But for the purposes of understanding, flying a single time for a long time is just fine as well. Okay.

[00:36:22] So this is the linear regression version of this, and when we talk about some other models later, called LQR and LQG, you'll see this linear regression version of the model as well — this is just a linear model of the dynamics. We'll come back to linear dynamical models next week. But it turns out that if you want to use a nonlinear model, you can also plug in phi(s) — and maybe phi(a) as well — if you want a nonlinear model, and this will work even better depending on your choice of features. Okay.

[00:37:11] Now, finally, having run this little linear regression thing — and this is not quite linear regression, because A and B are matrices, but you can minimize this objective — it turns out this is equivalent to running linear regression n times. So if s has, say, 12 dimensions, this turns out to be equivalent to running linear regression n times: to predict the first state variable, the second state variable, the third state variable, and so on. That's what this is equivalent to. But having done this, you now have a choice of two possible models. One model would be to just set s_{t+1} = A s_t + B a_t.
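As a rough sketch of the step just described — all names here are mine, not the lecture's — here is one way to fit A and B by least squares from a logged trajectory. Stacking [s_t; a_t] into one design matrix solves the whole matrix problem at once, which is exactly equivalent to running n separate linear regressions, one per state dimension:

```python
import numpy as np

def fit_linear_dynamics(states, actions):
    """Fit s_{t+1} ~ A s_t + B a_t by least squares.

    states:  (T+1, n) array — a trajectory of visited states
    actions: (T, d)   array — the action taken at each step
    """
    X = np.hstack([states[:-1], actions])   # (T, n+d): inputs [s_t, a_t]
    Y = states[1:]                          # (T, n):   targets s_{t+1}
    # Solve min_W ||X W - Y||^2; each column of W is one of the
    # n independent linear regressions mentioned in the lecture.
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    n = states.shape[1]
    A = W[:n].T                             # (n, n)
    B = W[n:].T                             # (n, d)
    return A, B
```

On noise-free data generated by a true linear system, this recovers A and B exactly; on real flight logs it gives the least-squares fit.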
[00:37:55] Or another version would be to set s_{t+1} = A s_t + B a_t + epsilon_t, where epsilon_t is distributed, maybe, from a Gaussian — from a Gaussian density. Okay. And so this first model would be a deterministic model, and this one would be a stochastic model. And if you use a stochastic model, that's saying that when you're running your simulator — when you're running the model — every time you generate s_{t+1}, you'd be sampling this epsilon from a Gaussian and adding it to the prediction of your linear model. And if you use a stochastic model, what that means is that if you simulate a helicopter flying around, your simulator will generate random noise that adds and subtracts a little bit to the state of the helicopter, as if there were little wind gusts blowing the helicopter around. Okay.
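A minimal sketch of the two choices — deterministic versus stochastic simulation. The lecture leaves the noise distribution unspecified beyond "Gaussian", so the isotropic `noise_std` parameter and the function name below are my own illustrative assumptions:

```python
import numpy as np

def simulate_step(A, B, s, a, noise_std=0.0, rng=None):
    """One simulator step under s_{t+1} = A s_t + B a_t (+ epsilon_t).

    noise_std == 0 gives the deterministic model; noise_std > 0 adds
    Gaussian "wind gust" noise epsilon_t ~ N(0, noise_std^2 I).
    """
    if rng is None:
        rng = np.random.default_rng()
    eps = rng.normal(0.0, noise_std, size=s.shape)
    return A @ s + B @ a + eps
```

Rolling the simulator forward just means calling this in a loop, feeding each output state back in as the next input.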
[00:39:38] In most cases when you're building reinforcement learning models — so the approach we're taking here is called model-based reinforcement learning, where you build a model of your robot, then train the reinforcement learning algorithm in the simulator, and then take the policy you learned — take the policy pi you learned in simulation — and apply it back on your real robot. All right, so this approach we're taking is called model-based RL. There is an alternative called model-free RL, which is where you just run your reinforcement learning algorithm on the robot directly — let the robot drive itself around, and so on — and learn that way. I think that, in terms of robotics applications, model-based RL has been taking off faster; a lot of the most promising approaches are model-based RL, because with a physical robot, you know, you just can't afford to have a reinforcement learning algorithm drive your robot around for too long — or how many helicopters do you want to crash before you learn to fly?

[00:40:41] Model-free RL works fine if you want to play video games, because if you're trying to get a computer to play chess or Othello or Go, you have a perfect simulator for the video game — which is the video game itself — and so your RL algorithm can blow up hundreds of millions of times in a video game, and that's fine. So for playing video games, or playing, you know, traditional games, model-free approaches can work fine. But a lot of the successful applications of reinforcement learning to robots have been model-based — although, again, the field is evolving quickly, so there's very interesting work at the intersection of model-based and model-free that gets more complicated. But I would say, if you want to use something tried-and-true for robotics problems, seriously consider using model-based RL, because you can then fly a helicopter in simulation, let it crash a million times, and no one's hurt — there's no physical damage anywhere in the world; it's just fine.

[00:41:46] And just one last tip — one of the things we learned building these reinforcement learning algorithms for a lot of robots. Having built this model, you might ask: well, how do I choose the distribution for this noise? How do you model the distribution of the noise? One thing you could do is estimate it from data. But as a practical matter, what happens is, so long as you remember to inject noise — so, let's see, it turns out that if you use a deterministic simulator, a lot of reinforcement learning algorithms will learn a very brittle policy that works in your simulator but doesn't
actually work when you put it on your real robot. [00:42:29] And so if you look on YouTube or Twitter over the last year or two, there have been a lot of cool-looking videos of people using reinforcement learning to control various weirdly configured robots — a snake robot, or some five-legged thing, or whatever crazy design. I don't know what has five legs, right — but if you build a five-legged robot, how do you control that? It turns out that if you have a deterministic simulator, using these methods it's not that hard to generate a cool-looking video of your reinforcement learning algorithm supposedly controlling a five-legged thing, or some crazy worm with two legs, or these crazy robots you can build in simulation. But it turns out that even though you can generate those types of videos in a deterministic simulator, if you use a deterministic model of the robot and you ever actually try to build a physical robot — if you take that policy from your physics simulator to the real robot — the odds of it working on the real robot are quite low if you used a deterministic simulator. Because the problem with simulators is that your simulator is never 100% accurate, right? It's always just a little bit off.

[00:43:46] And one of the lessons we learned — the whole field learned — applying RL to a lot of robots is that if you want your model-based RL to work, not just in simulation to generate a cool video, but to actually work on a physical robot, like a physical helicopter that you own, then it is really important to add some noise to your simulator. Because if the policy you learn is robust to a slightly stochastic simulator, then the odds of it generalizing, you know, to the real world — to the physical real world — are much higher than if you had a completely deterministic simulator. So whenever I'm building a robot — actually, with one exception, LQR and LQG, which we'll talk about next week — with one very narrow exception, I pretty much never use deterministic simulators when it comes to robotic control problems, assuming I want it to work in the real world as well. And again, you know, tips and tricks: the most important thing is to add some noise. And then, for the exact distribution of the noise — go ahead and try to pick something realistic, but the exact distribution of the noise matters less, I want to say, than just simply remembering to add some noise. Okay.

[00:45:20] By the way — you guys probably don't know
this, but my PhD thesis was on using reinforcement learning to fly helicopters. So you're talking to someone who's crashed a bunch of helicopters — model helicopters — and has lived through the pains and the joys of seeing this stuff work or not work.

[00:45:57] All right. So now that you have built a model — built a simulator — for your helicopter, or your four-legged robot, or your car, how do you approximate the value function? So, in order to apply fitted value iteration, the first step is to choose features phi(s) of the state s, and then we're going to approximate V*(s) using a function V(s) = theta^T phi(s). And so, you know, in the case of an inverted pendulum, phi(s) might contain x, x-dot, maybe x squared, or x times x-dot, or x times the pole orientation, and so on. So take the state s and think of some nonlinear features that you think might be useful for representing the value. And remember what the value is: the value of a state is your expected payoff from that state — your expected sum of discounted rewards. So the value function captures: if your robot starts off in this state, how well is it going to do from here? So when you're designing features, pick a bunch of features that you think help convey how well your robot is doing.

[00:47:48] And so maybe for the inverted pendulum, for example, if the pole is way over to the right, then maybe the pole will fall over — and we give it a reward of minus one when the pole falls over, right. But — sorry, I'm overloading notation a bit: theta is both the angle of the pole as well as the parameters — if the pole is falling way over, that looks like it's doing pretty badly, unless x-dot is very large and positive.
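For instance, a hand-designed feature vector for the inverted pendulum might look like the following sketch. The particular features are illustrative choices of mine, and I use `w` for the value-function weights to avoid the theta-as-angle collision the lecture mentions:

```python
import numpy as np

def phi(s):
    """Features of a cart-pole state s = (x, x_dot, ang, ang_dot),
    where x is cart position and ang is the pole angle."""
    x, x_dot, ang, ang_dot = s
    return np.array([
        1.0,                      # intercept term
        x, x_dot, ang, ang_dot,   # raw state variables
        x * x,                    # quadratic terms
        ang * ang,
        x * x_dot,                # position-velocity interaction
        ang * x_dot,              # pole angle times cart velocity
    ])

# The value is then approximated as V(s) = w @ phi(s) for learned weights w.
```

Any nonlinear combination you believe conveys "how well the robot is doing" is a legitimate candidate feature here.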
Right — and so maybe there's an interaction between theta and x-dot. You might say, well, let me have a new feature which is the angle of the pole multiplied by the velocity, because it seems like these two variables may depend on each other. So, just as when you're trying to predict the price of a house you'd ask, well, what are the most useful features for the price of a house — you do something similar for fitted value iteration.

[00:48:49] And one nice thing about model-based RL — one nice thing about model-based reinforcement learning — is that once you have built a model, you can collect an essentially infinite amount of data from your model, right. And so, with a lot of data, you can usually afford to choose a larger number of features, because you can generate a ton of data with which to fit this linear function. So you're usually not super constrained in terms of needing to be really careful not to choose too many features for fear of overfitting — you can get so much data from a simulator that you can usually make up quite a lot of features. And some of the features not being useful is okay, because you can get enough data from running your simulator for the algorithm to still find a pretty good set of parameters theta, even if you have a lot of features, since you can generate a lot of data to fit this function.

[00:49:50] So let's talk through the fitted value iteration algorithm. All right — you know, this is a long algorithm; let me just use a fresh board for this. [00:50:15] All right, so let me just write down the original value iteration algorithm for discrete states. So what we had previously was: we would update V(s) according
to V(s) := R(s) + gamma max_a sum over s' of P_sa(s') V(s'). So this is what we had last Monday, and I said at the start of today's lecture that you can also write this as V(s) := R(s) + gamma max_a E_{s' ~ P_sa}[ V(s') ].

[00:50:57] So let's take that and generalize it to fitted value iteration. [00:51:30] All right. So first, let's sample a set of states s^(1), ..., s^(m) randomly, and let's initialize theta := 0. And what we're going to do is — so, let's see, in linear regression you learn a mapping from x to y: you have a discrete set of examples of x, and you fit a function mapping from x to y. So what we're going to do here is learn a mapping from s to V(s): we're going to take a discrete set of examples of s, try to figure out what V(s) is for them, and then fit a straight line — you know, try to model that relationship, right. So just as you had a finite set of examples — a finite set of houses, a certain set of values of x in your training set — for predicting housing prices, we're going to see a certain set of states, and then use that finite set of examples, with linear regression, to fit V(s), right. So that's what this initial sample is meant to do.

[00:52:55] And so this is the outermost loop of value iteration — of fitted value iteration: then, for i = 1 through m... [00:54:11] All right, so what we're going to do is go over each of these m states, and for each one of them, and for each of the actions, we're going to take a sample of k next states in order to estimate that expected value, right. And so this expectation is over s' drawn from the state-transition distribution P_{s^(i) a}: it's saying, from this state, if you take this action, where do you get to? And so these two loops — this "for i = 1 through m", and "for each
action a" — this is just looping over every state and every action, and taking k samples — sampling k examples of where you get to if you take action a in a certain state s^(i), right. And so, by taking those k samples and computing this average, q(a), right, is your estimate of that expectation. Okay — so all we've done so far is take k samples, you know, from the distribution that s' is drawn from, and average V(s') over them — oh, I'm sorry: and if I move R(s) inside, sorry, then that's q(a). Sorry — let me just rewrite this to move R(s) inside. [00:56:05] So this is written as q(a) = (1/k) sum from j = 1 to k of [ R(s^(i)) + gamma V(s'_j) ] — yes, so if you move the max and the expectation out, then this — this is q(a). [00:56:58] Next, let's set y^(i) = max_a q(a). And so, by taking the max over a of q(a), that y^(i) is your estimate of the right-hand side of value iteration. [00:57:33] And so y^(i) is your estimate for this
quantity — for the right-hand side of value iteration. [00:57:57] Now, in the original value iteration algorithm — I'm just using "VI" to abbreviate value iteration — what we did was set V(s^(i)) := y^(i), right? That is, in the original value iteration algorithm, we would compute the right-hand side — this purple thing — and then set V(s^(i)) equal to that: we just set the left-hand side equal to the right-hand side. But in fitted value iteration, you know, V(s) is now approximated by a linear function, so you can't just go into a linear function and set its value at individual points. So what we're going to do instead, in fitted VI, is use linear regression to make V(s^(i)) as close as possible to y^(i). But V(s^(i)) is now represented as a linear function of the state — a linear function of the features of the state: V(s^(i)) = theta^T phi(s^(i)) — and you want that to be close to y^(i). And so the final step is: run linear regression to choose the parameters theta that minimize the squared error, theta := argmin_theta (1/2) sum from i = 1 to m of ( theta^T phi(s^(i)) - y^(i) )^2. [01:00:19] Oh yes — just let me make my curly braces match. [01:00:34] Okay. So that's fitted value iteration.

[01:00:50] Oh, this one? Oh — no, the m is used differently. So when we were learning a model, m was just how many times you fly the helicopter in order to build a model — the number of times you fly the helicopter in order to build a physics model of the helicopter dynamics. That has nothing to do with this m, which is the number of states you use in order to, sort of, anchor the regression. So I think the way to think about this is: you want to learn a mapping from states to V(s), and so this sample of m states is — we're going to choose m states on the x-axis.
[01:01:40] Right, so that m is the number of points you choose on the x-axis, and then in each iteration of value iteration [01:01:45] we're going to go through this procedure: you have s_1 up to s_m, and for each of these you're going to compute some value [01:01:58] y_i using this procedure, and then you fit a straight line to this sample of y_i's. [01:02:17] Think of the way you build a model and the way you apply fitted value iteration as two completely separate operations. So you could have one team of 10 engineers fly the helicopter around, you know, a thousand times, build a model, run linear regression, and they have a model; then they could publish the model on the internet, and a totally different team could download their model and do this. The second team doesn't need to talk to the first team at all, other than downloading the model off the internet. [01:02:49] Oh yes, good question. You mean this thing that's sampled K times? [01:03:02] Right, yep, that's a great question. Yes, that was one of my next points, which is: the reason you sample from this distribution is because you are using, well, you should do this if you're using a stochastic simulator, [01:03:16] right. And actually, let me also ask you guys: what should you do, how can you simplify this algorithm, if you use a deterministic simulator as opposed to a stochastic simulator? [01:03:34] Let's see. So if you have a deterministic simulator, then, you know, given a certain state and a certain action, it will always map to the exact same s', right? So how can you simplify this? [01:03:49] Yep. Yes: if you have a deterministic simulator, you can set K equal to one and sample only once, [01:04:05] because this distribution always returns the same value, so all of these K samples would be exactly the same.
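A toy illustration of this point (mine, not the lecture's): the inner average is a Monte Carlo estimate of an expectation, and when the simulator is deterministic every draw is identical, so one sample already gives the exact value.

```python
import random

def estimate_expectation(sample, k):
    """Approximate an expectation by averaging k draws from a simulator."""
    return sum(sample() for _ in range(k)) / k

random.seed(0)

stochastic = lambda: 5.0 + random.gauss(0.0, 1.0)   # noisy simulator, E[X] = 5
deterministic = lambda: 5.0                          # always returns the same s'

# Stochastic case: averaging many draws approaches the true mean.
approx = estimate_expectation(stochastic, 10000)
# Deterministic case: one draw already equals the expectation exactly,
# so sampling K times would be wasted work.
exact = estimate_expectation(deterministic, 1)
```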
So you might as well just do this once rather than K times. [01:04:34] This one? Oh no, this is actually square brackets. The thing is, we're trying to approximate this expectation, and the way [01:04:43] you approximate the mean is, you know, you sample K times and take the average, right? So what we've done here, in order to approximate this expectation, [01:04:52] is draw K samples, sum over them, and divide by K: you average over the K samples. [01:05:20] Let's see. So how do you choose m, and how do you test for overfitting? So, you know, once you have a model, one of the nice things about model-based RL is, let's say that phi(s) is 50 features; so let's [01:05:40] say you chose 50 features to approximate the value function of your inverted pendulum system. Then we know that you're going to be fitting [01:05:49] linear regression to this 50-dimensional state space.
I mean, this step [01:05:53] here, this is really linear regression. And so you can ask: if you want to run [01:06:01] linear regression with 50 parameters, how many examples do you need to fit the linear regression? And I would say, you know, if m were maybe 500, right, maybe that would be okay: you'd have ten examples per parameter to fit 50 parameters. But if, for computational [01:06:17] reasons, it doesn't run too slowly to even set m equal to 1000 or even 5000, then there's no harm in letting m be bigger. So usually m is just set to be as big [01:06:30] as you feel like, subject to the program not taking too long to run. Because, you know, unlike supervised learning, where if you're [01:06:40] fitting data to housing prices you need to go out and collect data, right, off Craigslist or Zillow or Trulia or Redfin or whatever, about prices of houses, and so data is [01:06:56] expensive to collect in the real world; once you have a model, you could set m equal to 5,000 or 10,000 or 100,000, and [01:07:03] then your algorithm will run more slowly, but so long as the algorithm doesn't run too slowly, there's no harm in setting m to be bigger. [01:07:18] Cool. So I know there's a lot going on in this algorithm, but this is fitted [01:07:26] value iteration, and if you do this, you can get reasonable behavior on a lot of robots by choosing sensible features and learning value [01:07:37] functions that approximate the expected payoff of a robot starting off in different states. Okay. Now just a few [01:07:52] details to wrap up, again some practical aspects of how you do this after you've learned all these parameters. [01:08:32] Oh, yes, thank you. Yes: so in this expression, where do you get V(s'_j) from? You would get this [01:08:43] from theta transpose phi(s'_j), using the parameters theta from the last iteration of fitted value iteration. Just as in value iteration, these are the [01:08:57] values from the last iteration that you use to update in the new iteration; so you use the last value of theta as it was updated. [01:09:07] Oh, and one other thing you could do. I talked about the linear regression version of this algorithm, and, you know, this [01:09:25] whole exercise is about generating a sample of s and y so you can apply linear regression to predict the value of y from the values of s, right? But [01:09:35] there's nothing in this algorithm that says you have to use linear regression. Now that you've generated this data set, there's this box up [01:09:43] here, and this is linear regression, right, but you don't have to use linear regression. In modern, you know, deep [01:09:50] reinforcement learning, one of the ways to go from reinforcement learning to deep reinforcement learning is to just use a neural network for [01:09:57] this step instead; then you call that deep reinforcement learning, and hey, it's legit, you [01:10:03] know. But you can also use locally weighted linear regression, or whatever regression algorithm you want, in order [01:10:12] to estimate y as a function of the state s. Yeah, and if you use a neural [01:10:19] network, it relieves the need to choose the features phi as well: you can feed in the raw features, you know, pole angle, pole [01:10:25] orientation, and use a neural network to learn the mapping, as in supervised learning. All right. [01:10:37] So one last important, I guess practical, implementation detail, which is: fitted [01:10:47] VI gives just an approximation to V*, [01:10:58] and this implicitly defines pi*, [01:11:08] right, because the definition of pi* is that pi*(s) is the argmax over actions a of the expected value of V*(s'). [01:11:36] So when you're running a robot, you know, you need to execute a policy: given the state, you pick an action.
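(An aside on the "regression box" just mentioned: the fitting step is pluggable. Below is a hedged sketch with illustrative names; the scikit-learn-style fit/predict interface is an assumption for clarity, and a neural network or locally weighted regression could be dropped into the same slot.)

```python
import numpy as np

class LinearV:
    """Least-squares value regressor; a stand-in for any 'regression box'."""
    def fit(self, X, y):
        X = np.asarray(X, float)
        self.theta, *_ = np.linalg.lstsq(X, np.asarray(y, float), rcond=None)
        return self
    def predict(self, X):
        return np.asarray(X, float) @ self.theta

def fit_value_function(features, ys, regressor):
    # Fitted VI's regression step: anything exposing fit/predict works here,
    # e.g. a neural network or locally weighted regression instead of LinearV.
    return regressor.fit(features, ys)
```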
And having computed V*, it only implicitly [01:11:50] defines the optimal policy pi*. And so if you're running a rover, or if you're running a robot in real time, then, [01:12:02] you know, if you're flying a helicopter you might have to choose control actions at ten hertz, meaning ten [01:12:07] times a second: given the state you have, you choose an action. If you're building a self-driving car, again, a ten-hertz [01:12:14] controller, choosing a new action maybe ten times a second, would be pretty reasonable. But how do you compute [01:12:20] this expectation and this maximization ten times a second? In fitted value iteration we used [01:12:32] K samples to approximate the expectation, but if you're running this in real time on a helicopter, you probably don't want to. At least, I don't know; for my [01:12:58] robotics implementations I have been reluctant to use a random number generator, right, in the inner loop of how we control a helicopter. It might work, [01:13:07] but, you know: to compute this argmax it's an approximate [01:13:12] expectation, and do you really want to be running a random number generator on a helicopter, where if you're really unlucky [01:13:18] the random number generator draws an unlucky value and causes the helicopter to do something bad? Again, just emotionally, I don't [01:13:28] feel very good if a self-driving car has a random number generator in the loop of how it's [01:13:34] choosing to drive. So just as a practical matter, there are a couple of tricks that [01:13:43] people often use. The simulator [01:13:58] is often of this form: [01:14:15] most simulators are of this form, where the next state is equal to some function of the [01:14:21] previous state and action, plus some noise, s_{t+1} = f(s_t, a_t) + epsilon_t. And so one thing that is [01:14:27] often done for your deployment, for the actual [01:14:39] policy you implement on the robot, is to set [01:14:44] epsilon_t equal to zero and set K equal to [01:14:50] one, right. And so this is a reasonable way to make this policy run [01:14:58] on a helicopter, which is: during training [01:15:02] you do want to add noise to the simulator, because it causes the policy you [01:15:07] learn to be much more robust to little errors in the simulator. Your simulator [01:15:11] is always going to be a little bit off; you know, maybe it didn't quite simulate wind [01:15:14] gusts, or when you turn, the helicopter doesn't bank by exactly the right amount. Some [01:15:18] of it, in practice, is always a little bit off, so it's important to have [01:15:23] noise in the simulator in model-based RL. But when you're deploying this on a [01:15:27] physical robot, one thing you could do that's very reasonable is just get rid [01:15:32] of the noise and set K equal to one. And so what you would do is: [01:15:46] whenever you're in the state s, [01:15:58] pick the action a according to the argmax [01:16:05] over a of V(f(s, a)). So this f is the f from here; this is the simulator with the [01:16:25] noise removed, okay? And so what you would do is, actually, you know, [01:16:32] computers are now fast enough that you could do this ten times a second, [01:16:34] right. If you want to control a helicopter or a self-driving car at ten hertz, you can [01:16:37] actually easily do this, you know, ten times a second. Which is: your car or your [01:16:42] helicopter is in some physical state in the world, so you know what s is, and so [01:16:47] you can quickly, for every possible action a that you could take, use a [01:16:53] simulator to simulate where your helicopter will go if you were to take [01:16:58] that action. So go ahead and run your simulator [01:17:00] once for each possible action you could take, right; computers are actually [01:17:04] fast enough to do this in real time. And then for each of the possible next [01:17:09] states you could get to, compute V applied [01:17:12] to that. So this is really, rather than s' [01:17:15] drawn from P_{sa}, using this, the deterministic simulator. [01:17:32] Right, so every tenth of a second you could use the simulator to try out every single [01:17:37] possible action, figure out where you would go under each and every single possible action, and apply your [01:17:45] value function to see, of all of these possible actions, which one gets my [01:17:50] helicopter, you know, in the next one tenth of a second, to the state that [01:17:55] looks best according to the value function you've learned from fitted [01:17:58] value iteration. And it turns out if you do this, then this is how you [01:18:06] actually implement something that runs in real time. And oh, I'll just [01:18:10] mention, you know, the idea of training with a stochastic simulator and [01:18:15] then just setting the noise to zero: it's one of those things that's not very [01:18:19] rigorously justified, but in practice this works well.
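A hedged sketch of the deployment-time loop just described (all names are illustrative, not from the lecture): each control tick, run the noise-free simulator once per candidate action, and pick the action whose predicted next state the learned value function scores highest.

```python
import numpy as np

def act(s, actions, f, v_hat):
    """Deployment-time policy: pi(s) = argmax_a V(f(s, a)).

    f     : deterministic simulator (noise term set to zero),
            f(s, a) -> next state
    v_hat : learned value function, v_hat(s') -> float
    Call once per control tick (e.g., 10 Hz for a helicopter or car).
    """
    scores = [v_hat(f(s, a)) for a in actions]   # one simulation per action
    return actions[int(np.argmax(scores))]
```

For example, with the toy dynamics f(s, a) = s + a and v_hat(s) = -|s|, calling `act(3.0, [-1.0, 0.0, 1.0], ...)` picks the action that steers the state toward the origin.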
Oh yes. So, for the [01:18:28] purposes of this, you can assume you have a discretized action space, and it turns [01:18:33] out that for a self-driving car it's actually okay to discretize the action [01:18:36] space; for a helicopter we tend not to discretize the action space. But it turns [01:18:43] out, if f is a continuous function, then you can use other methods as well, right; [01:18:48] this is about optimizing over the action. I didn't mean to talk about this, so sorry, [01:18:51] this is getting a little bit deeper, but even if a were a continuous value, you can [01:18:57] actually use real-time optimization algorithms to very quickly try to [01:19:01] optimize this function, even as a function of the continuous action a. Actually, [01:19:04] there's a literature on something called model predictive control, where you can [01:19:08] actually do these optimizations in real time. And with that, final thoughts; last question.
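For that continuous-action case, a simple derivative-free search can stand in for the real-time optimizers the lecture alludes to. This coarse-to-fine bracketing is purely an illustrative sketch of my own construction (real model predictive control also plans over a horizon):

```python
import numpy as np

def act_continuous(s, f, v_hat, a_lo, a_hi, iters=30):
    """Pick a continuous action by numerically maximizing V(f(s, a)).

    Assumes the objective is unimodal in a over [a_lo, a_hi]; repeatedly
    evaluates a small grid and shrinks the bracket around the best action.
    """
    for _ in range(iters):
        grid = np.linspace(a_lo, a_hi, 9)
        scores = [v_hat(f(s, a)) for a in grid]
        best = int(np.argmax(scores))
        # shrink the bracket around the current best action
        a_lo = grid[max(best - 1, 0)]
        a_hi = grid[min(best + 1, len(grid) - 1)]
    return 0.5 * (a_lo + a_hi)
```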
you take an action and then your [01:19:39] you take an action and then your helicopter do something there'll be some [01:19:41] helicopter do something there'll be some wind your model may be off and so you [01:19:43] wind your model may be off and so you would then a tenth of a second later [01:19:45] would then a tenth of a second later take another you know GPS reading [01:19:47] take another you know GPS reading accelerometer reading magnetic compass [01:19:49] accelerometer reading magnetic compass reading and use the whole copper sensor [01:19:51] reading and use the whole copper sensor to tell you where you actually are no [01:19:53] to tell you where you actually are no cool okay cool all right I hope yeah [01:19:57] cool okay cool all right I hope yeah hopefully this was helpful I feel like [01:20:00] hopefully this was helpful I feel like you know the I think that's fascinating [01:20:01] you know the I think that's fascinating that the excitement that by myself [01:20:02] that the excitement that by myself driving cars and final hug calls and all [01:20:04] driving cars and final hug calls and all that it gives both down to equations [01:20:06] that it gives both down to equations like these though I think that's not [01:20:07] like these though I think that's not cool okay that's great thanks I'll see [01:20:09] cool okay that's great thanks I'll see you guys next week ================================================================================ LECTURE 019 ================================================================================ Lecture 19 - Reward Model & Linear Dynamical System | Stanford CS229: Machine Learning (Autumn 2018) Source: https://www.youtube.com/watch?v=0rt2CsEQv6U --- Transcript [00:00:04] okay hey everyone so welcome to the [00:00:08] okay hey everyone so welcome to the final week of the class um what I want [00:00:13] final week of the class um what I want to do today is share with you a few [00:00:15] to do today is share 
with you a few generalizations of reinforcement [00:00:18] generalizations of reinforcement learning and of mdps so you've learned [00:00:22] learning and of mdps so you've learned about the basic MVP formula zone of [00:00:24] about the basic MVP formula zone of states action stations info releases [00:00:26] states action stations info releases compactor and rewards the first thing [00:00:30] compactor and rewards the first thing you see today is to you know slight [00:00:33] you see today is to you know slight generalizations of this framework to [00:00:35] generalizations of this framework to state action rewards and to find the [00:00:36] state action rewards and to find the horizon MVPs that make it a little bit [00:00:39] horizon MVPs that make it a little bit easier for you to model certain types of [00:00:41] easier for you to model certain types of problems certain types of robots or [00:00:43] problems certain types of robots or certain types of factory automation [00:00:44] certain types of factory automation problems will be easier to model with [00:00:46] problems will be easier to model with these two small generalizations so talk [00:00:50] these two small generalizations so talk about those first and then second we'll [00:00:52] about those first and then second we'll talk about linear dynamical systems last [00:00:56] talk about linear dynamical systems last Wednesday you saw a fitted value [00:00:58] Wednesday you saw a fitted value iteration which was a way to solve for [00:01:03] iteration which was a way to solve for an MDP even when the state space may be [00:01:05] an MDP even when the state space may be infinite even when the state space is [00:01:07] infinite even when the state space is several numbers was RN so it's an [00:01:10] several numbers was RN so it's an infinite list of states or contingency [00:01:12] infinite list of states or contingency other states we use fitted value [00:01:14] other states we use fitted value iteration 
in which we're to use a [00:01:15] iteration in which we're to use a functional approximator right like [00:01:17] functional approximator right like linear regression to try to approximate [00:01:19] linear regression to try to approximate the value function there's one very [00:01:21] the value function there's one very important special case of an MDP where [00:01:24] important special case of an MDP where even if the state space is infinite of [00:01:27] even if the state space is infinite of continuous real numbers does that well [00:01:31] continuous real numbers does that well there's one important special case we [00:01:32] there's one important special case we can still compute the value function [00:01:35] can still compute the value function exactly without needing to use you know [00:01:38] exactly without needing to use you know like a linear function approximate or to [00:01:40] like a linear function approximate or to use something like linear regression in [00:01:41] use something like linear regression in the inner loop a fitted value iteration [00:01:43] the inner loop a fitted value iteration and so you also see that today and when [00:01:47] and so you also see that today and when you can take a robot or some factory [00:01:50] you can take a robot or some factory automation tools or whatever problem and [00:01:52] automation tools or whatever problem and model within this framework it turns out [00:01:54] model within this framework it turns out to be incredibly efficient because you [00:01:56] to be incredibly efficient because you can fit a continuous for the value [00:01:58] can fit a continuous for the value function as a function of the states [00:02:00] function as a function of the states without needing to approximate you can [00:02:03] without needing to approximate you can just compute the exact value function [00:02:04] just compute the exact value function even though the state space is [00:02:06] even though the state space is 
continuous so this is a framework that [00:02:09] continuous so this is a framework that doesn't apply to all problems but when [00:02:11] doesn't apply to all problems but when it does apply is incredibly convenient [00:02:13] it does apply is incredibly convenient gruffly efficient so you see that in a [00:02:16] gruffly efficient so you see that in a second half of today oh yes a 1:1 [00:02:21] second half of today oh yes a 1:1 tactical oh two two tactical things um [00:02:23] tactical oh two two tactical things um let's see from the questions that we're [00:02:26] let's see from the questions that we're getting from students um since they're [00:02:27] getting from students um since they're asking us oh how is grading and CSU's [00:02:29] asking us oh how is grading and CSU's you know and whatever I did well and [00:02:30] you know and whatever I did well and does you know didn't do so on that um [00:02:32] does you know didn't do so on that um for people taking a class pass/fail c- [00:02:36] for people taking a class pass/fail c- or better as a passing great this is [00:02:37] or better as a passing great this is quite I think there's a standard at [00:02:39] quite I think there's a standard at Stanford and I think sisters mignon has [00:02:43] Stanford and I think sisters mignon has historically been one of the heavy [00:02:44] historically been one of the heavy workload classes we know that people [00:02:46] workload classes we know that people taking sis you know I yeah I see a few [00:02:48] taking sis you know I yeah I see a few has nothing people King sisters end up [00:02:55] has nothing people King sisters end up you know putting a lot of work on this [00:02:56] you know putting a lot of work on this class maybe frankly more than average [00:02:58] class maybe frankly more than average for even Stanford courses and so we've [00:03:01] for even Stanford courses and so we've usually been quite nice with respect to [00:03:04] usually been quite nice with 
[00:03:06] So, just so you know, don't sweat it too much; do work hard, especially on the final projects, but don't sweat it too much. [00:03:17] Oh, and on Wednesday after class I had a funny question. After I talked about the fitted value iteration algorithm, a student came up to me and said, hey Andrew, you know, this algorithm you just told us about, does it actually work? Does it actually work on the autonomous helicopter? And the answer is yes: the algorithms I'm teaching, you know, if you do fitted value iteration as you learned last week, it will work for flying an autonomous helicopter at low speed. To fly very high speeds, very dynamic maneuvers, crazy things like flipping upside down, you need a bit more than that. But for flying a helicopter at low speeds, the exact algorithm that you learned last Wednesday, as well as the algorithms you'll learn today, including LQR, you know, if you ever actually need to fly an autonomous helicopter, all of these algorithms work decently well, work quite well, for flying a helicopter at low speeds. Maybe not at very, very high speeds and crazy dynamic maneuvers; at those speeds these algorithms, pretty much as I'm presenting them, won't work. So, okay.

[00:04:16] So the first generalization of the MDP framework that I want to describe is state-action rewards. So far we've had the reward be a function mapping from the states to the set of real numbers. State-action rewards, this is a slight modification to the MDP formalism, where now the reward function R is a function mapping from states and actions to the real numbers, R : S × A → ℝ. And so in an MDP you start off in state s0, take an action a0, then based on that you get to s1, take an action a1, get to state s2, take an action a2, and so on.
[00:05:15] And with state-action rewards, the total payoff is now written like this: R(s0, a0) + γ R(s1, a1) + γ² R(s2, a2) + ⋯. And this allows you to model that different actions may have different costs. For example, in the little robot wandering around the maze example, maybe it's more costly for the robot to move than to stay still. And so if you have an action for the robot to stay still, the reward can be, you know, zero for staying still, and a slight negative reward for moving, because you're burning fuel, because you're using electricity. [00:06:07] And so in that case Bellman's equation becomes this: V*(s) = max over a of [ R(s, a) + γ Σ_{s'} P_sa(s') V*(s') ], where you still break down the value of a state as a sum of the immediate reward plus, you know, the expected future rewards, but now the immediate reward you get depends on the action that you take in the current state, right? So this is Bellman's equation. [00:06:56] And notice that previously, you know, we had the max kind of over here, inside, but now you need to choose the action a that maximizes your immediate reward plus your discounted future rewards, which is why the max kind of moved, right? If you look at this equation, the max had to move outside, because now the immediate reward you get depends on the action you choose at this step in time as well. This models that different actions may have different costs. Yeah? [00:07:31] Oh, yes, yes, yes, the max applies to the entire expression, right, yeah. [00:07:53] Let's see, so in this formulation the reward is deterministic based on the state and action? Yes, that is correct. So in this formulation the reward depends on the current state and the current action, but not on the next state you get to, okay. Oh, and by the way, there are multiple variations of formulations of MDPs, but this is one convenient one.
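The Bellman equation for state-action rewards discussed above can be sketched as a single backup in code. Everything here, the state names, the probabilities, and the current value estimates, is a made-up toy, not from the lecture; the reward table just encodes the "staying still is free, moving costs a little" idea:

```python
gamma = 0.9                                   # discount factor
R_sa = {"stay": 0.0, "move": -0.1}            # R(s, a): moving costs electricity
P = {"stay": {"here": 1.0, "there": 0.0},     # P_sa(s'): transition probabilities
     "move": {"here": 0.2, "there": 0.8}}     # from the current state s
V = {"here": 0.0, "there": 5.0}               # current value estimates V(s')

# V*(s) = max_a [ R(s, a) + gamma * sum_{s'} P_sa(s') V(s') ]
# note the max is over the whole bracket, not just the future-reward sum
backup = max(R_sa[a] + gamma * sum(p * V[sp] for sp, p in P[a].items())
             for a in R_sa)
print(backup)  # "move" wins: -0.1 + 0.9 * (0.2*0.0 + 0.8*5.0) = 3.5
```

Because R depends on the action, the max cannot be pulled inside past the immediate-reward term, which is exactly why it sits outside the whole bracket.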
[00:08:17] And I guess it lets you model different costs per action. In flying a helicopter, a common formulation of this would be to say that yanking aggressively on the control stick should be assigned a higher cost, because yanking the control stick aggressively causes the helicopter to jerk around more, and so maybe you want to penalize that by setting a reward function that, you know, penalizes very aggressive maneuvers. So these are ways that this gives you, as the problem designer, sort of more flexibility, right? [00:08:55] And then finally, so I'm going to just write this on top: in this formulation, the optimal action, right, so in order to compute the value function you can still use value iteration, which is now, you know, V(s) is just updated as basically the right-hand side of Bellman's equation, V(s) := max_a [ R(s, a) + γ Σ_{s'} P_sa(s') V(s') ]. So the iteration works just fine for the state-action reward formulation as well. And if you apply value iteration until convergence to V*, then the optimal action is just the argmax: [00:09:49] right, so π*(s) is just the argmax over a of this same expression up on top. When you're in a given state, you want to choose the action that maximizes your immediate reward plus your expected future rewards, okay. [00:10:06] So, I think, just maybe another example: if you want to use an MDP to plan the shortest route for a robot, say to drive from here at Stanford up to San Francisco, right, then if it costs different amounts to drive on different road segments, because of traffic or because of the speed limits on different roads, then this allows you to say, well, driving this distance on this road costs this much in terms of fuel consumption, or in terms of time, and so on. [00:10:43] Or in factory maintenance: if you send in a team to maintain a machine, that has a certain cost.
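The value iteration loop and the argmax policy extraction described above can be sketched in NumPy. The two-state, two-action MDP below is invented purely for illustration (action 0 is "stay", action 1 is "move", with a small movement cost in state 0, echoing the maze-robot example):

```python
import numpy as np

n_states, n_actions = 2, 2
gamma = 0.9

# R[s, a]: state-action rewards (made-up numbers)
R = np.array([[0.0, -0.1],
              [1.0,  0.9]])
# P[s, a, s']: transition probabilities P_sa(s')
P = np.array([[[1.0, 0.0], [0.2, 0.8]],
              [[0.0, 1.0], [0.9, 0.1]]])

V = np.zeros(n_states)
for _ in range(500):                 # repeat the Bellman backup to convergence
    # Q[s, a] = R(s, a) + gamma * sum_{s'} P_sa(s') V(s')
    Q = R + gamma * P @ V
    V_new = Q.max(axis=1)            # max over actions sits outside the sum
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new

pi = (R + gamma * P @ V).argmax(axis=1)   # greedy policy pi*(s) = argmax_a
print(V, pi)
```

Here state 0 prefers to move (toward the rewarding state despite the -0.1 cost) and state 1 prefers to stay, so the extracted policy is `[1, 0]`; the discount factor makes the backup a contraction, so the loop converges regardless of initialization.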
[00:10:51] Versus if you do nothing, that has a different cost; but then if the machine breaks down, that has yet another cost. Okay, so that's the first generalization. [00:10:59] The second generalization is the finite-horizon MDP. [00:11:09] And in a finite-horizon MDP we're going to replace the discount factor γ with a horizon time T, and we'll just forget about the discount factor. And in the finite-horizon MDP, the MDP will run for a finite number of T steps. So you start in state s0, take an action a0, get to s1, take action a1, and so on, until you get to state sT, take an action aT at time step T, and then the world ends and we're done, right? And so the payoff is this finite sum, R(s0, a0) + R(s1, a1) + ⋯ + R(sT, aT), and, kind of, there's just a full stop at the end of that. Um, you can also apply discounting, but usually when you have a finite-horizon MDP maybe there's no need to apply discounting. And so this models a problem where there are, you know, T time steps, and then the world ends after that.
[00:12:25] Right, or, well, "the world ends" sounds a bit dire, but, you know, if you fly an airplane, or if you fly a helicopter, and you only have fuel for, you know, 30 minutes, right, an RC helicopter or whatever has 20, 30 minutes of fuel, then you know that you're going to run this thing for 30 minutes and then you're done. And so the goal is to accumulate as many rewards as possible up until you, you know, run out of fuel, and then you have to land, right? So that would be an example of a finite-horizon MDP. [00:12:59] And the goal is to maximize this payoff, or the expected payoff, over these T time steps, okay. [00:13:10] Now, one interesting property of a finite-horizon MDP is that the action you take may depend on what time it is on the clock, right? So there's a clock marching from, you know, t = 0 up to t = T, whereupon, right, the world ends; whereupon that's all the rewards the MDP is going to collect. And one interesting effect of this is that the optimal action may depend on what the time is on the clock. [00:13:51] So let's say your robot is running around this maze, and there's a small +1 reward here and a much larger +10 reward there, and let's say your robot is here, in between, right? Then the optimal action, whether you go left or go right, will depend on how much time you have left on the clock. If you have only, you know, two or three ticks left on the clock, it's better to just rush over and get the +1; but if you still have, you know, 10, 20 ticks left on the clock, then you should just go and get the +10 one. And so in this example π*(s) is not well-defined, because, well, the optimal action to take when your robot is here, in this position, whether you go left or whether you go right, actually depends on what time it is on the clock.
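The left-versus-right intuition can be checked numerically with backward induction over the remaining time. The one-dimensional corridor below is invented to stand in for the maze: a +1 reward sits in cell 0, a +10 reward in cell 6, the robot starts in cell 2, and (for simplicity) the reward depends only on the current cell:

```python
# Invented corridor: cell 0 holds a +1 reward, cell 6 a +10 reward.
r = [1, 0, 0, 0, 0, 0, 10]
n_s, T = 7, 10                      # number of cells, horizon
LEFT, RIGHT = 0, 1

def step(s, a):                     # deterministic move, clamped at the walls
    return max(0, min(n_s - 1, s - 1 if a == LEFT else s + 1))

# Backward induction: V[t][s] = best total payoff from time t onward.
V = [[0.0] * n_s for _ in range(T + 2)]
pi = [[LEFT] * n_s for _ in range(T + 1)]
for t in range(T, -1, -1):
    for s in range(n_s):
        qs = [r[s] + V[t + 1][step(s, a)] for a in (LEFT, RIGHT)]
        V[t][s] = max(qs)
        pi[t][s] = qs.index(max(qs))

# From cell 2, the optimal direction depends on the time on the clock:
print(pi[0][2], pi[8][2])   # plenty of time -> RIGHT; near the deadline -> LEFT
```

With ten ticks remaining the robot heads right for the +10; with only two ticks left the +10 is out of reach and it heads left for the +1, so the policy really is indexed by time.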
[00:14:45] And so π* in this example should be written instead as π*_t(s), with a subscript t, because the optimal action depends on what time t it is. [00:15:00] The technical term for this is that this is a non-stationary policy, and non-stationary means it depends on the time; it actually changes over time, right? Whereas, in contrast, up until now we've been saying, you know, π*(s) is the optimal policy; before this formalism, right, it was π*(s), and that was a stationary policy, and stationary means it does not change over time, okay. [00:15:45] So one thing that, um, I didn't quite prove, but that was implicit, was that the optimal action you take in the original formulation is the same action, right, no matter what time it is in the MDP. So in the original formulation that you saw last week, the optimal policy was stationary, meaning that the optimal policy is the same policy no matter what time it is; it doesn't change over time. Whereas in the finite-horizon MDP setting, the optimal policy, you know, the optimal action, changes over time, and so this is a non-stationary policy. So stationary versus non-stationary just means: does it change over time, or does it not change over time, okay? [00:16:25] And so if you're using a non-stationary policy anyway, you can also build an MDP with non-stationary transition probabilities and non-stationary rewards. [00:16:52] Actually, so maybe here's an example. Um, let's say you're driving from campus, from Palo Alto, to San Francisco, and we know that rush hour is at, what, like 5 p.m. or 6 p.m. or something, right? And maybe the weather forecast even says it's going to rain at 6 p.m. or something, right? So you know that the dynamics of how you drive your car from here to San Francisco will change over time, as in, the time it takes, you know, to drive on a certain segment of the road is a function of time. And if you want to build an MDP to solve for the best way to drive from here to San Francisco, say, then the state transitions, so s_{t+1}, is drawn from state transition probabilities indexed by the state at time t and the action at time t: s_{t+1} ~ P^(t)_{s_t, a_t}. And if these state transition probabilities change over time, then when you index them by the time t, this would be an example of non-stationary state transition probabilities, okay. [00:17:55] Or, alternatively, if you want non-stationary rewards, then you can have a superscript t, R^(t)(s_t, a_t), which is the reward you get for taking a certain action, for being in a certain state, at a certain time, okay. So all of these are different variations of MDPs.
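To make the time indexing concrete, here is a small sketch of simulating one trajectory under non-stationary dynamics; the array `P[t, s, a, s']`, its sizes, and the random numbers in it are all invented for illustration, and the random action stands in for whatever policy is being followed:

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_s, n_a = 4, 3, 2

# P[t, s, a, s']: transition probabilities that change with the time t
P = rng.random((T, n_s, n_a, n_s))
P /= P.sum(axis=3, keepdims=True)        # normalize so each row sums to 1

s, traj = 0, [0]
for t in range(T):
    a = int(rng.integers(n_a))           # stand-in for some policy's action
    s = int(rng.choice(n_s, p=P[t, s, a]))   # s_{t+1} ~ P^(t)_{s_t, a_t}
    traj.append(s)
print(traj)
```

The only difference from the stationary case is the extra leading `t` index when looking up the transition distribution; a non-stationary reward would get the same treatment, `R[t, s, a]` instead of `R[s, a]`.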
[00:18:17] And so maybe just a few examples of when you would want a finite-horizon MDP, or want to use non-stationary state transitions. So let's see: if you are flying an airplane, right, for some airplanes, for very large commercial airplanes, sometimes over a third of the weight of the airplane comes from the fuel, right. So actually, if you take a large commercial airplane, you know, when you take off from SFO and you fly to, oh, wherever you guys prefer to fly, say to London or something, by the time the plane lands it's a much lighter airplane than when you took off, because maybe, sometimes, like a third of the weight has disappeared, you know, because of burning fuel. And so the dynamics there, how an airplane flies between takeoff and landing, is actually different, because the weight is dramatically different. And so this would be one example of where the state transition probabilities change, in a pretty predictable way, right? [00:19:23] Or, right, I already mentioned weather forecasts, right, weather forecasts or forecasts of traffic for cars that would be driving here; or, yeah, if you're driving over different types of terrain over time, you know it's going to rain tomorrow, you know it's going to rain tonight and the ground will turn muddy, you know, and then all the traffic will turn bad. [00:19:54] And then on industrial automation, um, I have friends who work on industrial automation, and I think that maybe one example is: if you run a factory 24 hours a day, then the cost of labor, you know, getting people to come into the factory to do some work at noon is actually easier, right, and less costly, than getting someone to show up at the factory to do some work at 3:00 a.m., right? And so depending on labor availability over time, the cost of taking different actions, and the likelihood of transitioning to different states, the transition probabilities, can vary over the 24-hour clock as well, right? So these are other examples of when you can have a non-stationary policy and non-stationary state transitions, okay. [00:20:43] Now, um, let's talk about how you would actually solve a finite-horizon MDP. And I think, for the sake of simplicity, for the most part I'm going to not bother with non-stationary transitions and rewards; for the most part I'm going to forget about, you know, the fact that these could be varying. I mentioned it briefly, but I want to focus on the finite-horizon aspect. [00:21:11] So let me define the optimal value function. [00:22:03] So this, V*_t(s), is the optimal value function for time t, for starting in state s; so this is the expected total payoff starting in state s at time t, if you execute, you know, the best possible policy: V*_t(s) = E[ R(s_t, a_t) + R(s_{t+1}, a_{t+1}) + ⋯ + R(s_T, a_T) | s_t = s, π* ].
[00:22:27] So now the optimal value function depends on what time it is, because if you look at that example with the +1 reward on the left and the +10 reward on the right, depending on how much time you have left on the clock, the amount of reward you can accumulate can be quite different: if you have more time, then, you know, you have more time to get to the +10 reward, in the +1 and +10 reward example that I drew just now. [00:23:03] And so, um, in this setting, value iteration becomes the following; it actually becomes a dynamic programming algorithm, as you'll see in a second, okay. Which is that: [00:23:47] V*_t(s) is equal to the max over a of R(s, a) plus, and actually this is a question for you, so there's one missing thing here, right? So what this is saying is that the optimal value you can get when you start off in state s at time t is the max over all actions of the immediate reward, plus the sum over s' of the state transition probability P_sa(s') times V* of s', and then what should go in that box, the time subscript on V*? Okay, cool, awesome, great, right: it's t + 1, so V*_t(s) = max_a [ R(s, a) + Σ_{s'} P_sa(s') V*_{t+1}(s') ]. [00:24:38] And then π*_t(s) is just, you know, the argmax over a, right, of the same thing, of this whole expression up on top. [00:25:01] And so this formula defines V*_t as a function of V*_{t+1}, so this is like the iterative step, right: given V*_{t+1}, compute V*_t. [00:25:15] And so to start this off, there's just one last thing we need to define, which is the final time step, capital T.
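Putting the recursion together with its final-step base case V*_T(s) = max_a R(s, a), the whole dynamic program fits in a few lines. The MDP below (its sizes and its random rewards and transitions) is made up for illustration; note there is no γ, since the finite-horizon formulation drops discounting:

```python
import numpy as np

rng = np.random.default_rng(1)
n_s, n_a, T = 3, 2, 5
R = rng.standard_normal((n_s, n_a))          # R[s, a], no discounting
P = rng.random((n_s, n_a, n_s))
P /= P.sum(axis=2, keepdims=True)            # each P[s, a, :] sums to 1

V = np.zeros((T + 1, n_s))
pi = np.zeros((T + 1, n_s), dtype=int)

V[T] = R.max(axis=1)                         # base case: V*_T(s) = max_a R(s, a)
pi[T] = R.argmax(axis=1)

for t in range(T - 1, -1, -1):               # backwards from T-1 down to 0
    Q = R + P @ V[t + 1]                     # Q[s, a] = R(s,a) + sum_{s'} P_sa(s') V*_{t+1}(s')
    V[t] = Q.max(axis=1)
    pi[t] = Q.argmax(axis=1)                 # policy is indexed by t: non-stationary

print(V[0], pi[0])
```

One backward sweep computes every V*_t and every π*_t exactly, in O(T · |S|² · |A|) time; there is no "iterate until convergence" as in the infinite-horizon case.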
if you start off at [00:25:45] after that right so if you start off at state as at the final time step T then [00:25:48] state as at the final time step T then you get to take an action and you get [00:25:51] you get to take an action and you get immediate reward and then there is no [00:25:53] immediate reward and then there is no next day because the world just ends [00:25:54] next day because the world just ends right after that step which is why the [00:25:58] right after that step which is why the auto value at time T is just max over a [00:26:01] auto value at time T is just max over a at the immediate reward because what [00:26:03] at the immediate reward because what happens after that doesn't matter okay [00:26:05] happens after that doesn't matter okay so this is a dynamic programming [00:26:09] so this is a dynamic programming algorithm in which this algorithm does [00:26:13] algorithm in which this algorithm does step on top defines you allows you to [00:26:16] step on top defines you allows you to compute V saw of T and then the [00:26:19] compute V saw of T and then the inductive step or the n plus 1 step I [00:26:21] inductive step or the n plus 1 step I guess is if you then having computed V [00:26:24] guess is if you then having computed V Star of T for every state s right so you [00:26:27] Star of T for every state s right so you know so you compute this for every state [00:26:28] know so you compute this for every state that's having done this you can then [00:26:30] that's having done this you can then compute V star t minus 1 using this [00:26:34] compute V star t minus 1 using this inductive step then it's not t minus 2 [00:26:37] inductive step then it's not t minus 2 and so on down to V star of 0 so you [00:26:41] and so on down to V star of 0 so you compute this for every state and then [00:26:43] compute this for every state and then based on these you can compute no sorry [00:26:46] based on these you can compute no sorry it's PI star of 
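As a concrete sketch of the dynamic-programming recursion just described: the toy MDP below runs the base case V*_T(s) = max_a R(s, a) and then the backward inductive step down to V*_0. The particular states, actions, transition probabilities, and rewards are all made up for illustration, not from the lecture's board.

```python
import numpy as np

# Hypothetical toy MDP: 3 states, 2 actions, horizon T = 5.
# P[a][s][s'] = state transition probability P_sa(s'); R[s][a] = immediate reward.
n_states, n_actions, T = 3, 2, 5
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],   # action 0
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]],   # action 1
])
R = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 0.5]])

V = np.zeros((T + 1, n_states))          # V[t][s] = V*_t(s)
pi = np.zeros((T + 1, n_states), dtype=int)

# Base case: V*_T(s) = max_a R(s, a).
V[T] = R.max(axis=1)
pi[T] = R.argmax(axis=1)

# Inductive step: V*_t(s) = max_a [ R(s,a) + sum_s' P_sa(s') V*_{t+1}(s') ].
for t in range(T - 1, -1, -1):
    Q = R + np.einsum('aij,j->ia', P, V[t + 1])   # Q[s][a]
    V[t] = Q.max(axis=1)
    pi[t] = Q.argmax(axis=1)                      # pi*_t(s) = argmax_a Q[s][a]

print(V[0])   # optimal values with all T steps remaining
```

Note that the policy pi*_t is stored per time step, since the optimal policy in the finite-horizon setting is non-stationary.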
[00:26:46] That is, you compute pi*_t, the optimal policy, the non-stationary policy, for every state, as a function of both the state and the time. Okay. [00:27:04] And again, I don't want to dwell on this, but if you want to work with non-stationary state transition probabilities or non-stationary rewards, then this algorithm hardly changes: if your rewards and state transition probabilities are indexed by time as well, then this is just a very small modification to the algorithm. And it turns out that once you're using a finite-horizon MDP, making the rewards and state transition probabilities non-stationary is just a small tweak, right? [00:27:48] Okay, a question? Oh, non-stationary; so in the end you get a policy pi*, subscript t, of s. Oh, I see, sure, yes: is this a non-stationary policy? Yes, so the optimal policy will be a non-stationary policy, yes. I think I was using the subscript t on pi* to denote that it can be a function of time. Yes, awesome, thank you. [00:28:39] Right, and if you take capital T to infinity, this just becomes the usual value iteration, so the two frameworks are closely related; you can see the relationship between this and value iteration. One problem with taking capital T to infinity is that the values become unbounded, right? And that's actually one of the reasons why we use a discount factor: when you have an infinite-horizon MDP, when the reward goes on forever, one of the things the discount factor does is make sure that the value function doesn't grow without bound. [00:29:23] Right, and in fact, you know, if the rewards are bounded by some R_max, then when you
use discounting, then V, you know, is bounded by, I guess, R_max over (1 - gamma); it's the sum of a geometric series. But with a finite horizon T, because you only add up T rewards, it can't get bigger than T times R_max. [00:30:19] Hmm, let me think. So I think, you know, what you find is, let's see: [00:30:31] actually, let me just draw a 1D example, just to make life simpler, right? So let's say there's a +10 reward there and a +1 reward there. If you look at the optimal value function, it depends on what time it is. And let's say the dynamics are deterministic, right, so there's no noise. Then if you have two time steps left, I guess V* would be, you know, 10, 10, 10, 1, 1, 1, 0, 0, right? And so it depends on where you are. In fact, I guess if you're here, there's nothing you can do, right, this cell can't get to either reward in time; but depending on whether you're here, or here, or here, the optimal action will change, and that's what you compute with this pi*. Does this make sense? Okay, yeah, maybe the interesting cases are here and there. [00:31:29] If you actually built a little, you know, grid simulator and used these equations to compute pi* and V*, you would see that the optimal policy, when you have lots of time, will be this: wherever you are, go for the +10 reward. But when the clock runs down, the optimal policy will end up being a mix of go left and go right. All right, cool. All right.
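If you want to try the little grid simulator just mentioned, here is a hypothetical 1D version: a strip of 8 cells with a +1 reward for arriving at the left end and +10 for arriving at the right end, deterministic left/right moves, and a finite horizon. All of those specifics are assumptions for illustration, not the exact board example.

```python
# Hypothetical 1D strip: 8 cells, reward +1 for arriving at cell 0,
# +10 for arriving at cell 7; deterministic moves, clamped at the edges.
n, T = 8, 12
r = [0.0] * n
r[0], r[7] = 1.0, 10.0

def step(s, a):            # a = -1 (left) or +1 (right)
    return min(n - 1, max(0, s + a))

# Backward recursion (deterministic case):
#   V[t][s] = max_a [ r(s') + V[t+1][s'] ]  with  s' = step(s, a).
V = [[0.0] * n for _ in range(T + 1)]
pi = [[0] * n for _ in range(T + 1)]
for t in range(T - 1, -1, -1):
    for s in range(n):
        q = {a: r[step(s, a)] + V[t + 1][step(s, a)] for a in (-1, +1)}
        pi[t][s] = max(q, key=q.get)
        V[t][s] = q[pi[t][s]]

print(pi[0][1], pi[T - 1][1])   # lots of time vs. one step left, from cell 1
```

From cell 1, the computed policy goes right (toward the +10) when there is plenty of time, but goes left (toward the +1) when only one step remains, which is exactly the non-stationary behavior described above.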
[00:32:22] So the last thing I want to show you today is the linear quadratic regulator, LQR. [00:32:38] As I was saying at the start, LQR applies only in a relatively small set of problems, but whenever it applies, this is a great algorithm, and I'd, you know, use it whenever it seems feasible to apply, because it is very efficient and sometimes gives very good control policies. So let's see. [00:33:00] LQR applies in the following setting. In order to specify an MDP, we need to specify the states, the actions, the state transition probabilities, the horizon, and the rewards. I'm going to use the finite-horizon formulation, so capital T; this also works with the discounted MDP formalism, but it will be a little bit easier, a little bit more convenient, to develop in the finite-horizon setting, so let me just use that today. And LQR applies under a specific set of circumstances, which is that the set of states is R^n and the set of actions is R^d. So to specify the state transition probabilities, we need to tell you the distribution of the next state given the previous state and action, and I'm going to say that the way s_{t+1} evolves is as a linear function: some matrix A times s_t, plus some matrix B times a_t, plus some noise, so s_{t+1} = A s_t + B a_t + w_t. [00:34:16] And sorry, there's a little bit of notation overloading here, and sorry about that: A is both the set of actions and this matrix A, right? So there are two separate things with the same symbol. I think a lot of the ideas of LQR came from traditional control, from, I guess, EE and mechanical engineering, while a lot of the ideas of reinforcement learning came from computer science. So these two literatures kind of evolved separately, and when the literatures merged, you end up with clashing notation: CS people use A to denote the set of actions, and, you know, the mechanical engineering and EE people use A to denote this matrix, and when we merged these two literatures the notation ended up being overloaded. Okay. [00:35:09] Oh, and then, it turns out, one thing we'll see later is that this noise term is actually not super important. But for now, let's just assume that the noise term is distributed Gaussian, with mean zero and some covariance Sigma subscript w, so w_t ~ N(0, Sigma_w). Okay, but we'll see later that the noise will be less important than you think. [00:35:35] And so this matrix A is going to be in R^{n x n}, and this matrix B is going to be in R^{n x d}, where n and d are, respectively, the dimension of the state space and the dimension of the action space. So for driving a car, for example, we saw last time that maybe the state space is six-dimensional: if you're driving a car, the state is (x, y, theta, x-dot, y-dot, theta-dot), and the action space is, you know, steering control, so maybe a is two-dimensional, right: acceleration and steering. [00:36:18] Okay, so let's see: to specify an MDP we need to specify this five-tuple, and we've now specified three of the elements.
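The linear-Gaussian dynamics s_{t+1} = A s_t + B a_t + w_t can be sketched in a few lines. The particular A, B, Sigma_w, and dimensions below are made up for illustration.

```python
import numpy as np

# Hedged sketch of the LQR dynamics s_{t+1} = A s_t + B a_t + w_t,
# with w_t ~ N(0, Sigma_w). Dimensions n = 2 (state), d = 1 (action)
# and the matrix entries are assumptions, not from the lecture.
rng = np.random.default_rng(0)
n, d = 2, 1
A = np.array([[1.0, 0.1],        # A is n x n
              [0.0, 1.0]])
B = np.array([[0.0],             # B is n x d
              [0.1]])
Sigma_w = 0.01 * np.eye(n)       # noise covariance

def dynamics(s, a):
    """One step of the linear dynamics with Gaussian noise."""
    w = rng.multivariate_normal(np.zeros(n), Sigma_w)
    return A @ s + B @ a + w

s = np.zeros(n)
a = np.array([1.0])
s_next = dynamics(s, a)
print(s_next.shape)   # (2,)
```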
[00:36:27] The fourth one, T, is just some number, right, so that's easy. And then the final assumption we need for LQR to apply is that the reward function has the following form: the reward is R(s, a) = -(s^T U s + a^T V a), where U is n x n, V is d x d, and U and V are positive semi-definite. Okay, so these are matrices, and "greater than or equal to zero" here means positive semi-definite. [00:37:23] So the fact that U and V are positive semi-definite implies that s^T U s >= 0 and, sorry, a^T V a is also greater than or equal to zero, so the reward is always nonpositive. [00:37:43] So here's one example. Say you want to fly an autonomous helicopter, and you want, you know, the state vector to be close to zero, where the state vector captures position, orientation, velocity, and angular velocity; so if you want the helicopter to just hover in place, then maybe you want the state to be, you know,
regulated, or controlled, near some zero position. And so if you choose U equal to the identity matrix and V also equal to the identity matrix (these will be different dimensions, right: this would be an n x n identity matrix and that a d x d identity matrix), then R(s, a) ends up equal to -(||s||^2 + ||a||^2). [00:38:44] And so this allows you to specify a reward function that penalizes, you know, with a quadratic cost function, the state deviating from zero, and also the actions deviating from zero, which penalizes very large, jerky motions on the control sticks. Or if we set V equal to zero, then the second term goes away. Okay, so these are some of the cost functions you can specify in terms of a quadratic cost function.
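The quadratic reward above is a one-liner to implement. This sketch uses made-up dimensions and the hover example's choice U = I, V = I, so that R(s, a) = -(||s||^2 + ||a||^2).

```python
import numpy as np

# Hedged sketch of the quadratic reward R(s, a) = -(s^T U s + a^T V a).
# The dimensions n = 3, d = 2 are assumptions for illustration.
n, d = 3, 2
U = np.eye(n)                    # n x n, positive semi-definite
V = np.eye(d)                    # d x d, positive semi-definite

def reward(s, a):
    return -(s @ U @ s + a @ V @ a)

s = np.array([1.0, -2.0, 0.0])
a = np.array([0.5, 0.0])
print(reward(s, a))              # -(1 + 4 + 0.25) = -5.25
```

Since U and V are positive semi-definite, this reward is always at most zero, and it is maximized (zero) exactly at the hover state s = 0, a = 0.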
[00:39:17] Now again, you know, just so that you can see the generalization: if you want non-stationary dynamics, this model is quite simple to change, where you can say the matrices A and B depend on the time t, and you can also say the matrices U and V depend on the time t. So if you have non-stationary state transition probabilities or a non-stationary cost function, that's how you would modify this, but I won't use this generalization for today. [00:40:20] Now, the two key assumptions of the LQR framework are, first, that the state transition dynamics, the way your state changes, is a linear function of the previous state and action, plus some noise; and second, that the reward function is a quadratic cost function. Right, so these are the two key assumptions. And so, first, you know: where do you get the matrices A and B? One thing that we talked about on Wednesday already (and again, this will actually work if you're trying to apply LQR to fly an autonomous helicopter; this will work for a helicopter flying at low speeds) is to fly the helicopter around: you know, start at some state s_0, take an action a_0, get to state s_1, and do this until you get to s_T, right? And that was the first trial, and then you do this M times. So, as we talked about on Wednesday, fly the helicopter through M trajectories of T time steps each, and then we know that we want s_{t+1} to be approximately A s_t + B a_t, and so you can minimize: [00:42:08] right, we want the left and right hand sides to be close to each other, so you could, you know, minimize the squared difference between the left hand side and the right hand side, in a procedure a lot like linear regression, in order to fit the matrices A and B. So if you actually fly a helicopter around, collect this type of data, and fit this model to it, this will work.
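The fitting procedure described above (minimize the squared difference between s_{t+1} and A s_t + B a_t over all recorded steps, a procedure a lot like linear regression) can be sketched as follows. The "true" A and B, the noise level, and the trajectory counts here are made up so the sketch can generate its own data.

```python
import numpy as np

# Hedged sketch: fit A and B by least squares from simulated trajectories.
rng = np.random.default_rng(1)
n, d, T, M = 2, 1, 50, 10
A_true = np.array([[1.0, 0.1], [0.0, 0.95]])   # assumed, for data generation
B_true = np.array([[0.0], [0.1]])

X, Y = [], []                    # regressors [s_t; a_t] and targets s_{t+1}
for _ in range(M):               # M trajectories of T steps each
    s = rng.normal(size=n)
    for _ in range(T):
        a = rng.normal(size=d)
        s_next = A_true @ s + B_true @ a + 0.01 * rng.normal(size=n)
        X.append(np.concatenate([s, a]))
        Y.append(s_next)
        s = s_next

# Solve min || X theta - Y ||^2; theta.T stacks [A B] side by side.
theta, *_ = np.linalg.lstsq(np.array(X), np.array(Y), rcond=None)
A_hat, B_hat = theta.T[:, :n], theta.T[:, n:]
print(np.round(A_hat, 2))
```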
model for the dynamics of a helicopter at those speeds. Okay, so that's one way to do it. [00:42:57] So, let's see, method one is to learn it, right? A second method is to linearize a nonlinear model. So, um, let me just describe the idea at a high level, and I think for this it might be useful to think of the inverted pendulum, right? So that was, you know: imagine you have an inverted pendulum, a cart with a long vertical pole that you're trying to keep balanced. So for an inverted pendulum like this, if you download an open-source physics simulator, or if you have a friend with, you know, a physics degree help you derive the Newtonian mechanics equations for this (let's see, I actually tried to work through the physics equations for the inverted pendulum once; it's pretty complicated), then you might have a function that tells you: if the state is a certain position and orientation of the pole plus the angular velocity, and you apply a certain acceleration (the actions are accelerate left or accelerate right), then, you know, one tenth of a second later the state will be this, right? So your physics friend can help you derive this equation, and then maybe plus noise; well, no, let's just ignore the noise for now. And so what you have is a function [00:45:06] that maps from the state (x, x-dot, theta, theta-dot), that's the position of the cart and the angle of the pole and the velocities and angular velocities; it maps from the current state at time t, excuse me, comma a_t, right, it maps from, I guess, the current state vector and the current action to the next state vector: s_{t+1} = f(s_t, a_t). Okay. So, um, let's see what linearization means, and I'm going to use a 1D example, because I can only draw on a flat board, right; because of the two-dimensional nature of the whiteboard, I'm just going to suppose that you have s_{t+1} = f(s_t), and let me just ignore the action for now, so I have one input and one output and I can draw this more easily on the whiteboard. [00:46:06] So if you have some function like this, where the x-axis is s_t and the y-axis is s_{t+1}, and this is the function f (we'll plug the action back in later), what the linearization process does is you pick a point, and I'm going to call this point s-bar_t, and we're going to, [00:46:34] you know, take the derivative of f and fit a straight line (I draw straight lines really not very well): take the tangent straight line at this point s-bar_t, and we're going to use this green straight line to approximate the function. Okay. And so if you look at the equation for the green straight line: the green straight line is a function mapping from s_t to s_{t+1}, and s-bar is the point around which you're linearizing the function, so s-bar is a constant, and this function is actually defined by s_{t+1} is approximately f'(s-bar_t) times (s_t - s-bar_t), plus f(s-bar_t). Okay. And so s-bar_t is a constant, and this equation expresses s_{t+1} as a linear function of s_t; think of s-bar_t as a fixed number, right, it doesn't vary. So given some fixed s-bar, this equation here is actually the equation of the green straight line, which says, you know: if you use the green straight line to approximate the function f, this tells you what s_{t+1} is as a function of s_t, and this is a linear, affine, relationship between s_{t+1} and s_t. Okay, so that's how you linearize a function.
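The 1D linearization s_{t+1} ≈ f'(s-bar)(s_t - s-bar) + f(s-bar) can be checked numerically. The dynamics function f below is made up for illustration, and the derivative is taken by finite differences rather than by hand.

```python
import math

# Hedged 1D sketch: approximate s_{t+1} = f(s_t) by the tangent line at a
# linearization point s_bar:  f(s) ~= f'(s_bar) * (s - s_bar) + f(s_bar).
def f(s):
    # Made-up nonlinear 1D "dynamics" for illustration.
    return math.sin(s) + 0.9 * s

def linearize(f, s_bar, eps=1e-6):
    """Return (slope, intercept) of the tangent-line approximation at s_bar."""
    slope = (f(s_bar + eps) - f(s_bar - eps)) / (2 * eps)   # numerical f'
    return slope, f(s_bar) - slope * s_bar

slope, intercept = linearize(f, s_bar=0.0)

def approx(s):
    return slope * s + intercept

print(approx(0.1), f(0.1))   # close near s_bar, diverging farther away
```

Near the linearization point the green-line approximation tracks f closely; far away the gap grows, which is exactly the picture on the whiteboard.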
[00:48:41] In the more general case, s_{t+1} is actually a function of both s_t and a_t; I'll write out that form in a second. In this example, s̄_t is usually chosen to be a typical value of the state. [00:49:20] In particular, if you expect your helicopter to be doing a pretty good job hovering near the state 0, then it would be pretty reasonable to choose s̄_t to be the vector of all zeros, because if you look at how good the green line is as an approximation of the blue line in a small region like this, the green line is actually pretty close to the blue line. And so if you choose s̄ to be the place where you expect your helicopter to spend most of its time, then the green line is not too bad an approximation of the true function, the true physics. [00:49:52] Or for the inverted pendulum: if you expect that your inverted pendulum will spend most of its time with the pole upright and the velocity not too large, then you'd choose s̄ to be maybe the zero vector, and so long as your inverted pendulum spends most of its time close to the zero state, the green line is not too bad an approximation of the blue line. So this is an approximation, but you try to choose the linearization point so that in this little region it's actually not that bad an approximation; it's only when you go really far away that there's a huge gap between the linear approximation and the true function. [00:50:32] Okay. And so in the more general case, where f is a function of both the state and the action, the input now becomes (s_t, a_t), because f maps from (s_t, a_t) to s_{t+1}.
[00:50:57] Then, instead of choosing just s̄, you choose a pair (s̄_t, ā_t), a typical state and action, around which you linearize the function. Let me just write down the formula for that. [00:51:29] If you linearize around the point given by s̄_t and ā_t, the typical values, then the form that you have is

s_{t+1} ≈ f(s̄_t, ā_t) + ∇_s f(s̄_t, ā_t) (s_t - s̄_t) + ∇_a f(s̄_t, ā_t) (a_t - ā_t).

[00:52:17] This is the generalization of the 1-D formula we wrote down just now. It says that the next state is approximately the value at the point around which you linearize, plus the gradient with respect to s times how much the state differs from the linearization point, plus the gradient with respect to the action times how much the action varies from ā. And this generalizes the equation we wrote earlier. [00:52:52] So this equation expresses s_{t+1} as a linear function, or technically an affine function, of the previous state and the previous action, with some matrices in between.
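As a sketch of this general case, the Jacobians ∇_s f and ∇_a f can be estimated column by column with finite differences; the pendulum-like dynamics here are illustrative values, not the lecture's model:

```python
import numpy as np

# Sketch: linearize a nonlinear simulator f(s, a) -> s_next around (s_bar, a_bar):
#   s_{t+1} ≈ f(s_bar, a_bar) + A (s_t - s_bar) + B (a_t - a_bar),
# where A = ∇_s f and B = ∇_a f, estimated by central finite differences.
def jacobians(f, s_bar, a_bar, eps=1e-5):
    n, d = len(s_bar), len(a_bar)
    A = np.zeros((n, n))
    B = np.zeros((n, d))
    for i in range(n):
        ds = np.zeros(n); ds[i] = eps
        A[:, i] = (f(s_bar + ds, a_bar) - f(s_bar - ds, a_bar)) / (2 * eps)
    for j in range(d):
        da = np.zeros(d); da[j] = eps
        B[:, j] = (f(s_bar, a_bar + da) - f(s_bar, a_bar - da)) / (2 * eps)
    return A, B

# Toy pendulum-like dynamics, made up for illustration.
def f(s, a):
    theta, omega = s
    return np.array([theta + 0.1 * omega,
                     omega + 0.1 * (np.sin(theta) + a[0])])

s_bar, a_bar = np.zeros(2), np.zeros(1)   # linearize about the upright state
A, B = jacobians(f, s_bar, a_bar)
```

Choosing (s̄, ā) where the system spends most of its time, here the upright zero state, is exactly the point the lecture makes about when this approximation is good.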
[00:53:08] And from this, after some algebraic manipulation, you can re-express this as s_{t+1} = A s_t + B a_t. There is just one other little detail, which is that you might need to redefine s_t to add an intercept term, because this is an affine function, with an intercept term, rather than a purely linear function. So from this formula, with a little bit of algebraic manipulation, you should really be able to figure out what the matrices A and B are; you might need to add an intercept term to the state, but since this is just an affine function, you can rewrite it in terms of matrices all the same. [00:54:00] All right, so I hope that makes sense: this linearization expresses s_{t+1} as a linear function of s_t and a_t. This is just a linear system; the way s_{t+1} varies is some matrix times s_t plus some matrix times a_t, and that's why, with some massaging, you can get it into this form for some matrices A and B. But because there are some constants floating around as well, you might need an extra intercept term multiplied into A to give you that extra constant. [00:54:39] So where we are: we now have that for these MDPs, either by learning a linear model with the matrices A and B, or by taking a nonlinear model and linearizing it like you just saw, you can hopefully model your MDP as a linear dynamical system, meaning that s_{t+1} is this linear function of the previous state and action, hopefully with a quadratic reward function, exactly in the form we saw just now. [00:55:18] So let me just summarize the problem we want to solve: s_{t+1} = A s_t + B a_t + w_t, where w_t is a noise term, and the reward is R(s_t, a_t) = -s_tᵀ U s_t - a_tᵀ V a_t, with U and V positive semi-definite.
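The intercept trick mentioned above, absorbing the affine constant by appending a 1 to the state, can be sketched like this (all matrices are made-up examples):

```python
import numpy as np

# The affine update s_{t+1} = A s_t + B a_t + c becomes purely linear if we
# append a constant 1 to the state (illustrative sketch):
#   [s_{t+1}; 1] = [[A, c], [0, 1]] [s_t; 1] + [B; 0] a_t
def augment(A, B, c):
    n, d = A.shape[0], B.shape[1]
    A_aug = np.block([[A, c.reshape(-1, 1)],
                      [np.zeros((1, n)), np.ones((1, 1))]])
    B_aug = np.vstack([B, np.zeros((1, d))])
    return A_aug, B_aug

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
c = np.array([0.0, -0.05])          # constant left over from the linearization point
A_aug, B_aug = augment(A, B, c)

s, a = np.array([0.2, -0.1]), np.array([0.3])
s_aug = np.append(s, 1.0)
assert np.allclose(A_aug @ s_aug + B_aug @ a, np.append(A @ s + B @ a + c, 1.0))
```

The last row of the augmented system just copies the constant 1 forward, so the augmented dynamics stay in the s_{t+1} = A s_t + B a_t form the rest of the derivation assumes.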
[00:55:45] And this is a finite-horizon MDP, so the total payoff is R(s_0, a_0) + R(s_1, a_1) + ... + R(s_T, a_T). [00:56:16] So let's derive a dynamic programming algorithm for this. The remarkable property of LQR, and what makes it so useful, is that if you are willing to model your MDP using this set of equations, then the value function is a quadratic function. Let me show you what I mean: if your MDP can be modeled as this type of linear dynamical system with a quadratic cost function, then it turns out that V* is a quadratic function, and so you can compute V* exactly. [00:56:56] We're going to develop a dynamic programming algorithm to compute the optimal value function V*, similar to what we did earlier today for the finite-horizon MDP with a finite set of states. It starts with the final time step and works backwards. [00:57:19] So V*_T(s_T) = max over a_T of R(s_T, a_T), which is max over a_T of -s_Tᵀ U s_T - a_Tᵀ V a_T. [00:57:45] Now a_Tᵀ V a_T is always greater than or equal to 0, because V is positive semi-definite, so the term -a_Tᵀ V a_T is never positive, and the optimal action is actually just to choose the action zero; the max over a_T is equal to -s_Tᵀ U s_T. And so this also tells us that π*_T, the final action, is the argmax: the optimal action is to choose the vector of zero actions at the last time step. [00:58:25] So this is the base case for the dynamic programming step of value iteration, where the optimal value at the last time step is to choose the action that maximizes the immediate reward.
[00:58:42] And that is maximized by choosing the action zero at the last time step. Okay. [00:59:09] Now, the key step of the dynamic programming implementation is the following: suppose that V*_{t+1}(s_{t+1}) is equal to a quadratic function. [01:00:03] (To the question: yes, without the minus sign that term is nonnegative, but you only get to maximize with respect to a_T, so the best you can do for that term is to drive it to zero. Thank you.) All right, now for the inductive case. We want to go from V*_{t+1} to computing V*_t, and the key observation that makes LQR work is this: let's suppose that V*_{t+1}, the optimal value function at the next time step, is a quadratic function. In particular, suppose V*_{t+1}(s_{t+1}) = s_{t+1}ᵀ Φ_{t+1} s_{t+1} + Ψ_{t+1}, parameterized by some matrix Φ_{t+1}, which is an n-by-n matrix, and some constant offset Ψ_{t+1}, which is a real number. [01:01:08] What we'll be able to show is that if this is true for V*_{t+1}, then after one step of dynamic programming, as you go from V*_{t+1} down to V*_t, the optimal value function V*_t is also going to be a quadratic function with the very same form, just with t+1 replaced by t. [01:01:36] And so in the dynamic programming step, we update V*_t(s_t) = max over a_t of R(s_t, a_t) plus, well, previously, when we had a discrete state space, we had a sum over s', or really over s_{t+1}, of P(s_{t+1} | s_t, a_t) times V*_{t+1}(s_{t+1}), and we were summing over the states.
[01:02:24] But now that we have a continuous state space, that formula becomes an expected value, with respect to s_{t+1} drawn from the state transition probabilities P(s_{t+1} | s_t, a_t), of V*_{t+1}(s_{t+1}). So the optimal value when the clock is at time t is: choose the action a_t that maximizes the immediate reward plus the expected value of your future rewards once the clock has ticked from time t to time t+1 and you are in state s_{t+1} at time t+1. [01:03:31] So let's see. This is a pretty beefy piece of algebra to do; I feel like showing the full result is at the level of complexity of a typical CS229 homework problem, which is quite hard. But let me just show the outline of the derivation and why the inductive step works, and if you want, you can work through the algebra details yourself at home. [01:04:24] So V*_t(s_t) is equal to the max over a_t of the immediate reward, -s_tᵀ U s_t - a_tᵀ V a_t, plus the expected value, with s_{t+1} drawn from a Gaussian with mean A s_t + B a_t and covariance Σ_w, of the quadratic term s_{t+1}ᵀ Φ_{t+1} s_{t+1} + Ψ_{t+1}. Remember, s_{t+1} = A s_t + B a_t + w_t, where w_t is Gaussian with mean 0 and covariance Σ_w; so if you choose an action a_t, that Gaussian is the distribution of the next state at time t+1. And the quadratic term inside the expectation is what, in the inductive hypothesis, we assumed V* to be at the next time step. [01:05:55] So this is a quadratic function, and the expectation is the expected value of a quadratic function with respect to s drawn from a Gaussian with a certain mean and a certain covariance.
[01:06:18] It turns out that this whole expression that I just circled simplifies into one big quadratic function of the action a_t. [01:06:50] And so in order to derive the optimal action, to derive π*, you take this big quadratic function, take derivatives with respect to a_t, set them to 0, and solve for a_t. If you go through all that algebra, you end up with the following formula: a_t = L_t s_t, where L_t = (V - Bᵀ Φ_{t+1} B)⁻¹ Bᵀ Φ_{t+1} A; I'm going to take that big matrix and denote it L_t. [01:07:54] And so this also shows that π*_t(s_t) = L_t s_t. So one takeaway from this is that under the assumptions we've made, a linear dynamical system with a quadratic cost function, the optimal action is a linear function of the state s_t. [01:08:44] And this is not a claim made through function approximation. I'm not saying that you can fit a straight line to the optimal action and that, if you fit a straight line, you get this linear function; that's not what we're saying. We're saying that of all the functions anyone could possibly come up with in the world, linear or nonlinear, the best possible action is linear. There is no approximation here. It's just a fact that if you have a linear dynamical system with a quadratic reward, the best possible action at any state is going to be a linear function of that state; notice that we haven't approximated anything. [01:09:42] Let me write this here. The other step is that if you take the optimal action and plug it into the definition of V*, then by simplifying, which again is quite a lot of algebra, you end up with these equations, where again I'll just write out the formulas: Φ_t = Aᵀ(Φ_{t+1} - Φ_{t+1} B (Bᵀ Φ_{t+1} B - V)⁻¹ Bᵀ Φ_{t+1}) A - U, and Ψ_t = tr(Σ_w Φ_{t+1}) + Ψ_{t+1}. [01:11:03] Okay. [01:11:20] All right.
[01:11:20] So to summarize the whole algorithm, let's put everything together. What these two equations do is allow you to go from V*_{t+1}, which is defined in terms of Φ_{t+1} and Ψ_{t+1}, and recursively work backwards to figure out what V*_t is. So Φ_t depends on Φ_{t+1}, and Ψ_t depends on Φ_{t+1} and Ψ_{t+1}. And this Σ_w is the covariance of w_t: that's a Sigma with a subscript w, not a summation over w; it's the covariance matrix of the noise terms we were adding at every step of the linear dynamical system. And there's a trace operator, the sum of the diagonal entries. [01:12:12] So just to summarize, here's the algorithm. You initialize Φ_T = -U and Ψ_T = 0 (sorry, that should be a capital T, the final time step); that's just taking the equation V*_T(s_T) = -s_Tᵀ U s_T and mapping it over, so those two values of Φ and Ψ define V* at time capital T. [01:13:06] Then you recursively calculate Φ_t and Ψ_t using Φ_{t+1} and Ψ_{t+1}, going from t = T-1, T-2, and so on, counting backwards down to 0. You calculate L_t as above, the formula we had over there saying how the optimal action is a function of the current state, depending on A, B, and Φ; and then finally π*_t(s_t) = L_t s_t. [01:14:13] And one really cool thing about this algorithm, about LQR, is that there is no approximation anywhere. You might need to make some approximation steps in order to approximate a helicopter as a linear dynamical system.
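The backward recursion summarized above can be sketched in code. The updates follow the standard finite-horizon LQR equations for the reward -sᵀUs - aᵀVa; the double-integrator matrices are illustrative values, not from the lecture:

```python
import numpy as np

# Sketch of the finite-horizon LQR backward recursion:
#   s_{t+1} = A s_t + B a_t + w_t,  w_t ~ N(0, Sigma_w)
#   R(s, a) = -s^T U s - a^T V a,   U, V positive semi-definite
# V*_t(s) = s^T Phi_t s + Psi_t, with Phi_T = -U and Psi_T = 0.
def lqr_backward(A, B, U, V, Sigma_w, T):
    Phi, Psi = -U, 0.0
    Ls = [None] * T                          # L_t for t = 0 .. T-1
    for t in reversed(range(T)):
        M = B.T @ Phi @ B - V                # negative definite when V > 0
        Ls[t] = -np.linalg.solve(M, B.T @ Phi @ A)   # a_t = L_t s_t
        Psi = np.trace(Sigma_w @ Phi) + Psi          # uses Phi_{t+1}
        Phi = A.T @ (Phi - Phi @ B @ np.linalg.solve(M, B.T @ Phi)) @ A - U
    return Ls, Phi, Psi                      # Phi, Psi parameterize V*_0

# Toy double-integrator example (illustrative values):
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
U = np.eye(2)
V = np.array([[1.0]])
Sigma_w = 0.01 * np.eye(2)
Ls, Phi0, Psi0 = lqr_backward(A, B, U, V, Sigma_w, T=50)
```

Note that L_t is computed from Φ_{t+1} before Φ is updated, matching the order of the recursion on the board, and that the Ψ line is the only place Σ_w ever appears.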
[01:14:27] You might do that either by fitting the matrices A and B to data, or by taking a nonlinear model and linearizing it, and you might need to restrict your choice of reward functions to this quadratic form; but once you've made those assumptions, none of this is approximate, everything is exact. [01:14:53] (Yes, that's right: the approximation steps are in getting your MDP into the form of a linear dynamical system with a quadratic reward, so that part is approximate; but once you've specified the MDP like that, all of these calculations are exact. We're not approximating the value function by a quadratic function; the value function is a quadratic function, and you're computing it exactly, and the optimal policy is a linear function, and you're computing that exactly.) Okay. [01:15:27] Before we wrap up, I want to mention one unusual fun fact about LQR.
unusual fun fact about LQR. [01:15:32] This is very specific to LQR, and it's convenient, but be careful that it doesn't give you the wrong intuition: it doesn't apply to anything other than LQR. [01:15:47] First, look at the formula for L_t. [01:15:58] Even though the whole point of doing all this work is to find the optimal policy (you want L_t so you can compute the optimal policy), notice that L_t depends on Phi_{t+1} but not on Psi_{t+1}. [01:16:25] And maybe that makes sense: when you take an action you get to some new state, and your future payoff is a quadratic function plus a constant, and it doesn't matter what that constant is. So to compute the optimal action you need to know Phi (actually Phi_{t+1}), but you don't need to know Psi_{t+1}. [01:16:53] If you look at the way we do the dynamic programming, the backwards recursion, you can implement a piece of code that doesn't bother to compute Psi at all. These are the two equations you use to update Phi and Psi, but you could delete that line of code and just not compute Psi, [01:17:22] because Phi_t depends on Phi_{t+1} but does not depend on Psi, and so you can implement the whole thing and compute the optimal policy and optimal actions without ever computing Psi. [01:17:41] Now, the funny thing about this is that the only place Sigma_w appears is in the update for Psi_t. [01:18:00] So if you do cross out that line and don't bother to compute Psi_t, then the whole algorithm doesn't even use Sigma_w. So one very interesting property of the LQR formalism is that the optimal policy does not depend on Sigma_w. [01:18:30] V* does depend on Sigma_w, because if the noise is very large, if there are huge gusts of wind blowing the helicopter all over the place, then the value will be worse; but pi* and L_t do not depend on Sigma_w. [01:18:54] So this is a property that's very specific to LQR; don't over-generalize it to other reinforcement learning algorithms. [01:19:06] But the intuition to take from this is: first, if you're actually applying this to a system, don't try too hard to estimate Sigma_w, because you don't actually need to use it, which is why, when we were fitting the linear model, I didn't talk much about how you'd actually estimate Sigma_w.
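The backward recursion just described can be sketched in a few lines of code. This is a generic illustration, not code from the course: it uses the standard cost-minimization convention (cost s_t^T Q s_t + a_t^T R a_t, dynamics s_{t+1} = A s_t + B a_t + w_t), so the signs differ from the lecture's reward-maximization version, and here P plays the role of Phi and K the role of the gain L_t. The point it demonstrates is the one above: every gain comes out of the Phi-recursion alone, and Sigma_w only ever touches the constant term Psi.

```python
import numpy as np

def lqr_backward(A, B, Q, R, T, Sigma_w=None):
    """Finite-horizon LQR backward recursion (cost-minimization convention).

    Returns the gains K_t (optimal action u_t = -K_t s_t) and the constant
    term psi of the value function. Note that the recursion for P, and hence
    every K_t, never touches Sigma_w: the noise covariance enters only psi,
    which you can skip computing entirely.
    """
    P = Q.copy()   # terminal value: V_T(s) = s^T Q s (an illustrative choice)
    psi = 0.0      # constant term of the value function (optional)
    gains = []
    for _ in range(T):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)  # gain for this step
        if Sigma_w is not None:
            psi = psi + np.trace(Sigma_w @ P)  # the ONLY place Sigma_w appears
        P = Q + A.T @ P @ (A - B @ K)          # Riccati update: no Sigma_w
        gains.append(K)
    gains.reverse()  # gains[t] is the gain for time step t
    return gains, psi
```

Running this with two very different noise covariances produces identical gains, which is exactly the "optimal policy does not depend on Sigma_w" property; only the value function's constant term changes.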
[01:19:24] In the LQR formulation it literally doesn't matter, in a mathematical sense, to the optimal policy you compute. [01:19:31] The second, maybe slightly useful, intuition to take away from this is that for a lot of MDPs, if you're building a robot, remember to add some noise to your system, but the exact noise you add matters less than one might think. So what I've seen, working on a lot of robots and a lot of MDPs, is: do add some noise to your system, and make sure your learning algorithm is robust to noise. The form of the noise you add does matter; I'm not saying it doesn't matter at all. In LQR it doesn't matter at all, and for other MDPs it does matter. [01:20:02] But the fact that you remember to add some noise is often, in practice, more important than the exact details, such as whether the noise is 10% higher or lower than the true noise, or even 100% higher or lower. [01:20:17] When I was training, say, the autonomous helicopter, the noise was something I paid a little bit of attention to, but I paid much more attention to making sure that the matrices A and B were accurate; a little sloppiness in estimating the noise is something an MDP, and your policy, can probably survive. Okay, let's take one last question. [01:20:37] [A student asks about a symbol on the board.] Oh, this? Sorry, yes, that's a B. Okay, cool. [01:20:56] Thanks, everyone. Let's break, and I will see you for the final lecture on Wednesday. Thanks, everyone.
================================================================================ LECTURE 020 ================================================================================
RL Debugging and Diagnostics | Stanford CS229: Machine Learning
Andrew Ng - Lecture 20 (Autumn 2018)
Source: https://www.youtube.com/watch?v=pLhPQynL0tY
---
Transcript
[00:00:03] All
right, everyone. So welcome to the final lecture of CS229 this quarter, or, I guess for the home viewers, welcome to the season finale. [00:00:23] What I'd like to do today is wrap up our discussion of reinforcement learning, and then that will conclude the class. [00:00:33] Over the last few lectures you saw a lot of math, so maybe as a brief interlude, here are some videos. This is a sample autonomous helicopter, a project that Pieter Abbeel and Adam Coates, some former students here and now some of the machine learning greats, were on when they were PhD students here, [00:01:05] using algorithms similar to the ones you learned in this class to make a helicopter fly. So, just to have fun, there's a video shot on top of one of the Stanford soccer fields. I was actually the cameraman that day, and as the camera zooms out you can see the trees and the sky. [00:01:38] It turns out this is a small radio-controlled helicopter, but when you're very far away you can't tell whether it's a small radio-controlled helicopter or a helicopter with people sitting in it. [00:01:55] This was on a kind of soccer field, the big grass field off Sand Hill Road, and in the high-rises across Sand Hill Road there was an elderly lady who lived in one of those apartments, and when she saw this she would call 911 and say, "Hey, this helicopter is about to crash," and then the firemen would come out, [00:02:18] and I think they were probably disappointed that there was no one for them to save. [00:02:27] One of the things I promised to do in the lecture on debugging learning algorithms was to go over the reinforcement learning example again, so let me just do that now, but with notation that I think you now understand. Oh, yes? [A student asks why the helicopter flies upside down.] As an aerobatic stunt, yeah; I don't think there's a good reason for flying upside down other than that you can. There are a lot of videos of the Stanford helicopter flying all sorts of stunts; go to heli.stanford.edu. [00:03:14] The Stanford autonomous helicopter did a lot more than fly upside down. It made some maneuvers that look aerodynamically impossible, such as a helicopter that looks like it's tumbling, just spinning randomly, but staying in the same place in the air. It's called the chaos maneuver, and when you look at it you go, "Wow, this thing is turning upside down, spinning around in the same direction, but it's just staying right there in the air and not crashing." [00:03:38] Maneuvers like that are what the very best human pilots in the world can fly with helicopters, and I think this was just a demonstration. [00:03:46] And I think a lot of this work wound up influencing later work on quadcopter drones in a few research labs. It was a difficult control problem, and it was one of those things you do when you're at a university: you want to work on the hardest problems. [00:04:07] But let me step through the debugging process that we went through as we were building a helicopter like this. When you're trying to get a helicopter to fly upside down, to fly stunts, you don't want to crash too often, so step one is to build a model, build a simulator, of the helicopter, much as you saw when we started to talk about fitted value iteration, and then choose the reward function. And it turns out that
specifying the reward function for staying in place is not that hard; a quadratic function like that works okay. But if you want the helicopter to fly aggressive maneuvers, it's actually quite tricky to specify what a good turn for a helicopter is. [00:04:52] Then what you do is run the reinforcement learning algorithm to maximize, say in the finite-horizon MDP formulation, the sum of rewards over T time steps, and you get a policy pi. And whenever you do this, the first time you do it you find that the resulting controller does much worse than the human pilot, and the question is what you do next. [00:05:17] This is almost exactly the slide I showed you last time, except I cleaned up the slide to use reinforcement learning notation rather than the slightly simplified notation you saw before you learned about reinforcement learning. [00:05:32] And again, if you work on a reinforcement learning problem yourself, there's a good chance you'll have to answer this question yourself, for whatever robot or factory-automation or stock-trading system, whatever it is you're trying to get to work with reinforcement learning: do you want to improve the simulator/model, do you want to modify the reward function, or do you want to modify the reinforcement learning algorithm? [00:05:57] Modifying the reinforcement learning algorithm includes things like playing with the discretization you're using, if you're taking a continuous-state MDP and discretizing it to solve a finite-state MDP formulation, or maybe choosing new features to use in fitted value iteration; or, instead of fitting a linear function approximator in fitted value iteration, maybe you want to use a bigger model, you know, a deep neural network. [00:06:26] So which of these steps is the most useful thing to do? Here's the analysis of those three things. If these three statements are true, then the learned controller should have flown well on the helicopter. [00:06:55] Those three statements correspond to the three things in yellow that you could work on. The problem could be that statement one is false, that the simulator isn't good enough; or that statement two is false; or statement three. (Oh, sorry, I think two and three are actually in the opposite order on the slide, but the three statements correspond to the three things in yellow.) Is the RL algorithm actually maximizing the sum of rewards? And is the reward function actually the right thing to maximize?
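Before the diagnostics, here is a concrete sketch of the quadratic hover reward mentioned above: penalize squared deviation from a target hover state, and optionally penalize large control inputs. The weight matrices and target here are illustrative assumptions, not the values used on the Stanford helicopter.

```python
import numpy as np

def hover_reward(s, a, s_target, Q, R):
    """Quadratic hover reward: zero exactly at the target state with zero
    control effort, and increasingly negative as the state drifts away or
    the controls grow. Q and R are positive semi-definite weight matrices
    (illustrative choices only)."""
    ds = s - s_target
    return -(ds @ Q @ ds + a @ R @ a)
```

A reward like this works fine for station-keeping; as the lecture says, specifying a reward for an aggressive maneuver such as a good turn is much harder.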
[00:07:35] So here are the diagnostics you could use. To see whether the helicopter simulator is accurate, first check whether the policy flies well in simulation. If your policy flies well in simulation but not in real life, then that shows the problem is with your simulator, and you should try to learn a better model for your helicopter. [00:08:00] If you're using a linear model with the matrices A and B, that is, s_{t+1} = A s_t + B a_t, you might try to fit a more accurate A and B, or maybe try a nonlinear model. [00:08:12] But if you find that the problem is not your simulator, that is, your policy flies poorly in simulation and poorly in real life, then here is the diagnostic I would use; let me show these two lines. Let pi_human be the human control policy: hire a human pilot, which we did.
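The linear simulator model s_{t+1} = A s_t + B a_t just mentioned is typically fit from logged flight trajectories by least squares. A minimal sketch, assuming you have one trajectory of states and actions (the lecture doesn't prescribe this exact code, and a real system would pool many trajectories and likely add an offset term):

```python
import numpy as np

def fit_linear_dynamics(states, actions):
    """Least-squares fit of s_{t+1} ~ A s_t + B a_t from one trajectory.

    states:  (T+1, n) array of observed states
    actions: (T, m) array of actions taken
    Returns (A, B).
    """
    X = np.hstack([states[:-1], actions])      # regressors [s_t, a_t], shape (T, n+m)
    Y = states[1:]                             # targets s_{t+1}, shape (T, n)
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)  # solve X @ W ~ Y in least squares
    n = states.shape[1]
    A, B = W[:n].T, W[n:].T                    # unstack [A B] from the solution
    return A, B
```

On noiseless data generated by a true (A, B), this recovers the matrices exactly; on real flight logs it gives the least-squares estimate the lecture's linear simulator would use.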
We were fortunate to have one of the best aerobatic helicopter pilots in America working with us, and using his radio-control signals he could make a helicopter fly upside down, tumble, and do flips, loops, and rolls. So we had a very good human pilot helping us fly the helicopter manually. [00:08:56] What you can do is test the following. This quantity here is just the payoff of the learned policy as measured by your reward function, so check whether the learned policy achieves a better or worse payoff than the human pilot. [00:09:23] That means: go ahead and let the learned policy fly the helicopter, then have the human fly the helicopter, and compute the sum of rewards on the sequences of states these two systems take the helicopter through, and just see whether the human or the learned policy achieves the higher payoff, the higher sum of rewards. [00:09:47] If the payoff achieved by the learning algorithm is less than the payoff achieved by the human, then this shows that the learned policy is not actually maximizing the sum of rewards, because whatever the human is doing, he or she is doing a better job of maximizing the sum of rewards than the learned policy. So this means you should consider working on the reinforcement learning algorithm to make it do a better job of maximizing the sum of rewards. [00:10:15] And then, on the flip side, if the inequality goes the other way, if the payoff of the RL policy is greater than the payoff of the human, then what that means is that the RL algorithm is actually doing a better job of maximizing the sum of rewards, but it's still flying worse. And what this tells you is that doing a really good job of maximizing the sum of rewards does not correspond to flying the helicopter the way you actually want.
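The simulator check and the two payoff comparisons above amount to a small decision rule. A hypothetical sketch (the function and its inputs are assumptions for illustration; in practice there is judgment involved at every step):

```python
def rl_diagnostic(flies_well_in_sim, flies_well_in_real,
                  payoff_learned, payoff_human):
    """Suggest which component of an RL system to work on next, following
    the lecture's diagnostics. The payoffs are sums of rewards computed by
    evaluating the reward function on logged state sequences."""
    if flies_well_in_sim and not flies_well_in_real:
        # Good in simulation, bad in reality: the simulator is off.
        return "improve the simulator"
    if payoff_learned < payoff_human:
        # The human beats the learned policy on our own reward function,
        # so the RL algorithm is not maximizing it well.
        return "improve the RL algorithm"
    # The learned policy out-scores the human yet flies worse:
    # the reward function is measuring the wrong thing.
    return "improve the reward function"
```

The three return values correspond to the three items in yellow on the slide.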
[00:10:44] And so that means maybe you should work on improving the reward function; the reward function is not capturing what's actually most important to flying the helicopter well, so you'd then modify the reward function. [00:10:57] So, for a typical workflow, let me describe what it feels like to work on a machine learning project like this. This was a big, multi-year machine learning project, and when you're working on a big, complicated machine learning project like this, the bottleneck moves around. Meaning: you build a helicopter, get a human pilot to fly it, gather data, run these diagnostics, and maybe the first time you do this you find, wow, the simulator is really inaccurate. Then you go work on improving the simulator for a couple of months, and every now and then you come back and rerun this diagnostic. [00:11:33] Maybe for the first two months of the project you keep on finding, yep, the simulator is not good enough, still not good enough; then, after working on the simulator for a couple of months, you may find that item one is no longer the problem. You might then find that item three is the problem: the simulator is now good enough, but when you run this diagnostic two months into the project you might say, wow, it looks like our algorithm is maximizing the reward function, but this is not good flying. So now the biggest bottleneck in the project is that the reward function is not good enough, and you might spend another one or two or three, or sometimes more, months working to improve the reward function. [00:12:14] You might do that for a while, and then, when the reward function is good enough, that exposes the next problem in your system, which might be that the RL algorithm isn't good enough. So the problem you should be working on actually moves around, and it's different in different phases of the project. When you're working on this, it feels like every time you solve the current problem, that exposes the next most important thing to work on; then you work on that, and solving it helps you identify and expose the next most important thing to work on; and you keep iterating, solving problems, until hopefully you get a helicopter that does what you want it to. [00:12:54] But I think teams that have the discipline to prioritize according to diagnostics like this tend to be much more efficient than teams that kind of go by gut feeling in selecting what to spend
your time on all right any questions about this oh sorry so again yeah I kind [00:13:31] about this oh sorry so again yeah I kind of want to say yes let me think [00:13:33] of want to say yes let me think yeah I wouldn't usually check step one [00:13:36] yeah I wouldn't usually check step one first and then if I think the simulator [00:13:38] first and then if I think the simulator is okay then look at steps two and three [00:13:41] is okay then look at steps two and three maybe one of the thing about when you [00:13:44] maybe one of the thing about when you work on these projects there is some [00:13:45] work on these projects there is some judgment involved so I think I'm [00:13:47] judgment involved so I think I'm presenting these things as those a rigid [00:13:49] presenting these things as those a rigid mathematical formula that's cut and dry [00:13:51] mathematical formula that's cut and dry this formula says now working on step [00:13:53] this formula says now working on step one then this one says now work on step [00:13:55] one then this one says now work on step three there is there is more judgment [00:13:58] three there is there is more judgment involved because when you run these [00:14:00] involved because when you run these things I'll say if you might say well [00:14:01] things I'll say if you might say well looks like the simulator is not that [00:14:03] looks like the simulator is not that good but it's kind of good and there's a [00:14:04] good but it's kind of good and there's a little bit ambiguous and oh looks like [00:14:06] little bit ambiguous and oh looks like you know and so that's what it often [00:14:08] you know and so that's what it often feels like and so a team would get [00:14:10] feels like and so a team would get together look for the evidence from all [00:14:12] together look for the evidence from all three steps and then say you know well [00:14:14] three steps and then say you know well maybe the simulator is not that good but 
it's maybe good enough, but boy, the reward function is really bad, let's focus on that. [00:14:22] So this isn't a hard and fast rule; there is some judgment needed to make these decisions. When running machine learning teams, often my teams will, you know, run these diagnostics, get together and look at the evidence, and then discuss and debate what's the best way to move forward. But I think the process of making sure you have that discussion and debate is much better than the alternative, which is, you know, someone just picks something very random and the team does that. [00:15:00] So, yeah, maybe I'll have the laptop up, you know, a little bit for fun, but a little bit because, to illustrate fitted value iteration, let me just show another reinforcement learning video. [00:15:18] Oh, and by the way, I think if
I look at the future of AI, the future of machine learning, you know, there's a lot of hype about reinforcement learning for game playing, which is fine; we all love computers playing computer games, that's a great thing, I think. [00:15:32] But I think some of the most exciting applications of reinforcement learning coming down the pipe will be robotics, in the next few years. Even though there are only a few success stories of reinforcement learning applied to robotics, there are more and more right now. [00:15:46] One of the trends I see, when we look at the academic publications and some of the things making their way into industrial environments, just based on the stuff I see my friends in many different companies and many different entities working on, is that in the next several years I think there will be a rise of reinforcement
learning algorithms applied to robotics. I think that will be one important area to watch out for. [00:16:12] But, um, so, you know, here's an old Stanford video. This is again just using reinforcement learning to get a robot dog to climb over obstacles like these. My friends that were less generous did not want to think of this as a robot dog; they thought it was more like a robot cockroach. But I think cockroaches... well, it does have six legs. [00:16:54] Yeah, but so how do you program a robot dog like this to climb over terrain? One of the key components, and this is work by Zico Kolter, now a Carnegie Mellon professor, another one of the machine learning greats: a key part of this was value function approximation, where the dog starts on the left and its goal is to get to the right. Then the approximate value function, and I'm simplifying a little bit here, but the approximate value function tells
it, given the 3D shape of the terrain. [00:17:33] The middle plot is a height map, where the different shades tell you how tall the terrain is. Given the 3D shape of the terrain, the dog learns a value function that tells it what the cost is of putting its feet at different locations on the terrain, and it learns, among other things, not to put its feet at the edge of a cliff, because then it's likely to slip off the edge of the cliff and fall over. [00:17:55] So hopefully this gives a visualization of what learning value functions, very, very complicated functions, I'll say, looks like. The state is very high-dimensional, so this is all kind of projected onto 2D space so you can visualize it, but this is what a simplified value function looks like for a robot like this. Okay. [00:18:41] All right, so with that, let me return to... um, there's just one class of algorithms I want to describe to you
today, which are called policy search algorithms. [00:18:57] Sometimes policy search is also called direct policy search. To explain what this means: so far, our approach to reinforcement learning has been to first learn or approximate the value function, you know, approximate V*, and then use that to learn, or at least hopefully approximate, π*. Right, so you saw value iteration; our approach so far to reinforcement learning was to estimate the value function, and then use that, you know, that equation with the arg max, to figure out what π* is. [00:19:38] So this is an indirect way of getting at a policy, because we would first try to figure out what the value function is. In direct policy search, we try to find a good policy directly, hence the term direct policy search, because you go straight for trying to find a good policy without the intermediate step
of finding an approximation to the value function. [00:20:06] So, um, let's see, I'm going to use as the motivating example the inverted pendulum, so that thing with the cart here. Let's say your actions are to accelerate left or to accelerate right; and you could also have stay still, accelerate softly, accelerate strongly, you could have more than two actions, but let's just say you have an inverted pendulum with two actions. [00:20:39] We'll talk about the pros and cons of direct policy search later, but if you want to apply direct policy search, the first step is to come up with the class of policies you're entertaining, or come up with the set of functions you use to approximate the policy. [00:20:57] So again, to make an analogy: when you saw logistic regression for the first time, you know, we kind of said that
we would approximate y with a hypothesis h whose form was governed by this sigmoid function, h_θ(x) = 1/(1 + e^(−θᵀx)). [00:21:19] And you remember, in week 2, when I first described logistic regression, I kind of pulled this out of a hat and said, oh yeah, trust me, let's use the logistic function; and then later we saw it's a special case of the generalized linear model. But, you know, we just had to write down some form for how we will predict y as a function of x. [00:21:39] So in direct policy search, we'll similarly have to come up with a form for π: just as we wrote down a functional form for the hypothesis h, in direct policy search we'll have to come up with a form for how we approximate the policy π. And so, you know, one thing we could do is say, well, maybe we'll approximate the action with some policy π, parametrized by θ, which is now a function of the state, and maybe it'll be 1 over 1 plus e to the negative
θ transpose s, you know, the state vector, where the state vector may be something like x, x dot, and then the angle φ and the angle rate φ dot; and if you'd like, maybe add an intercept term. [00:22:31] Okay, and I switched the angle from θ to φ to avoid a conflict in notation, since θ now denotes the policy parameters. Okay, um, this isn't quite the form of the policy we'll use, so let me make one more definition and then I'll show you a specific form of policy you can use; it's actually not quite this, we need to tweak it a little bit. [00:22:51] So the direct policy search algorithm will use a stochastic policy. This is a new definition: a stochastic policy is a function π(s, a) giving the probability of taking action a in state s. For the direct policy search algorithm that you see today, we're going to use stochastic policies, meaning that on every time step the policy will tell you what's the chance you want to accelerate
left versus what's the chance you want to accelerate right, and then you use a random number generator to select either left or right to accelerate on your inverted pendulum, depending on the probabilities output by the policy. [00:24:17] Okay, and so here's one example. Continuing with the inverted pendulum, here's one policy that might be reasonable, where you say that in a state s, the chance you take the accelerate-right action is given by this sigmoid function, π_θ(s, right) = 1/(1 + e^(−θᵀs)), and the chance that in the state s you take the accelerate-left action is given by π_θ(s, left) = 1 − 1/(1 + e^(−θᵀs)). [00:25:25] Okay, and here's one example of why this might be a reasonable policy. Let's say the state vector s is (1, x, x dot, φ, φ dot), where, you know, the angle of the inverted pendulum is the angle φ, and let's say for the sake of argument
that we set the parameter θ of this policy to be (0, 0, 0, 1, 0). [00:25:58] In this case, this is saying that θ transpose s is just equal to φ, right, because θ transpose s is just 1 times φ and everything else gets multiplied by zero. And so in this case, this says that the chance to accelerate to the right is equal to 1 over 1 plus e to the negative of how far the pole is tilted over to the right, and so this policy gives you the effect that the further the pole is tilted to the right, the more aggressively you want to accelerate to the right. [00:26:40] Okay, so this is a very simple policy; it's not a great policy, but it's not a totally unreasonable policy, which is: well, look at how far the pole is tilted to the right, apply the sigmoid function, and then accelerate to the left or right, you know, depending
on how far it's tilted to the right. [00:26:58] And because of this, right, this is really the chance of taking the accelerate-right action as a function of the pole angle φ. Now, this is not the best policy, because it ignores all the features other than φ; but if you were to instead set θ equal to, you know, (0, −0.5, 0, 1, 0), then the −0.5 now multiplies into the x position. [00:27:37] This new policy, if you have this value of θ, takes into account how far your cart is already to the right, where I guess x is the distance; and the further your cart is already... I guess your cart is on a set of wheels, right, it's on a railway track, and you don't want to fall off the end, you want to keep the cart kind of centered, you don't want to fall off the end of your table. But this now says: the further this is to the right already, the less likely you should be
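As a minimal sketch, the stochastic sigmoid policy described here, with the state layout (1, x, x dot, φ, φ dot) and θ = (0, 0, 0, 1, 0) from the lecture's example, might look like this (the function names and the specific state values are mine, for illustration):

```python
import numpy as np

def pi_right(theta, s):
    """pi_theta(s, right) = 1 / (1 + exp(-theta^T s)): chance of accelerating right."""
    return 1.0 / (1.0 + np.exp(-theta @ s))

def sample_action(theta, s, rng):
    """Flip a biased coin: 'right' with probability pi_right(theta, s), else 'left'."""
    return "right" if rng.random() < pi_right(theta, s) else "left"

# State vector (1, x, x_dot, phi, phi_dot); with theta = (0, 0, 0, 1, 0),
# theta^T s = phi, so the further the pole tilts right, the likelier we push right.
theta = np.array([0.0, 0.0, 0.0, 1.0, 0.0])
s_upright = np.array([1.0, 0.0, 0.0, 0.0, 0.0])   # phi = 0   -> P(right) = 0.5
s_tilted  = np.array([1.0, 0.0, 0.0, 0.5, 0.0])   # phi = 0.5 -> P(right) > 0.5
```

With the second parameter setting, θ = (0, −0.5, 0, 1, 0), the same code also becomes less likely to push right when the cart position x is already large, since the −0.5 multiplies x inside θᵀs.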
to accelerate to the right. [00:28:08] Okay, and so maybe this is a slightly better policy with this setting of the parameters. More generally, what you would like is to come up with five numbers that tell you how to trade off how much you should accelerate to the right based on the position, velocity, angle, and angular velocity, the current state of the cart, of the inverted pendulum. [00:28:34] And what a direct policy search algorithm will do is help you come up with a set of numbers that results in hopefully a reasonable policy for controlling the inverted pendulum, a policy that hopefully results in an appropriate set of probabilities that cause it to accelerate to the right whenever it's good to do so, and similarly to the left more often when it's good to do so. [00:28:58] So the goal is to find the five parameters θ, so that when we execute π_θ(s, a) we maximize over θ the expected
the expected value of R of s 0 is 0 plus dot dot plus [00:29:49] and so the reward function could be [00:29:51] and so the reward function could be negative 1 whenever the inverted [00:29:53] negative 1 whenever the inverted pendulum falls over and 0 whenever it [00:29:56] pendulum falls over and 0 whenever it stays up or whatever or something that [00:29:59] stays up or whatever or something that measures how well you betcha Panem is [00:30:00] measures how well you betcha Panem is doing but the goal of a direct policy [00:30:03] doing but the goal of a direct policy search algorithm is to choose a set [00:30:06] search algorithm is to choose a set parameters theta so that we actually the [00:30:08] parameters theta so that we actually the policy you maximize your expected payoff [00:30:10] policy you maximize your expected payoff and I'm gonna use to find a horizon [00:30:12] and I'm gonna use to find a horizon setting for the album that was helpful [00:30:15] setting for the album that was helpful today okay and then one one other [00:30:18] today okay and then one one other difference between policy search [00:30:21] difference between policy search compared to estimating the value [00:30:24] compared to estimating the value function is that indirect policy search [00:30:28] function is that indirect policy search here as 0 is a fixed initial State [00:30:39] it turns out that when we were [00:30:42] it turns out that when we were estimating the value function V saw you [00:30:46] estimating the value function V saw you found the best possible policy for [00:30:48] found the best possible policy for starting from any state right and [00:30:50] starting from any state right and there's kind of no matter what state you [00:30:51] there's kind of no matter what state you start from is simultaneously the best [00:30:53] start from is simultaneously the best possible policy for all states indirect [00:30:55] possible policy for all states indirect policy search we 
start from, it's simultaneously the best possible policy for all states. [00:30:55] In direct policy search, we assume that either there's a fixed start state, a fixed initial state s_0, or there's a fixed distribution over initial states, and we're going to try to maximize the expected reward with respect to your fixed initial state, or with respect to an initial probability distribution over what the initial state is. Okay, so that's one other difference. [00:31:33] All right, so to write this out, the goal is to maximize over θ the expected value of R(s_0, a_0) + R(s_1, a_1) + ... + R(s_T, a_T), given π_θ. And in order to simplify the math we write on this board today, I'm just going to set capital T equal to 1, in order to not carry around such a long summation. So I'm just doing, like, a two-time-step MDP just to simplify the derivation, but everything works, you know, just with a longer sum, if you have a more general version of T. [00:32:29] And so this term here, the
expectation, is equal to a sum over all possible state-action sequences, right; and again this would go up to s_T and a_T, but we've just set capital T equal to 1. [00:32:46] What's the chance your MDP starts out in some state s_0? So this is your initial state distribution, times the chance that in that state you take the first action a_0... oh, sorry, just let me write this out: so it's the chance of your MDP going through the state-action sequence, times the payoff. That's what it means to compute the expected value of the payoff. [00:33:26] And so, instead of writing out all of this sum, I'm just going to call this the payoff, and so this is equal to the sum over s_0, a_0, s_1, a_1 of P(s_0) π_θ(s_0, a_0) P_{s_0 a_0}(s_1) π_θ(s_1, a_1) times the payoff: the chance the MDP starts in state s_0, times the chance that in state s_0 you end up choosing the action a_0, times the chance, governed by the state transition probabilities, that you end up in state s_1, times the chance that in state
s_1 you end up choosing action a_1, and then times the payoff. Okay. [00:34:14] And so what we're going to be able to do is derive a gradient ascent algorithm, actually a stochastic gradient ascent algorithm, as a function of θ, to maximize this thing, to maximize the expected value of the payoff, and this is how we'll do direct policy search. Okay, so let me just write out the algorithm, and then we'll go through why the algorithm that I write down is maximizing this expected payoff. [00:35:06] So this algorithm is called the REINFORCE algorithm. The original REINFORCE algorithm had a few other bells and whistles, but this captures the core idea. What the REINFORCE algorithm does is the following: you're going to run your MDP, right, just, you know, run it for a trajectory of T time steps. So again, you know, I'm just going to... well,
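To make the expectation concrete: for a small enough two-step MDP you can literally enumerate every state-action sequence (s0, a0, s1, a1) and sum probability times payoff. The sketch below does exactly that; all the numbers, and the one-parameter toy policy, are invented for illustration:

```python
import numpy as np
from itertools import product

# Toy two-step MDP (capital T = 1): 2 states, 2 actions (0 = left, 1 = right).
P0 = np.array([0.8, 0.2])                    # initial state distribution P(s0)
P  = np.array([[[0.9, 0.1], [0.2, 0.8]],     # P[s, a, s'] transition probabilities
               [[0.7, 0.3], [0.1, 0.9]]])
R  = np.array([[0.0, 1.0], [1.0, 0.0]])      # reward R(s, a)

def pi_theta(theta, s):
    """Stochastic policy: probabilities of (left, right) in state s."""
    p_right = 1.0 / (1.0 + np.exp(-theta * (s - 0.5)))  # toy one-parameter policy
    return np.array([1.0 - p_right, p_right])

def expected_payoff(theta):
    """Sum over all sequences of P(s0) pi(s0,a0) P_{s0 a0}(s1) pi(s1,a1) * payoff."""
    total = 0.0
    for s0, a0, s1, a1 in product(range(2), repeat=4):
        prob = (P0[s0] * pi_theta(theta, s0)[a0]
                * P[s0, a0, s1] * pi_theta(theta, s1)[a1])
        total += prob * (R[s0, a0] + R[s1, a1])
    return total
```

Of course this enumeration blows up exponentially in T, which is one way to see why the algorithm about to be described instead samples trajectories and follows a stochastic gradient.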
actually, technically you would run it for T time steps, but, you know, let's just say for now we'll do only the thing in blue: we run it for one time step, so keeping capital T equal to 1. [00:35:58] And then you would compute the payoff, which equals R(s_0) + R(s_1), and then in the more general case, you know, + ... + R(s_T). And then you perform the following update, which is that θ gets updated as θ plus the learning rate α times [∇_θ π_θ(s_0, a_0) / π_θ(s_0, a_0) + ∇_θ π_θ(s_1, a_1) / π_θ(s_1, a_1)], and then times the payoff. [00:36:58] And again, I'm just setting capital T equal to 1; if capital T were bigger, you would just sum this all the way up to time T. So that's the algorithm; that's one iteration of the REINFORCE algorithm. [00:37:15] On each iteration of the REINFORCE algorithm, you will take your robot, take your inverted pendulum, run it through T time steps executing your current policy: choose actions randomly according to the current
[00:37:30] stochastic policy, using the current values of the parameters theta; compute the total sum of rewards you receive, which we call the payoff; and then update theta using this funny formula. Now, on every iteration of this algorithm you're going to update theta, and it turns out that this is a stochastic gradient ascent algorithm. You remember when we talked about linear regression, right, you saw me draw pictures like this: if there's a global minimum, then gradient descent would just, you know, take a straight path to the minimum, but stochastic gradient descent would take a more random path towards the minimum, and it kind of also bounces around there; maybe it doesn't quite converge unless you slowly decrease the learning rate alpha. So that's what we have for stochastic gradient descent for linear regression. What we'll see in a minute is that REINFORCE is, likewise, a stochastic gradient ascent algorithm,
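The per-iteration update just described — run one trajectory, then theta := theta + alpha · (sum over t of the gradient of log pi_theta(s_t, a_t)) · payoff — can be sketched in code. This is a minimal illustration, not from the lecture: the two-action sigmoid policy and all function names are my assumptions.

```python
import numpy as np

def p_right(theta, s):
    """Sigmoid stochastic policy: probability of the 'right' action (assumed form)."""
    return 1.0 / (1.0 + np.exp(-theta @ s))

def grad_log_pi(theta, s, a):
    """Gradient of log pi_theta(a | s) for the two-action sigmoid policy.

    For a = 1 ('right'): d/dtheta log p       = (1 - p) * s
    For a = 0 ('left') : d/dtheta log (1 - p) = -p * s
    Both cases collapse to (a - p) * s.
    """
    return (a - p_right(theta, s)) * s

def reinforce_update(theta, trajectory, payoff, alpha=0.01):
    """One REINFORCE iteration: theta += alpha * sum_t grad log pi(a_t|s_t) * payoff."""
    grad = sum(grad_log_pi(theta, s, a) for s, a in trajectory)
    return theta + alpha * grad * payoff
```

With a positive payoff, the update raises the probability of the actions actually taken on that trajectory; with a payoff near zero it barely moves — matching the "noisy but unbiased" behavior discussed next.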
[00:38:33] meaning that each of these updates is random, because it depends on what was the state-action sequence you just saw and what was the payoff you just saw. But what we'll show is that, on expectation, the average update — this update to theta, this thing you're adding to theta — on average, this update here is exactly in the direction of the gradient. So, on average — yeah, because every time through this loop you're making a random update to theta, and this is random and noisy because it depends on this random state sequence, right; the sequence is random because of the state transition probabilities, and also because of the fact that you're choosing actions randomly. But the expected value of this update, as you'll see in a little bit, turns out to be exactly
the direction of the gradient, which is why the REINFORCE algorithm is a gradient ascent algorithm. So let's show that now.
[00:40:04] All right, so what we want to do is maximize the expected payoff, which is the formula we derived up there, and so we're going to want to take derivatives with respect to theta of the expected payoff. I'm just going to copy that formula from up there: that's the chance of going through that state-action sequence, times the payoff. And so we want to take derivatives of this, so that, you know, we can go uphill using gradient ascent. We're going to do this in a few steps. First, I want to remind you how you take the derivative of a product of three things. So let's say that you have three functions, f of theta times g of theta times h of theta. So, by the product rule — the product rule for derivatives from calculus — the derivative of the product
of three things is obtained by, you know, taking the derivative of each of them one at a time. So this is f prime of theta times g of theta times h of theta, plus f times g prime times h, plus f times g times h prime. So the product rule from calculus is that if you want to take the derivative of a product of three things, then you take the derivatives one at a time, and you end up with a sum of three terms, right? And so we're going to apply the product rule to this, where here we have two different terms that depend on theta, and so when we take the derivative of this thing with respect to theta, we're going to have two terms that correspond to taking the derivative of this one and taking the derivative of that one. And so this derivative is equal to — so the first term is the sum over all the state-action sequences; you have s0, and then, let's see, so now we have pi theta — excuse me, the derivative with respect to theta of pi theta of s0, a0 —
[00:43:37] and then plus — oh, and then times the payoff, right; so the whole thing here is then multiplied by the payoff, OK? So we just applied the product rule from calculus, where for the first term in the sum we took the derivative of this first thing, and for the second term in the sum we took the derivative of the second thing. And now I'm going to use one more algebraic trick, which is: I'm going to multiply and divide by that same term, and then multiply and divide by the same thing here, right? So, lots of multiplying and dividing by the same thing. And then finally — so now the final step is I'm going to factor out these terms I'm underlining, because these terms I underlined, this is just, you know, the probability of the whole state sequence, right? And again, for the orange thing: these two orange things multiplied together are equal to that, for each thing in that
box as well.
[00:45:15] And so the final step is to factor out the orange box, which is just P of s0, a0, s1, a1 — right, that's the thing I boxed up in orange — times those two terms involving the derivatives.
[00:46:03] OK, and — right, where I guess this term goes there and this term goes there. And so this is just equal to — well, if you look at the REINFORCE algorithm that we wrote down, this is just equal to the sum over, you know, all the state-action sequences of the probability of the sequence times the gradient update — [00:47:00] I guess I'm running out of colors, but this is the gradient update — and that's just equal to this thing, OK? So what this shows is that even though on each iteration the direction of the gradient update is random, the expected value of how you update the parameters is exactly equal to the derivative of your objective, your expected total payoff.
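The board derivation just described, written out for the T = 1 case (notation reconstructed to match the lecture's pi_theta(s, a) convention; the factors P(s0) and P_{s0 a0}(s1) are the state-transition probabilities, the parts that do not depend on theta):

```latex
\begin{aligned}
\nabla_\theta \, \mathbb{E}[\text{payoff}]
  &= \nabla_\theta \sum_{(s_0,a_0,s_1,a_1)}
     P(s_0)\,\pi_\theta(s_0,a_0)\,P_{s_0 a_0}(s_1)\,\pi_\theta(s_1,a_1)\cdot \text{payoff} \\
  &= \sum_{(s_0,a_0,s_1,a_1)}
     \Big[ P(s_0)\,\big(\nabla_\theta \pi_\theta(s_0,a_0)\big)\,P_{s_0 a_0}(s_1)\,\pi_\theta(s_1,a_1) \\
  &\qquad\qquad
     + P(s_0)\,\pi_\theta(s_0,a_0)\,P_{s_0 a_0}(s_1)\,\big(\nabla_\theta \pi_\theta(s_1,a_1)\big)
     \Big]\cdot \text{payoff}
     \qquad \text{(product rule)} \\
  &= \sum_{(s_0,a_0,s_1,a_1)} P(s_0,a_0,s_1,a_1)
     \left[ \frac{\nabla_\theta \pi_\theta(s_0,a_0)}{\pi_\theta(s_0,a_0)}
          + \frac{\nabla_\theta \pi_\theta(s_1,a_1)}{\pi_\theta(s_1,a_1)} \right]\cdot \text{payoff}
     \qquad \text{(multiply and divide)} \\
  &= \mathbb{E}\!\left[ \big( \nabla_\theta \log \pi_\theta(s_0,a_0)
          + \nabla_\theta \log \pi_\theta(s_1,a_1) \big)\cdot \text{payoff} \right],
\end{aligned}
```

which is exactly the expected value of the REINFORCE update direction.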
So we started by saying that this formula is your expected total payoff; then we asked, what's the derivative of your expected total payoff; and we found that the derivative of your expected total payoff — the derivative of the thing you want to maximize — is equal to the expected value of your gradient update. And so this proves that, on average — you know, if you have a very small learning rate, you end up averaging over many steps — but on average, the update that REINFORCE takes on every iteration is exactly in the direction of the derivative of the expected total payoff that you're trying to maximize. Any questions about this? Yeah?
[00:48:35] [Student question] Oh — is this dependent on the choice of this function? This is true for any form of stochastic policy, where the definition is that pi theta of s0, a0 has to be the chance of taking that action in that state. But this could be
any function you want: [00:48:56] it could be a softmax, a logistic function of many different, complicated features — it could be any continuous, differentiable function. And actually, one of the reasons we shifted to stochastic policies was that previously we just had two actions, either left or right, and you can't define a derivative over a discontinuous function like "either left or right". But now we have a probability that shifts smoothly between the probability of going left and the probability of going right, and by making this a continuous function of theta, you can then take derivatives and find the gradient of this function.
[00:49:51] So, another way to train a helicopter controller is to use supervised learning, where you have a human expert — so you can actually have a human pilot demonstrate, and just say "in this state, take this action" — and then you use supervised
learning to just learn directly a mapping from the state to the action. I think this — I don't know — this might be OK for low-speed helicopter flight; I don't think it works super well. I bet you could do this and not crash a helicopter, but to get the best results I wouldn't use this approach; it turns out that for some of the maneuvers, the learned controllers fly better than human pilots as well.
[00:50:37] Oh, and so for other types of policies — [inaudible] [Applause]
[00:51:01] So, direct policy search also works if you have continuous-valued actions and you don't want to discretize the action. So here's a simple example: let's say a is a real number, such as the magnitude of the force you apply to accelerate left or right — instead of discretizing, for your inverted pendulum you want to output a continuous number for how hard you push left or right — or, for a self-driving car, maybe the action is the steering angle, which is a real-valued
number. So a simple policy would be: a equals theta transpose s, plus Gaussian noise. And if, just for the purpose of training, you're willing to pretend that your policy is to apply the action theta transpose s plus a little bit of Gaussian noise, then the whole framework of REINFORCE — this type of gradient ascent — will also work. And I guess when actually implementing this, you'd probably turn off the Gaussian noise at run time; I know there are little tricks like that as well.
[00:52:13] Um, so let's see some pros and cons: when should you use direct policy search, and when should you use value iteration or a value-function-based type of approach? It turns out there are two settings where direct policy search works much better. One is if you have a POMDP — the "P.O." in this case stands for partially observable, and that's it.
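Back to the continuous-action policy just mentioned — a equals theta transpose s plus Gaussian noise — the log-probability gradient that REINFORCE needs has a closed form. A minimal sketch, not from the lecture; the fixed noise scale sigma and all names are my assumptions:

```python
import numpy as np

def sample_action(theta, s, sigma=0.1, rng=None):
    """Linear-Gaussian policy: a ~ Normal(theta^T s, sigma^2)."""
    rng = rng or np.random.default_rng()
    return theta @ s + sigma * rng.normal()

def grad_log_pi_gaussian(theta, s, a, sigma=0.1):
    """For a ~ Normal(theta^T s, sigma^2):
    log pi = -(a - theta^T s)^2 / (2 sigma^2) + const, so
    grad_theta log pi = (a - theta^T s) / sigma^2 * s.
    """
    return (a - theta @ s) / sigma**2 * s
```

This gradient plugs straight into the REINFORCE update: actions that came out above the current mean and earned a high payoff pull the mean action toward them.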
[00:52:53] For example, you know, for the inverted pendulum: there's the pole angle phi, you have the cart, and this is your position x, and we've been saying that the state space is x, x dot, phi, phi dot, right? But let's say that the sensors on this inverted pendulum allow you to measure only the position and only the angle of the inverted pendulum. So you might have an angle sensor, you know, down here, and you might have a position sensor for your inverted pendulum, but maybe you don't know the velocity and you don't know the angular velocity, right? So this is an example of a partially observable Markov decision process, because — what this means is that on every step you do not get to see the whole state, because you don't have enough sensors to tell you exactly what the state of the entire system is. So in a partially observable MDP, at each step you get a partial and
potentially noisy measurement of the state, right, and then you have to take actions — you have to choose an action a — using these partial and potentially noisy measurements. Which is — maybe you only observe the position and the angle, but your sensors aren't even totally accurate, so you get a slightly noisy estimate of the position and a slightly noisy estimate of the angle, and you just have to choose an action based on your noisy estimates of just two of the four state variables.
[00:54:50] It turns out that there's been a lot of academic literature trying to generalize value-function-based approaches to POMDPs, and there are very complicated algorithms in the literature for trying to apply value-function-based approaches to POMDPs, but those algorithms, despite their very high level of complexity, are not widely in production. But if you use a direct policy search algorithm,
then [00:55:17] there's actually very little problem. Oh, let me just write this down. So let's say the observation is: on every time step you observe y equals (x, phi) plus noise, right? So you just don't know what the state is. And in a POMDP you cannot approximate the value function — or, even if you knew what V star was, you can't compute pi star, because — I mean, maybe you know what pi star is; let's say you could compute V star and pi star — but if you don't know what the state is, you can't apply pi star to the state, because the state isn't observed. So how do you choose an action? If you're using direct policy search, then here's one thing you could do, which is: you can say that pi theta of — given an observation, the chance of going to the right given your current observation — is equal to 1 over 1 plus e to the negative theta transpose y, where I guess y can be, you know, (1, x plus noise, phi plus noise).
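A minimal sketch of that observation-based policy for the pendulum (the noise level, the leading intercept feature, and all names are my assumptions, not the lecture's):

```python
import numpy as np

def observe(x, phi, noise_std=0.05, rng=None):
    """Partial, noisy observation: position and angle only (no velocities),
    with an intercept feature prepended, y = (1, x + noise, phi + noise)."""
    rng = rng or np.random.default_rng()
    return np.array([1.0,
                     x + noise_std * rng.normal(),
                     phi + noise_std * rng.normal()])

def p_right(theta, y):
    """pi_theta(right | y) = 1 / (1 + exp(-theta^T y))."""
    return 1.0 / (1.0 + np.exp(-theta @ y))
```

REINFORCE then runs exactly as before, just with the observation y in place of the full state s.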
[00:56:31] And so you could run REINFORCE using just the observations you have — you still stochastically, randomly choose an action — and nothing in the framework we talked about prevents this algorithm from working. And so direct policy search just works very naturally even if you have only partial observations of the state. And, more generally, instead of plugging in the direct observations, this can be any set of features.
[00:57:02] Let me just make a side comment — for those who don't know what Kalman filters are, don't worry about it — but one common way of using direct policy search would be to use some state estimate, such as from a Kalman filter or a probabilistic graphical model or something, to use your historical estimates: don't just look at your one set of measurements now, but look at all your historical measurements, and then there are algorithms, such as something called a Kalman filter, that
let you estimate, from whatever you have, the current state — the full state vector. You can plug that full state vector estimate into the features you use to choose an action; that's a common design paradigm. If you don't know what a Kalman filter is, don't worry about it, but you take your noisy measurements and estimate the full state from them — yeah, that's one common paradigm, where you use your partial observations to estimate the full state and plug that in as features to the direct policy search.
[00:57:49] OK, so that's one setting where direct policy search just applies, in a way where value function approximation is very difficult to even get to apply. Now, one last thing — one last consideration for whether to apply a policy search algorithm or a value-function approximation algorithm. Oh — it turns out the REINFORCE algorithm is actually very inefficient, as in, you end
[00:58:21] up — you know, when you look at research papers on the REINFORCE algorithm, it's not unusual for people to run the REINFORCE algorithm for like a million iterations or ten million iterations just to train it. It turns out the gradient estimates for the REINFORCE algorithm, even though their expected value is right, are actually very noisy, and so if you train with the REINFORCE algorithm you end up just running it for a very, very, very long time. It does work; it's just a pretty inefficient algorithm. So that's one disadvantage of the REINFORCE algorithm: the gradient estimates on expectation are exactly what you want them to be, but there's a lot of variance in the gradient, so you have to run it for a long time with a very small learning rate. But one other reason to use direct policy search is — it comes down to asking yourself: do you think
pi star is simpler, or is V star simpler? Right? And so here's what I mean. In robotics there are sometimes what we call low-level control tasks. One way to think of low-level control tasks: flying a helicopter — hovering a helicopter — is an example of a low-level control task. And one way to informally think of low-level control tasks is a really skilled human, you know, holding a joystick, controlling this thing, making seat-of-the-pants decisions, right? Those are kind of almost instinctual: in a tiny fraction of a second, almost by feel, you can control the thing. Those tend to be low-level control tasks — whether it's a person holding a joystick, a skilled person balancing that inverted pendulum, or, you know, steering a helicopter; those are low-level control tasks. In contrast, playing chess is not a low-level control task, because for
[01:00:18] the most part, being a very good chess player is not really a seat-of-the-pants, you know, make-a-decision-in-0.1-seconds kind of thing, right? You kind of have to think multiple steps ahead. [01:00:29] In low-level control tasks, there's usually some control policy that is quite simple, a very simple function mapping states to actions, that's pretty good. And so that allows you to specify a relatively simple class of functions for π*, and direct policy search would be relatively promising for tasks like those. Whereas in contrast, if you want to play chess, or Go, or do these things, you have multiple steps of reasoning. I think that if you're driving a car on a straight road, that's a low-level control task: you just look at the road and, you know, turn the steering wheel a little bit to stay on the road. So that's a low-level control task. But if you are planning how to, you know, overtake this car and avoid
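The "very simple function mapping states to actions" point can be made concrete with a small sketch. All of this is an illustrative assumption, not the lecture's setup: a toy linearized inverted pendulum with made-up constants, and plain random hill-climbing standing in for direct policy search over a linear policy.

```python
import numpy as np

DT, G = 0.02, 9.8  # assumed time step and gravity for the toy dynamics

def rollout(w, steps=500):
    """Return how many steps the linear policy u = w . s keeps the pole near upright."""
    s = np.array([0.05, 0.0])  # state [angle, angular velocity], slightly off-vertical
    for t in range(steps):
        u = float(w @ s)                    # the simple linear state-to-action map
        angle_acc = G * s[0] - u            # linearized pendulum plus control torque
        s = s + DT * np.array([s[1], angle_acc])
        if abs(s[0]) > 0.5:                 # pole fell over
            return t
    return steps

# Crude direct policy search: random hill-climbing over the two policy weights.
rng = np.random.default_rng(0)
best_w = np.zeros(2)
best_score = rollout(best_w)
for _ in range(300):
    w = best_w + rng.normal(scale=2.0, size=2)
    score = rollout(w)
    if score > best_score:
        best_w, best_score = w, score

print(best_score)  # a good linear gain keeps the toy pendulum up far longer than w = 0
```

The point is that the policy class here is just two numbers; no value function is ever computed, which is why direct policy search is attractive for this kind of low-level control task.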
that other car, whether there's a pedestrian or a bicycle along the way, then that's less of a low-level control task, and that requires more multi-step reasoning, right? I guess it depends how aggressive a driver you are; driving on the highway, you know, may require more or less multi-step reasoning, where you want to overtake this car before the truck comes into this lane. So that type of thing is more multi-step reasoning, and problems like that tend to be difficult for a very simple function, like a linear function, to be a good policy for. And for those things, playing chess, playing Go, playing checkers, a value function approximation approach may be more promising. Okay, so any questions about this? [01:02:02] Oh, and so again, on helicopter flight: actually, my first attempts at flying helicopters used direct policy search, because flying helicopters is a seat-of-the-
pants thing. But then when you try to fly more complex maneuvers, you end up using something maybe closer to a value function approximation method. So if you want to fly a very complicated maneuver: the video you saw just now, the helicopter flying upside down, the algorithm implemented for that particular video was a different policy search algorithm, right, not exactly this one, a little bit different, but that was still a policy search algorithm. But if you want the helicopter to fly a very complicated maneuver, then you need something maybe closer to the value function approximation methods. [01:02:48] And there is exciting research on how to blend direct policy search approaches together with value function approximation approaches. So actually, AlphaGo, you know, the Go-playing program written by DeepMind: one of the reasons AlphaGo worked was that there was a blend of ideas
from both of these types of literature, which enabled it to scale to a much bigger system, to play Go in a very, very impressive way. [01:03:16] All right, any questions about this? [01:03:26] Alright, um, so just some final application examples. You know, reinforcement learning today is making strides, let's see. So there's a lot of work on reinforcement learning for game playing: checkers, chess, Go. That is exciting. Um, reinforcement learning today is used in a growing number of robotics applications, I think for controlling a lot of robots. If you go to robotics conferences, if you look at some of the projects being done by some of the very large companies that make very large machines, right, I have many friends in multiple, you know, large companies making large machines that are increasingly using reinforcement learning to control them. There is fascinating work using reinforcement learning for
optimizing factory deployments. There's academic research, we're still in the research stage as far as I know, I mean, maybe some of it's deployed, on using reinforcement learning to build chatbots, and actually on using reinforcement learning to build an AI-based guidance counselor, for example, right, where the actions you take are what you say to students, and then the reward is, you know, do you manage to help a student navigate their coursework or navigate their career. [01:04:51] And reinforcement learning is also starting to be applied to healthcare, where one of the keys of reinforcement learning is this sequential decision-making process, right, where you have to take a sequence of decisions that may affect your reward over time. And I think in healthcare there is work on medical planning, where the goal is not, you know, send
you to get a blood test and then we're done, right? [01:05:17] In complicated medical procedures we might first get a blood test, then based on the outcome of the blood test we might send you to get a biopsy or not, all right, or ask you to take a drug and then come back in two weeks. There's this very complicated sequential decision-making process for treatment of complicated healthcare conditions, and so there's fascinating work on trying to apply reinforcement learning to that sort of multi-step reasoning, where it's not "we send you for treatment and then never see you again for the rest of your life"; it's more "here's the first thing you do, then come back; let's see what state you get to after taking this blood test; let's see what state you get to after trying a drug", and then coming back in a week to see what has happened to the symptoms. So I think that these are all
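The blood-test-then-biopsy-or-drug plan described above is a small sequential decision problem. As a toy sketch with entirely invented numbers (none of these probabilities or action names are from the lecture), backward induction picks the best second-stage action for each possible test result and then values the plan as a whole:

```python
# Assumed success probabilities P(good outcome | test result, action); invented values.
SUCCESS = {
    "positive": {"biopsy": 0.9, "drug": 0.6},
    "negative": {"biopsy": 0.5, "drug": 0.8},
}
P_POSITIVE = 0.3  # assumed probability the blood test comes back positive

# Backward induction, step 1: the best action for each possible test result.
policy = {result: max(actions, key=actions.get) for result, actions in SUCCESS.items()}

# Step 2: the value of ordering the test is the expected value under that policy.
value = (P_POSITIVE * SUCCESS["positive"][policy["positive"]]
         + (1 - P_POSITIVE) * SUCCESS["negative"][policy["negative"]])

print(policy)  # {'positive': 'biopsy', 'negative': 'drug'}
print(value)   # expected success probability, roughly 0.83
```

This is exactly the structure that makes the problem sequential: the second decision depends on the state reached after the first, which is what distinguishes it from a one-shot prediction.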
sectors where reinforcement learning is making inroads. [01:06:01] Or even, actually, stock trading, okay, maybe not the most inspiring one, but one of my friends on the East Coast was working on this. And actually, if you or your parents invest in mutual funds, this may be being used to buy and sell shares for them today, depending on what bank they're investing with; I know which bank is doing this, but I won't say it out loud. [01:06:23] But if you want to buy or sell, you know, say a million shares of stock, a very large volume of stock, you may not want to do it in a very public way, because that will affect the price of the shares, right? If everyone knows that a very large investor is about to buy a million shares, or buy ten million shares or whatever, that will cause the price to increase, and this is a disadvantage to the person wanting to buy the shares. And so there's been very interesting work on using reinforcement
learning to decide the sequence in which you'll buy, how to buy the stock in small lots. [01:07:00] This trading market is called dark pools; you can Google it if you're curious. These allow you to buy, or sell, a very large lot of shares without affecting the market price too much, because the way you affect the market price always breaks against you; it's always bad for you, right? So this work plays out there as well. [01:07:22] So anyway, I think, um, there are many applications. I personally think that one of the most exciting areas for reinforcement learning will be robotics, but, well, we'll see what happens over the next few years. [01:07:35] All right, so let's see, we have just five more minutes, so just to wrap up: I think, you know, we've gone through quite a lot of stuff, I guess, from supervised learning to learning
theory and advice for applying learning algorithms, to unsupervised learning, which was, let's see, k-means, PCA, EM, mixtures of Gaussians, factor analysis, independent component analysis, to most recently reinforcement learning with value function approaches, fitted value iteration, and policy search. So it feels like, it feels like you've seen a lot of learning algorithms. Um, go ahead. [01:08:18] [Student: how does adversarial learning compare to reinforcement learning?] I think of those as pretty distinct notions, yeah. Yeah, so I think, and again, actually, I know a lot of non-publicly-known facts about the machine learning world, but one of the things that I happen to know is that some of these ideas of adversarial learning, you know, can you take a picture and change it a very little bit, by tweaking a bunch of pixel values not visible to the human eye, that fools a learning algorithm into
thinking that this picture is actually a cat when it's clearly not a cat, or whatever. So I actually know that there are attackers out in the world today using techniques like that to attack, you know, websites, to try to fool, you know, some of the websites I'm pretty sure you guys use, and fool their anti-spam, anti-fraud, anti-undermining-democracy types of algorithms into making decisions. So it's an exciting time doing machine learning right now, that we get to fight battles like these. Okay. [01:09:28] And I think, you know, with the things you guys have learned in machine learning, I think all of you are now very knowledgeable, right? I think all of you are experts in all the ideas of core machine learning, and I hope that, um... I think when we look around the world, there are so many worthwhile projects you could do with machine learning, and the
number of you who know these techniques is so small, that I hope that you take these skills. Oh, and some of you will go, you know, build businesses and make a lot of money; that's great. Some of you will take these ideas and help drive basic research at Stanford or at other institutions; I think that's fantastic. But I think whatever you're doing, the number of worthwhile projects on the planet is so large, and the number of you that actually know how to use these techniques is so small, that I hope that you take these skills you're learning from this course and go and do something meaningful, and do something that helps other people. [01:10:20] I've even seen, living in Silicon Valley, that there are a lot of ways, you know, to build very valuable businesses, and some of you will do that, and that's great, but I hope that you do it in a way that helps other people. I think over the past few years we've seen, I think, that in
Silicon Valley, maybe ten years ago, the contract we had with society was that people would trust us with their data, and then we'll use their data to help them. But I think in the past year that contract feels like it has been broken, and the world's faith in Silicon Valley has been shaken. But I think that places even more pressure on all of us, on all of you, to make sure that the work you go out into the world to do is work that actually is respectful of individuals, respectful of individuals' privacy, is transparent, open, and that ultimately is helping drive forward humanity, or helping people: helping drive forward basic research, or building products that actually help people rather than exploit their foibles for profit. [01:11:25] So to that end, I hope that all of you will take the superpowers that you now have and, um, go out and do meaningful work. And, let's see,
and, I think, oh, and lastly, just, personally, I want to, you know, thank all of you. On behalf of the TAs, the whole teaching team, and myself, I want to thank all of you for your hard work. Sometimes, going over homework problems at the grading parties, we'd go, wow, they got that problem, I thought that was really hard; or at your project milestones go, hey, that's really cool, I look forward to seeing your final project results at the final poster session. [01:11:58] So I know that all of you have worked really hard, and if you didn't, don't tell me; let me keep thinking that. But I want to make sure you know... I think it wasn't that long ago that I was a student, you know, working late at night on homework problems, and I know that many of you have been doing that for the homeworks, studying for the midterm, or working on your final term projects. So I want to make sure you
know I'm very grateful for the hard work you put into this class, and I hope that your hard work and skills will also reward you very well in the future, and also help you do work that you find meaningful. So thank you very much.
[Applause]

================================================================================ LECTURE INDEX.md ================================================================================

CS229 – Machine Learning (Andrew Ng)
Playlist: https://www.youtube.com/playlist?list=PLoROMvodv4rMiGQp3WXShtMGgzqpfVfbU
Total Videos: 20
Transcripts Downloaded: 20
Failed/No Captions: 0

---

Lectures

1. Stanford CS229: Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018)
   - Video: [https://www.youtube.com/watch?v=jGwO_UgTS7I](https://www.youtube.com/watch?v=jGwO_UgTS7I)
   - Transcript: [001_jGwO_UgTS7I.md](001_jGwO_UgTS7I.md)
2. Stanford CS229: Machine Learning - Linear Regression and Gradient Descent | Lecture 2 (Autumn 2018)
   - Video: [https://www.youtube.com/watch?v=4b4MUYve_U8](https://www.youtube.com/watch?v=4b4MUYve_U8)
   - Transcript: [002_4b4MUYve_U8.md](002_4b4MUYve_U8.md)
3. Locally Weighted & Logistic Regression | Stanford CS229: Machine Learning - Lecture 3 (Autumn 2018)
   - Video: [https://www.youtube.com/watch?v=het9HFqo1TQ](https://www.youtube.com/watch?v=het9HFqo1TQ)
   - Transcript: [003_het9HFqo1TQ.md](003_het9HFqo1TQ.md)
4. Lecture 4 - Perceptron & Generalized Linear Model | Stanford CS229: Machine Learning (Autumn 2018)
   - Video: [https://www.youtube.com/watch?v=iZTeva0WSTQ](https://www.youtube.com/watch?v=iZTeva0WSTQ)
   - Transcript: [004_iZTeva0WSTQ.md](004_iZTeva0WSTQ.md)
5.
 Lecture 5 - GDA & Naive Bayes | Stanford CS229: Machine Learning Andrew Ng (Autumn 2018)
   - Video: [https://www.youtube.com/watch?v=nt63k3bfXS0](https://www.youtube.com/watch?v=nt63k3bfXS0)
   - Transcript: [005_nt63k3bfXS0.md](005_nt63k3bfXS0.md)
6. Lecture 6 - Support Vector Machines | Stanford CS229: Machine Learning Andrew Ng (Autumn 2018)
   - Video: [https://www.youtube.com/watch?v=lDwow4aOrtg](https://www.youtube.com/watch?v=lDwow4aOrtg)
   - Transcript: [006_lDwow4aOrtg.md](006_lDwow4aOrtg.md)
7. Lecture 7 - Kernels | Stanford CS229: Machine Learning Andrew Ng (Autumn 2018)
   - Video: [https://www.youtube.com/watch?v=8NYoQiRANpg](https://www.youtube.com/watch?v=8NYoQiRANpg)
   - Transcript: [007_8NYoQiRANpg.md](007_8NYoQiRANpg.md)
8. Lecture 8 - Data Splits, Models & Cross-Validation | Stanford CS229: Machine Learning (Autumn 2018)
   - Video: [https://www.youtube.com/watch?v=rjbkWSTjHzM](https://www.youtube.com/watch?v=rjbkWSTjHzM)
   - Transcript: [008_rjbkWSTjHzM.md](008_rjbkWSTjHzM.md)
9. Lecture 9 - Approx/Estimation Error & ERM | Stanford CS229: Machine Learning (Autumn 2018)
   - Video: [https://www.youtube.com/watch?v=iVOxMcumR4A](https://www.youtube.com/watch?v=iVOxMcumR4A)
   - Transcript: [009_iVOxMcumR4A.md](009_iVOxMcumR4A.md)
10. Lecture 10 - Decision Trees and Ensemble Methods | Stanford CS229: Machine Learning (Autumn 2018)
   - Video: [https://www.youtube.com/watch?v=wr9gUr-eWdA](https://www.youtube.com/watch?v=wr9gUr-eWdA)
   - Transcript: [010_wr9gUr-eWdA.md](010_wr9gUr-eWdA.md)
11. Lecture 11 - Introduction to Neural Networks | Stanford CS229: Machine Learning (Autumn 2018)
   - Video: [https://www.youtube.com/watch?v=MfIjxPh6Pys](https://www.youtube.com/watch?v=MfIjxPh6Pys)
   - Transcript: [011_MfIjxPh6Pys.md](011_MfIjxPh6Pys.md)
12.
 Lecture 12 - Backprop & Improving Neural Networks | Stanford CS229: Machine Learning (Autumn 2018)
   - Video: [https://www.youtube.com/watch?v=zUazLXZZA2U](https://www.youtube.com/watch?v=zUazLXZZA2U)
   - Transcript: [012_zUazLXZZA2U.md](012_zUazLXZZA2U.md)
13. Lecture 13 - Debugging ML Models and Error Analysis | Stanford CS229: Machine Learning (Autumn 2018)
   - Video: [https://www.youtube.com/watch?v=ORrStCArmP4](https://www.youtube.com/watch?v=ORrStCArmP4)
   - Transcript: [013_ORrStCArmP4.md](013_ORrStCArmP4.md)
14. Lecture 14 - Expectation-Maximization Algorithms | Stanford CS229: Machine Learning (Autumn 2018)
   - Video: [https://www.youtube.com/watch?v=rVfZHWTwXSA](https://www.youtube.com/watch?v=rVfZHWTwXSA)
   - Transcript: [014_rVfZHWTwXSA.md](014_rVfZHWTwXSA.md)
15. Lecture 15 - EM Algorithm & Factor Analysis | Stanford CS229: Machine Learning Andrew Ng -Autumn2018
   - Video: [https://www.youtube.com/watch?v=tw6cmL5STuY](https://www.youtube.com/watch?v=tw6cmL5STuY)
   - Transcript: [015_tw6cmL5STuY.md](015_tw6cmL5STuY.md)
16. Lecture 16 - Independent Component Analysis & RL | Stanford CS229: Machine Learning (Autumn 2018)
   - Video: [https://www.youtube.com/watch?v=YQA9lLdLig8](https://www.youtube.com/watch?v=YQA9lLdLig8)
   - Transcript: [016_YQA9lLdLig8.md](016_YQA9lLdLig8.md)
17. Lecture 17 - MDPs & Value/Policy Iteration | Stanford CS229: Machine Learning Andrew Ng (Autumn2018)
   - Video: [https://www.youtube.com/watch?v=d5gaWTo6kDM](https://www.youtube.com/watch?v=d5gaWTo6kDM)
   - Transcript: [017_d5gaWTo6kDM.md](017_d5gaWTo6kDM.md)
18. Lecture 18 - Continous State MDP & Model Simulation | Stanford CS229: Machine Learning (Autumn 2018)
   - Video: [https://www.youtube.com/watch?v=QFu5nuc-S0s](https://www.youtube.com/watch?v=QFu5nuc-S0s)
   - Transcript: [018_QFu5nuc-S0s.md](018_QFu5nuc-S0s.md)
19.
 Lecture 19 - Reward Model & Linear Dynamical System | Stanford CS229: Machine Learning (Autumn 2018)
   - Video: [https://www.youtube.com/watch?v=0rt2CsEQv6U](https://www.youtube.com/watch?v=0rt2CsEQv6U)
   - Transcript: [019_0rt2CsEQv6U.md](019_0rt2CsEQv6U.md)
20. RL Debugging and Diagnostics | Stanford CS229: Machine Learning Andrew Ng - Lecture 20 (Autumn 2018)
   - Video: [https://www.youtube.com/watch?v=pLhPQynL0tY](https://www.youtube.com/watch?v=pLhPQynL0tY)
   - Transcript: [020_pLhPQynL0tY.md](020_pLhPQynL0tY.md)