================================================================================ LECTURE 001 ================================================================================ Stanford CS229: Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018) Source: https://www.youtube.com/watch?v=jGwO_UgTS7I --- Transcript

[00:00:03] Welcome to CS229 Machine Learning. Some of you know that I've taught this class at Stanford for a long time, and this is often the class I most look forward to teaching each year, because this is where we've helped, I think, several generations of Stanford students become experts in machine learning, go build many of the products and services and startups that I'm sure many of you, maybe all of you, are using today. So what I want to do today is spend some time talking over logistics, and then spend some time giving you the beginning of an intro talk, a little bit about machine learning. So, about 229: you know, all of you have been reading about AI in the news, about machine learning in the news, and you've probably heard me and others say AI is the
[00:00:59] new electricity: much as the rise of electricity about 100 years ago transformed every major industry, I think AI (we call it machine learning, but the rest of the world seems to call it AI), machine learning and AI and deep learning, will change the world. And I hope that CS229 will give you the tools you need so that you can be one of these future titans of industry, whether you, you know, go help the large tech companies do the amazing things they do, or build your own startup, or go into some other industry: go transform healthcare, or go transform transportation, or go build a self-driving car, and do all of these things that, after this class, I think you'll be able to do. You know, the demand for AI skills, the demand for machine learning skills, is so vast, I think you all know that, and I think it's because machine
[00:01:52] learning has advanced so rapidly in the last few years that there are so many opportunities to apply learning algorithms, right, both in industry as well as in academia. I think today we have English department professors trying to apply learning algorithms to understand history better; we have lawyers trying to apply machine learning to process legal documents; and off campus, every company, both the tech companies as well as a lot of other companies that you wouldn't consider tech companies, everything from manufacturing companies to healthcare companies to logistics companies, is also trying to apply machine learning. So I think that, um, if you look at it on a factual basis, the number of people doing very valuable machine learning projects today is much greater than it was six months ago, and six months ago it was much greater than it was twelve
[00:02:44] months ago. And the amount of value, the amount of exciting, meaningful work being done in machine learning, is very strongly going up. And I think that, given the rise of, you know, the amount of data we have as well as the new machine learning tools that we have, it will be a long time before we run out of opportunities, before society as a whole has enough people with the machine learning skill set. So just as maybe, I don't know, 20 years ago was a good time to start working on this internet thing (a lot of people that started working on the internet like 20 years ago had fantastic careers), I think today is a wonderful time to jump into machine learning, and the odds of you being able to, say, go to a logistics company and find an exciting way to apply machine learning will
[00:03:42] be very high, because chances are that logistics company has no one else even working on this, because, you know, they may not be able to hire a fantastic Stanford student, a graduate of CS229, right? There just aren't a lot of CS229 graduates around. Um, so what I want to do today is do a quick intro talking about logistics, and then we'll spend the second half of the day, you know, giving an overview and talking a little bit more about machine learning. Okay. And, oh, I apologize: I think that this room, according to that sign there, seats what, 300-something students? I think we have not quite 800 people enrolled in this class. So if there are people outside: all of the classes are recorded and broadcast on SCPD, and the videos are usually made available the same day. So for those of you that can't get into the room, my apologies.
[00:04:42] You know, there are some years where even I had trouble getting into the room, but hopefully you can watch all of these things online shortly. [exchange with staff] Yes? Yeah, I don't know, it's a bit complicated. Yeah, thank you, I think it's okay. Yeah, okay. Yeah, for the next few classes you can squeeze in; using the NCC for now might be too complicated. So, quick announcements. Um, oh, I'm sorry, I should have introduced myself: my name is Andrew, and I want to introduce some of the rest of the teaching team as well. [name inaudible] is the class coordinator; she has been playing this role for many years now and helps keep the trains running on time and makes sure that everything in the course happens when it's supposed to. And then [names inaudible] will be the co-head TAs, respectively PhD students working with me, and so bringing a lot of their own technical experience in machine learning as well
[00:05:56] as practical know-how on how to make these things work. And with the large class that we have, we have a large TA team. Maybe I won't introduce all of the TAs here today, but you'll meet many of them throughout the school term. The TAs' expertise spans everything from computer vision and language processing technology to robotics, and so through this quarter, as you work on your class projects, I hope that you get a lot of help and advice and mentoring from the TAs, all of whom have deep expertise not just in machine learning but often in a specific vertical application area of machine learning. So depending on what your project is, we'll try to match you to a TA that can give you the advice that's most relevant to whatever project you end up working on. Um, so, you know, the goal of this class: I hope that after the next ten weeks you will be an expert in machine learning. It turns out
[00:06:55] that, you know, I hope that after this class you'll be able to go out and build very meaningful machine learning applications, either in an academic setting, where hopefully you can apply it to your problems in mechanical engineering, electrical engineering, English, law, education, and all of this wonderful work that happens on campus, or, after you graduate from Stanford, applied to whatever jobs you find. One of the things I find very exciting about machine learning is that it's no longer a sort of pure-tech-company-only kind of thing, right? I think that many years ago machine learning was like a thing that, you know, computer science departments would do, and that the elite AI companies like Google and Facebook and Baidu and Microsoft would do. But now it is so pervasive that even companies that are not traditional tech companies see a huge need to apply
[00:07:51] these tools, and I find a lot of the most exciting work these days there. And maybe some of you know my history: I led the Google Brain team, which helped Google transform from what was already a great company ten years ago to today, which is a great AI company. And then I also led the AI group at Baidu and, you know, led the company's technology strategy, to help Baidu also transform from what was already a great company many years ago to today, arguably China's greatest AI company. So having built the teams that led the AI transformations of two large tech companies, I feel like that's a great thing to do. But even beyond tech, I think that, um, there's a lot of exciting work to do as well, to help other industries, to help other sectors, embrace machine learning and use these tools effectively. But after this class I hope
[00:08:42] that each one of you will be well qualified to get a job at a shiny tech company and do machine learning there, or go into one of these other industries and do very valuable machine learning projects there. Um, and in addition, if any of you are taking this class with the primary goal of being able to do research in machine learning (you know, actually, some of you I know are PhD students), I hope that this class will also leave you well equipped to really read and understand research papers, as well as, you know, be qualified to start pushing forward the state of the art. So let's see. So today, just as machine learning is evolving rapidly, the whole teaching team, we've been constantly updating CS229 as well. So it's actually very interesting: I feel like the pace of progress in machine learning has accelerated, so it actually feels like the amount we change the course
[00:09:49] year over year has been increasing over time. So if you have friends that took the class last year, you know, things are a little bit different this year, because we're constantly updating the class to keep up with what feels like steadily accelerating progress in the whole field of machine learning. So there are some logistical changes; for example, we've gone from handing out paper copies of handouts to trying to make this class digital-only. But let me talk a little bit about prerequisites, as well as, in case your friends have taken this class before, some of the differences for this year. Um, so, prerequisites: we are going to assume that all of you have a knowledge of basic computer skills and principles, so, you know, big-O notation, queues, stacks, binary trees; hopefully you understand what all of those concepts are.
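[Editor's note] As a quick self-check on the computer-science prerequisites listed here (queues, stacks, binary trees, and their big-O costs), a few lines of Python cover them; this sketch is illustrative and not from any course material:

```python
from collections import deque

# Queue: first-in, first-out. deque gives O(1) appends and pops at both ends.
q = deque()
q.append(1); q.append(2)
assert q.popleft() == 1  # the oldest element leaves first

# Stack: last-in, first-out. A plain list gives amortized O(1) push/pop at the end.
s = []
s.append(1); s.append(2)
assert s.pop() == 2  # the newest element leaves first

# Binary tree node (illustrative class, not course code): an in-order traversal
# of a binary search tree visits keys in sorted order, in O(n) time for n nodes.
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def inorder(node):
    if node is None:
        return []
    return inorder(node.left) + [node.key] + inorder(node.right)

root = Node(2, Node(1), Node(3))
assert inorder(root) == [1, 2, 3]
```

If these three snippets look familiar, the systems-side prerequisites should pose no problem.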
[00:10:43] We'll also assume that all of you have a basic familiarity with probability, right: that hopefully you know what a random variable is, what the expected value of a random variable is, what the variance of a random variable is. And for some of you, maybe especially the SCPD students taking this remotely, if it's been, you know, some number of years since you last took a probability and statistics course, we will have review sessions on Fridays where we'll go over some of this prerequisite material as well. Hopefully you know what a random variable is and what the expected value is, but if you are a little bit fuzzy on those concepts, we'll go over them again at a discussion section on Friday. We'll also assume familiarity with basic linear algebra, so hopefully you know what a matrix is, what a vector is, how to multiply two matrices, and how to multiply a matrix and a vector. If you know what an eigenvector is, then that's even better; if you're not
then that's even better if you're not quite sure what an eigenvector is look [00:11:34] quite sure what an eigenvector is look over it uuugh but yeah we'll go over it [00:11:38] over it uuugh but yeah we'll go over it I guess um and then um a large part of [00:11:43] I guess um and then um a large part of this class is having you practice these [00:11:49] this class is having you practice these ideas through the homeworks as well as I [00:11:52] ideas through the homeworks as well as I mentioned later a open-ended project and [00:11:55] mentioned later a open-ended project and so one there we've actually until now we [00:12:00] so one there we've actually until now we used to use MATLAB in octave for their [00:12:03] used to use MATLAB in octave for their premiere assignments but this year we're [00:12:06] premiere assignments but this year we're trying to ship the permian Simon's to [00:12:08] trying to ship the permian Simon's to Python and so I think for a long time [00:12:11] Python and so I think for a long time even today you know I sometimes use [00:12:14] even today you know I sometimes use octaves their prototype because the [00:12:16] octaves their prototype because the syntax the octave is so nice and just [00:12:18] syntax the octave is so nice and just run very simple experiments very quickly [00:12:21] run very simple experiments very quickly but I think the Machine there in the [00:12:23] but I think the Machine there in the world [00:12:24] world you know really migrating I think from [00:12:27] you know really migrating I think from MATLAB Python world to increasing using [00:12:30] MATLAB Python world to increasing using MATLAB octave world to increasingly a [00:12:33] MATLAB octave world to increasingly a Python maybe and then eventually for [00:12:36] Python maybe and then eventually for production Java or C++ kind of world and [00:12:38] production Java or C++ kind of world and so we're rewriting a lot of the [00:12:40] so we're rewriting a lot 
[00:12:43] [The co-head TAs] have been driving that process, so that for this course you could do more of the assignments, maybe most, maybe all of the assignments, in Python and NumPy instead. Now, a note on the honor code as well: we actually encourage you to form study groups. So, you know, um, I'm fascinated by education; for a long time I've been studying education and pedagogy and how instructors like us can help support you to learn more efficiently, and one of the lessons I've learned from the educational research literature is that for highly technical classes like this, if you form study groups, you will probably have an easier time, right? So CS229, I would say, is highly technical material: there's a lot of math, the problem sets are hard, and if you have a group of friends to study with, you
[00:13:38] probably have an easier time, because you can ask each other questions and work together and help each other. Um, where we ask you to draw the line, what we ask you to do relative to the standards of the honor code, is: we ask that you do the homework problems by yourself, right? And more specifically, it's okay to discuss the homework problems with friends, but after discussing homework problems with friends we ask you to go back and write up the solutions by yourself, without referring to notes that, you know, you and your friends have developed together. Okay? The class's honor code is written clearly on the class handouts, posted digitally on the website. So if you ever have any questions about what kind of collaboration is allowed and what isn't allowed, please refer to that written document on the course website, where we describe this more clearly. But out of respect for the Stanford honor
well as the [00:14:30] code as well as the your students kind of doing their own [00:14:32] your students kind of doing their own work we asked you to basically do your [00:14:34] work we asked you to basically do your own work for the soca to discuss it but [00:14:38] own work for the soca to discuss it but after discussing home problems with [00:14:39] after discussing home problems with friends ultimately we asked you to write [00:14:41] friends ultimately we asked you to write up your problems by yourself so that the [00:14:43] up your problems by yourself so that the homework submissions reflect your own [00:14:46] homework submissions reflect your own work right and I care about this because [00:14:49] work right and I care about this because turns out that having CS 239 you know CS [00:14:52] turns out that having CS 239 you know CS 229 is one of those classes that [00:14:54] 229 is one of those classes that employers recognize I don't know if you [00:14:57] employers recognize I don't know if you guys know but they're been um [00:14:59] guys know but they're been um companies that have put up job ads that [00:15:01] companies that have put up job ads that say stuff like so long as you got solace [00:15:04] say stuff like so long as you got solace you complete the CST three now and we [00:15:06] you complete the CST three now and we guarantee you get an interview right [00:15:07] guarantee you get an interview right I've seen stuff like that and so I think [00:15:10] I've seen stuff like that and so I think you know in order to maintain that [00:15:13] you know in order to maintain that sanctity of what it means to be a CSU to [00:15:15] sanctity of what it means to be a CSU to nine computer I think and I all said all [00:15:17] nine computer I think and I all said all of you so the really do work or stay [00:15:21] of you so the really do work or stay within the bounds of accepted of [00:15:22] within the bounds of accepted of acceptable collaboration 
[00:15:25] relative to the honor code. Let's see. And I think that, um... and I think that one of the best parts of CS229, it turns out, is... excuse me. [trouble with the projector] Oh yeah, sorry, I'm going to try looking for the mouse cursor. All right, so that might display on... no, it's not mirrored, so this is a little bit awkward. Um, so one of the best parts of the class is... sorry about that. Right, never mind, I won't do this; you can do that yourself online later. Yeah, I started using Firefox recently in addition to Chrome; it was just a mix-up. Um, one of the best parts of the class is the class project. And so, you know, one of the goals of the course is to leave you well qualified to do a meaningful machine learning project, and so one of the best ways to make sure you have that skill set is through this class, and hopefully, with the help of some of the
TAs, we want to support you to work in a small group to complete a meaningful machine learning project. [00:17:20] And so one thing I hope you start doing, maybe later today, is to start brainstorming with your friends some of the class projects you might work on. The most common class project people do in CS 229 is: pick an area — pick an application that excites you — and apply machine learning to it, and see if you can build a good machine learning system for some application in that area. [00:17:46] And if you go to the course website, cs229.stanford.edu, and look at previous years' projects, you'll see machine learning projects applied to pretty much every imaginable application under the sun — everything from diagnosing cancer, to creating art, to lots of projects applied to other areas of engineering, applying
to application areas in EE or mechanical engineering or civil engineering or earthquake prediction and so on, to applying it to understanding literature. [00:18:18] And so if you look at previous years' projects, many of which are posted on the course website, you can use that as inspiration — to see the types of projects students completing this class are able to do, to get a sense of what you'll be able to do at the conclusion of this class, and to see if previous years' projects give you inspiration for what you might do yourself. [00:18:48] We also invite you to do class projects in small groups, and so after class today I'd also encourage you to start making friends in the class, both for the purpose of forming study groups as well
as for the purpose of maybe finding a small group to do a class project with. [00:19:03] We ask you to form project groups of up to size three; most project groups end up being size two or three. If you insist on doing it by yourself, without any partners, that's actually okay too — you're welcome to do that — but I think having one or two others to work with often gives you an easier time. [00:19:25] And for projects of exceptional scope — if you have a very large project that just cannot be done by three people — let us know, and we're open to working with some project groups of size four. But we do hold projects done by a group of four to a higher standard than projects done by groups of size one to three. So what that means is that if your project team is one, two, or three persons, the grading uses one criterion; if your project group is bigger than three persons, we use a
stricter criterion when it comes to grading class projects, okay? [00:20:03] Oh, and that reminds me — since this class starts at 9:30 a.m. on the first day of the quarter, for many of you this may be your very first class at Stanford. How many of you — is this your very first class at Stanford? Wow, cool, okay, awesome — great, welcome to Stanford. [00:20:23] And if someone next to you just raised their hand — actually, raise your hands again — I hope that maybe after class today, if someone near you raised their hand, you'll welcome them to Stanford, say hi, introduce yourself, and make friends. I'll do it too. Cool — nice to see so many of you. [00:20:53] All right, so just a bit more logistics. Let's see — in addition to the main lectures that we'll have here on Mondays and Wednesdays, CS 229 also has discussion
sections, held on Fridays. Everything we do — all the lectures and the discussion sections — is recorded and broadcast through SCPD, through the online website. [00:21:22] The discussion sections are usually taught by the TAs on Fridays, and attendance at discussion sections is optional. What I mean is: there won't be material on the midterm that sneaks in from the sections — it's a hundred percent optional, and you'll be able to do all the homework and complete the projects without attending the discussion sections. [00:21:43] But for the first three discussion sections — this week, next week, and the week after that — we'll use the discussion sections to go over prerequisite material: the TAs will go over linear algebra, or basic probability and
statistics, and teach a little about Python and NumPy, in case you're less familiar with those frameworks. We'll do that for the first few weeks, and then for the discussion sections held later this quarter, we'll usually use them to go over more advanced, optional material. [00:22:12] For example, a lot of the learning algorithms you'll hear about in this class rely on convex optimization algorithms, but we want to focus the class on the learning algorithms themselves and spend less time on convex optimization — so if you want to come and hear about more advanced concepts in convex optimization, we'll defer that to a discussion section. And there are a few other advanced topics — hidden Markov models, time series — that we're planning to defer to the Friday discussion sections. [00:22:44] Okay, so let's see — cool. [00:23:00] Oh, and a final bit of logistics: there are digital tools that some of you have seen, but for this class we'll drive a
lot of the discussion through the online website, Piazza. How many of you have used Piazza before? Okay, cool — mostly? Wow, all of you — that's pretty amazing, good. [00:23:18] It's an online discussion board, for those of you that haven't seen it before, and I definitely encourage you to participate actively on Piazza, and also to answer other students' questions. I think one of the best ways to learn, as well as to contribute back to the course as a whole, is, if you see someone else ask a question on Piazza, to jump in and help answer it — that often helps you and helps your classmates, so I strongly encourage you to do that. [00:23:45] For those of you that have a private question — sometimes we have students reaching out to us with a personal matter, or something that's not appropriate to share on a public forum — in which case you're welcome to email us at the class email
address as well. The class email address — the teaching staff's email address — is on the course website; you can find it there under "contact us". [00:24:07] But for anything technical, or anything reasonable to share with the class — which includes most technical questions and most logistical questions, questions like, "can you confirm what date the midterm is?" and so on — for questions that are not personal or private in nature, I strongly encourage you to post on Piazza rather than emailing us, because statistically you'll actually get a faster answer by posting on Piazza than if you wait for one of us to respond to you. [00:24:40] And we'll be using Gradescope as well for online grading — if you don't know what Gradescope is, don't worry about it; we'll
send you links and show you how to use it. [00:24:52] Oh, and again — one more change, which is a real thing to plan for, unlike previous years when we taught CS 229. We're constantly updating the syllabus — the technical content — to try to show you the latest machine learning algorithms, and the two big changes we're making this year are: one is Python instead of MATLAB, and the other is that instead of having a midterm exam — a timed midterm — we're planning to have a take-home midterm this quarter. [00:25:34] So — I know some people just breathed in sharply when I said that; I don't know what that means. Was that shock or happiness? Don't worry, midterms are fun — you'll love it. [00:25:51] All right, so that's it for the logistical aspects. Let me check — are there any questions? [00:25:57] Oh yeah, go ahead.
[00:26:16] [A student asks whether CS 229A is offered in other quarters.] Oh, let's see — I think it's offered in spring. Oh yes, and I was teaching it, so someone else is teaching it in spring quarter. I actually did not know it was going to be offered in winter — [00:26:47] I think it's being taught in spring, and I don't think it's offered in winter. [00:26:58] [A student asks: will the discussion sections be recorded?] Yes, they will be. Oh, and by the way, if you wonder why I'm repeating the question — I know it feels weird — I'm repeating it for the microphone, so that people watching at home can hear the question. But both the lectures and the discussion sections will be recorded and put on the website. Maybe the one thing we do that's not recorded and broadcast is the office hours. [00:27:25] Oh, but I think this year we have 60 — how many — 60 office hours per week, right? Yeah, so
hopefully — again, we're constantly trying to improve the course; in previous years, one piece of feedback we got was that the office hours were really crowded, so we have 60 office-hour slots per week this year, which seems like a lot. So hopefully, if you need to track down one of us — track down the TAs to get help — that'll make it easier for you to do so. [00:27:54] Okay. [A student asks about logistics — when the homeworks will be due.] Yes, so we have four planned homeworks, and if you go to the course website and click on the syllabus link, there's a calendar with when each homework assignment goes out and when it's due. So four homeworks, a project proposal due a few weeks from now, and then final projects due at the end of the quarter — all the other exact dates are listed on the course website. [00:28:43] Oh, sure. [A student asks:] yes — the
difference between this class and CS 229A? Let me think how to answer that. Yeah, I was debating earlier this morning how to answer that — I've gotten asked that a few times. [00:28:58] I think what has happened at Stanford is that the volume of demand for machine learning education is just skyrocketing, because everyone wants to learn this stuff, and so the computer science department has been trying to grow the number of machine learning offerings we have. [00:29:20] We've actually kept the enrollment of CS 229A at a relatively low number — at a hundred students — so I actually don't want to encourage too many of you to sign up, because I think we might be hitting the enrollment cap already. So please don't all sign up for CS 229A, because CS 229A does not have the capacity this quarter. But CS 229A is a much less mathematical and much more
applied — relatively more applied — version of machine learning. [00:29:52] I guess I'm teaching CS 229A, CS 230, and CS 229 this quarter. Of the three, CS 229 is the most mathematical; it's a little bit less applied than CS 229A, which is more applied machine learning, and CS 230, which is deep learning. My advice to students is — let me write this down. [00:30:21] So CS 229A is taught in a flipped-classroom format, which means that students taking it will mainly watch videos on the Coursera website and do a lot of programming exercises, and then meet for weekly discussion sections — but it's a smaller class with capped enrollment. I would advise you that if you feel ready for CS 229 and CS 230, do those; but CS 229, because of the math, is a very heavy-workload and pretty challenging class, and so if you're not sure you're ready for CS 229, CS 229A may be a good thing to take first, and
then CS 229. CS 229 covers a broader range of machine learning algorithms, and CS 230 is more focused on deep learning algorithms specifically — a much narrower set of algorithms, but one of the hottest areas there is. [00:31:26] There is not that much overlap in content between the three classes, so if you actually take all three, you learn relatively different things from each of them. In the past we've had students simultaneously take 229 and 229A, and there is a little bit of overlap — they do cover related algorithms, but from different points of view — so some people actually take multiple of these classes at the same time. But CS 229A is more applied — a bit more practical know-how, hands-on, and so on — and much less mathematical; and CS 230 is also less mathematical, more applied, more about
kind of getting things to work, whereas in CS 229 we do much more mathematical derivations. [00:32:22] [A student asks to run a quick poll of the class.] So — I would generally prefer students not do that, in the interest of time, but — what do you want? Oh, I see. Sure, go for it. [The student asks who is enrolled in CS 229A or CS 230.] Not that many of you — interesting. Oh, that's actually really interesting, cool. Yeah, thank you. [00:32:49] I just didn't want to set a precedent of students using this as a forum to run surveys, but that was an interesting question, so thank you. Cool, all right. [00:33:04] And by the way, one thing about Stanford: the AI world is bigger than machine learning, right, and machine learning is bigger than deep learning. One of the great things about being a Stanford student is you can — and I think should — take multiple classes. I think that CS 229
has for many years been the core of the machine learning world at Stanford, but even beyond CS 229, it's worth your while to take multiple classes covering multiple perspectives. [00:33:35] So if you want to be really effective after you graduate from Stanford: you do want to be an expert in machine learning, you do want to be an expert in deep learning, and you probably want to know some statistics; maybe you want to know a bit of convex optimization, maybe a bit more about reinforcement learning, a little bit about planning — a little bit about lots of things. So I actually encourage you to take multiple classes like this. [00:34:03] If there are no more questions, let's go on to talk a bit about machine learning. [00:34:15] All right, so in the remainder of this class, what I'd like to do is give a quick overview of the major areas of machine learning, and also
give you a quick overview of the things you'll learn in the next ten weeks. [00:34:38] So what is machine learning? It seems to be everywhere these days, and it's useful in so many places. And I feel — just to share my personal bias — you read the news about these people making so much money building learning algorithms; I think that's great, and I hope all of you go make a lot of money. But the thing I find even more exciting is the meaningful work we could do. [00:35:06] I think that every time there's a major technological disruption — which there is now, through machine learning — it gives us an opportunity to remake large parts of the world, and if we behave ethically and in a principled way, and use these superpowers of machine learning to do things that
helps people's lives right maybe we could maybe you can improve the [00:35:25] maybe we could maybe you can improve the health care system maybe you can improve [00:35:27] health care system maybe you can improve give every child a personalized tutor [00:35:30] give every child a personalized tutor maybe you can make a democracy run [00:35:32] maybe you can make a democracy run better rather than make it run worse but [00:35:34] better rather than make it run worse but I think that the meaning I find in [00:35:37] I think that the meaning I find in machine learning is that there's so many [00:35:38] machine learning is that there's so many people that are so eager for us to go in [00:35:41] people that are so eager for us to go in and help them with these tools that if [00:35:44] and help them with these tools that if you become good at these tools it gives [00:35:47] you become good at these tools it gives you an opportunity to really remake some [00:35:49] you an opportunity to really remake some peace some meaningful piece of the world [00:35:52] peace some meaningful piece of the world hopefully in a way that helps other [00:35:54] hopefully in a way that helps other people and makes the world kind of makes [00:35:56] people and makes the world kind of makes the world a better place is very cliche [00:35:58] the world a better place is very cliche in Silicon Valley but but I think you [00:36:00] in Silicon Valley but but I think you know with these tools you actually have [00:36:02] know with these tools you actually have the power to do that and they've got [00:36:04] the power to do that and they've got make a ton of money that's great too but [00:36:05] make a ton of money that's great too but I find a much greater meaning in the [00:36:07] I find a much greater meaning in the work we could do but um [00:36:14] work we could do but um despite all the excitement of machine [00:36:15] despite all the excitement of machine learning what is machine learning so 
[00:36:17] Let me give you a couple of definitions of machine learning. Arthur Samuel, whose claim to fame was building a checkers-playing program, defined it as the field of study that gives computers the ability to learn without being explicitly programmed. And it's interesting: when Arthur Samuel, many decades ago, wrote the checkers-playing program, the debate of the day was whether a computer could ever do something that it wasn't explicitly told to do. Arthur Samuel wrote a checkers-playing program that, through self-play, learned which patterns of the checkerboard were more likely to lead to a win versus more likely to lead to a loss, and it learned to be even better than Arthur Samuel, the author himself, at playing checkers. Back then this was viewed as a remarkable result: that a programmer could write a piece of software to do something that the programmer himself could not do, because this program became better than Arthur Samuel at the task of playing checkers. Today we are used to computers, or machine learning algorithms, outperforming humans on so many tasks, but it turns out that when you choose a narrow task, like speech recognition on a certain type of task, you can maybe surpass human-level performance; or if it's a narrow task like playing the game of Go, then by throwing tons of computation power at it, and self-play, you can have a computer become very good at these narrow tasks. But this was maybe one of the first such examples in the history of computing, and I think this is still one of the most widely cited definitions: gives computers the ability to learn without being explicitly programmed.

[00:38:15] My friend Tom Mitchell, in his textbook, defined it as a well-posed learning problem: a program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. And I asked Tom if he wrote this definition just because he wanted it to rhyme; he did not say yes, but I don't know. In this definition, for the case of playing checkers, the experience E would be the experience of having the checkers program play tons of games against itself; computers have lots of patience and will sit there for days playing games of checkers against themselves, so that's the experience E. The task T is the task of playing checkers, and the performance measure P maybe was the chance of this program winning the next game of checkers it plays against the next opponent. So we say that this is a well-posed learning problem of learning to play checkers.
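Mitchell's definition is easy to see in code. Here is a tiny illustrative sketch, not from the lecture — the coin-flip setup and all numbers are invented: the experience E is a set of observed coin flips, the task T is predicting the coin's bias, and the performance measure P is the squared error of the estimate, which should shrink as E grows.

```python
import random

random.seed(0)
TRUE_P = 0.7  # hidden bias of the coin: the thing the program must learn

def experience(n):
    """E: observe n coin flips (1 = heads, 0 = tails)."""
    return [1 if random.random() < TRUE_P else 0 for _ in range(n)]

def learn(flips):
    """The 'program': estimate P(heads) from its experience."""
    return sum(flips) / len(flips)

def performance(estimate):
    """P: squared error on the task T of predicting the coin's bias."""
    return (estimate - TRUE_P) ** 2

# As experience E grows, performance P (error) generally improves
for n in [10, 100, 10000]:
    est = learn(experience(n))
    print(f"E = {n:>5} flips -> error {performance(est):.5f}")
```

With more flips the estimate concentrates around the true bias, which is exactly the "improves with experience E" clause of the definition.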
[00:39:18] Now, within this set of ideas of machine learning, there are many different tools we use, and in the next ten weeks you'll learn about a variety of these different tools. The first of them, and the most widely used one, is supervised learning. Let's see, I want to switch to the whiteboard; do you guys know how I erase the screen?

[00:39:59] So what I want to do today is go over some of the major categories of machine learning tools, so that you know what you'll learn by the end of this quarter. The most widely used machine learning tool today is supervised learning. Actually, let me check: how many of you know what supervised learning is? Two-thirds, half of you maybe? Okay, cool, let me just briefly define it. Here's one example. Let's say you have a database of housing prices, and I'm going to plot your data set where on the horizontal axis I plot the size of the house in square feet, and on the vertical axis I plot the price of the house, and maybe the data set looks like that. The horizontal axis, I guess we call this X, and the vertical axis we'll call Y. The supervised learning problem is: given a data set like this, find the relationship mapping from X to Y. So for example, let's say you are fortunate enough to own a house in Colorado and you're trying to sell it, and you want to know how to price the house. Maybe your house has a size of that amount on the horizontal axis; this is 500 square feet, 1,000 square feet, 1,500 square feet, so your house is 1,250 square feet, and you want to know how to price this house. Given this data set, one thing you can do is fit a straight line to it, and then you could predict the price to be whatever value you read off on the vertical axis.

[00:41:59] So in supervised learning you are given a data set with inputs X and labels Y, and your goal is to learn a mapping from X to Y. Now, fitting a straight line to the data is maybe the simplest possible learning algorithm, one of the simplest learning algorithms. Given a data set like this, there are many possible ways to learn the mapping, the function mapping from the input size to the estimated price, and so maybe you want to fit a quadratic function instead; maybe that actually fits the data a little bit better. How you choose among different models, either automatically or manually, will be something we'll spend a lot of time talking about.
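The straight-line fit described here can be sketched in a few lines of Python. The housing numbers below are made up for illustration, and the closed-form least-squares solution used is standard, though the lecture hasn't derived it yet.

```python
# Toy housing data (invented numbers, for illustration only):
# xs = size in square feet, ys = price in thousands of dollars
xs = [500, 750, 1000, 1500, 2000, 2500]
ys = [100, 150, 190, 280, 370, 450]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form least-squares fit of the straight line y = w*x + b
w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
b = mean_y - w * mean_x

# "Read off" the predicted price on the fitted line for a 1,250 sq ft house
predicted = w * 1250 + b
print(f"predicted price for 1250 sq ft: about ${predicted:.0f}k")
```

Swapping the straight line for a quadratic is a one-line model change; deciding which fit is actually better is the model-selection question the lecture defers to later.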
[00:42:53] Now, to define a few more terms: this particular example is a problem called a regression problem, and the term regression refers to the fact that the value Y you're trying to predict is continuous. In contrast, here's a different type of problem, a problem that some friends of mine were working on, and I'll simplify it. It was a healthcare problem where they were looking at breast cancer, breast tumors, and trying to decide if a tumor is benign or malignant. So a tumor, sort of a lump in a woman's breast, can be malignant, or cancerous, or benign, meaning roughly that it's not that harmful. And so on the horizontal axis you plot the size of a tumor, and on the vertical axis you plot: is it malignant or not? Malignant means harmful, and some tumors are harmful and some are not, so whether it's malignant or not takes only two values, one or zero, and you may have a data set like that.

[00:44:17] Given this, can you learn a mapping from X to Y, so that if a new patient walks into the doctor's office and the tumor size is, say, this, can a learning algorithm figure out that, based on this data set, it looks like there's a high chance that that tumor is malignant? So this is an example of a classification problem, and the term classification refers to the fact that Y here takes on a discrete number of values. For a regression problem Y is a real number; I guess technically prices can be rounded off to the nearest dollar, so prices aren't really real numbers, because you'd probably not price a house at, like, pi times a million dollars or whatever; but for all practical purposes prices are continuous, so we call housing price prediction a regression problem. Whereas if the possible output takes two values, zero or one, we call that a classification problem. If you have K discrete outputs, so if the tumor can be malignant, or if there are five types of cancer, and so you have one of five possible outputs, then that's also a classification problem, where the output is discrete.

[00:45:40] Now, I want to find a different way to visualize this data set, which is: let me draw a line on top, and I'm just going to map all this data on the horizontal axis up onto a line. I hope what I did was clear: I took the two sets of examples, the positive and negative examples, where a positive example is a one and a negative example is a zero, and I pushed all of these examples up onto a straight line, and I used two symbols: I use O's to denote negative examples, and I use crosses to denote positive examples.
[00:46:31] Okay, so this is just a different way of visualizing the same data, but drawing it on a line and using two symbols to denote the two discrete values, zero and one. Now, it turns out that in both of these examples the input X was one-dimensional; it was a single real number. For most of the machine learning applications you'll work with, the input X will be multi-dimensional: you won't be given just one number and asked to predict another number; instead you'll often be given multiple features, multiple numbers, to predict another number. So for example, instead of just using tumor size to estimate malignancy, malignant versus benign tumors, you may instead have two features, where one is the tumor size and the second is the age of the patient, and be given a data set that looks like that. Now your task is, given these two input features, so X is tumor size and age, like a two-dimensional vector, to predict whether a given tumor is malignant or benign. So a new patient walks into the doctor's office, and the tumor size is here and the age is here, so at that point there, hopefully you conclude that this patient's tumor is probably benign; it corresponds to a negative example. One thing you'll learn next week is a learning algorithm that can fit a straight line to the data, kind of like that, to separate out the positive and negative examples, separate out the O's and the crosses; next week you'll learn about the logistic regression algorithm, which can do that.

[00:48:48] Okay, so one of the most interesting things you'll learn about is, let's see: in this example I drew a data set with two input features. I said I have friends who actually worked on the breast cancer prediction problem; in practice you usually have a lot more than one or two features, and usually you have so many features you can't plot them on the board. For an actual breast cancer prediction problem my friends were working on, they were using many other features, such as, and don't worry about what these mean: clump thickness, uniformity of cell size, uniformity of cell shape, adhesion, how well the cells stick together. Don't worry about what these mean, but if you're actually doing this in an actual medical application, there's a good chance you'll be using a lot more features than just two, and this means that you actually can't plot this data; it's too high-dimensional. You can't plot things in more than three, maybe four dimensions.
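The logistic regression idea just mentioned, fitting a straight line to separate the O's from the crosses, can be sketched as follows. The patient data is invented, ages are pre-scaled by 1/100 so the two features are on comparable scales, and plain batch gradient descent on the logistic loss is one standard way to fit it; the lecture derives the details in later sessions.

```python
import math

# Invented toy data: each patient is [tumor size in cm, age/100];
# label 1 = malignant, 0 = benign
X = [[1.0, 0.30], [1.5, 0.45], [2.0, 0.35], [2.5, 0.70],
     [3.0, 0.55], [3.5, 0.65], [4.0, 0.50], [4.5, 0.75]]
y = [0, 0, 0, 1, 1, 1, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# The decision boundary is the straight line w0*size + w1*age + b = 0
w, b = [0.0, 0.0], 0.0
lr = 0.1
for _ in range(20000):          # batch gradient descent on the logistic loss
    g0 = g1 = gb = 0.0
    for (x0, x1), yi in zip(X, y):
        err = sigmoid(w[0] * x0 + w[1] * x1 + b) - yi
        g0 += err * x0
        g1 += err * x1
        gb += err
    w[0] -= lr * g0 / len(X)
    w[1] -= lr * g1 / len(X)
    b -= lr * gb / len(X)

def p_malignant(size_cm, age_years):
    return sigmoid(w[0] * size_cm + w[1] * age_years / 100 + b)

print(p_malignant(1.2, 40))   # small tumor, younger patient: low probability
print(p_malignant(4.2, 68))   # large tumor, older patient: high probability
```

The learned line plays exactly the role described on the board: points on one side are predicted benign, points on the other side malignant.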
[00:49:51] And so with all these features it's difficult to plot this data; I'll come back to this in a second when we talk about learning theory. One of the things you'll learn, as we develop algorithms, is how to build regression algorithms or classification algorithms that can deal with these relatively larger numbers of features. One of the most fascinating results you'll learn about is an algorithm called the support vector machine, which uses not one or two or three or ten or a hundred or a million input features, but an infinite number of input features. Just to be clear: in the first example the state of a patient was represented as one number, the tumor size; in this example you get two features, so the state of a patient is represented using two numbers, tumor size and age; if you use this list of features, maybe a patient arrives represented with five or six numbers. But there's an algorithm called the support vector machine that allows you to use an infinite-dimensional vector to represent patients. And how do you deal with that? How can a computer even store an infinite-dimensional vector? In computer memory you can store one real number, two real numbers, but you can't store an infinite number of real numbers without running out of memory, or processor speed, or whatever. So how do you do that? When we talk about support vector machines, and specifically the technical method called kernels, you'll learn how to build learning algorithms that work with an infinitely long list of features, for which you can imagine that if you have an infinitely long list of numbers to represent a patient, that might give you a lot of information about that patient, and so that turns out to be one of the relatively effective learning algorithms.
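The kernel idea can be made concrete with a small sketch; this particular kernel and expansion are a textbook-standard example, not something derived in this lecture. For one-dimensional inputs, the Gaussian (RBF) kernel exp(-(x-z)^2/2) is exactly the inner product of two infinite-dimensional feature vectors phi(x) whose k-th entry is exp(-x^2/2) * x^k / sqrt(k!). The code compares a truncated explicit inner product with the kernel value, which is computed without ever storing the infinite list.

```python
import math

def rbf_kernel(x, z):
    """Gaussian kernel: an inner product of infinite feature vectors,
    computed in constant time."""
    return math.exp(-((x - z) ** 2) / 2.0)

def phi(x, n_terms):
    """First n_terms entries of the *infinite* feature vector of scalar x:
    phi_k(x) = exp(-x^2/2) * x^k / sqrt(k!)"""
    scale = math.exp(-x ** 2 / 2.0)
    feats, fact = [], 1.0
    for k in range(n_terms):
        if k > 0:
            fact *= k          # fact = k! after this update
        feats.append(scale * x ** k / math.sqrt(fact))
    return feats

x, z = 0.8, 1.3
# Truncated explicit inner product converges to the kernel value as we
# keep more of the infinitely many features...
approx = sum(a * b for a, b in zip(phi(x, 30), phi(z, 30)))
exact = rbf_kernel(x, z)       # ...but the kernel never materializes them
print(approx, exact)
```

This is the point of the kernel trick: the algorithm only ever needs inner products between feature vectors, so it can work in an infinite-dimensional space while storing nothing infinite.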
of information [00:51:41] might give you a lot of information about that patient and so that is one of [00:51:44] about that patient and so that is one of the relatively effective learning [00:51:46] the relatively effective learning algorithm problems um so that's [00:51:52] algorithm problems um so that's supervised learning and you know let me [00:51:54] supervised learning and you know let me just some play a video show you a fun [00:52:00] just some play a video show you a fun slightly older example of supervised [00:52:03] slightly older example of supervised there in the previous hands what this [00:52:04] there in the previous hands what this means but at the heart of supervised [00:52:08] means but at the heart of supervised learning is the idea that during [00:52:10] learning is the idea that during training you are given inputs X together [00:52:14] training you are given inputs X together with the labels Y and you're given both [00:52:16] with the labels Y and you're given both at the same time and the job of your [00:52:18] at the same time and the job of your learning algorithm is to find a mapping [00:52:21] learning algorithm is to find a mapping so that given a new X you can map it to [00:52:25] so that given a new X you can map it to the most appropriate output Y so this is [00:52:28] the most appropriate output Y so this is a very old video made by a Dean [00:52:31] a very old video made by a Dean Pomerleau known for a long time as well [00:52:32] Pomerleau known for a long time as well on using supervised learning for [00:52:35] on using supervised learning for autonomous driving this does not save [00:52:37] autonomous driving this does not save the art for Toms driving anymore but it [00:52:39] the art for Toms driving anymore but it actually does remarkably well oh and as [00:52:42] actually does remarkably well oh and as you you hear a few technical terms like [00:52:45] you you hear a few technical terms like back propagation you learn all 
those [00:52:47] back propagation you learn all those techniques in this cause and by the end [00:52:50] techniques in this cause and by the end of class you've really built a learning [00:52:51] of class you've really built a learning algorithm much more effective than what [00:52:52] algorithm much more effective than what you see here but let's let's see this [00:52:54] you see here but let's let's see this application [00:52:59] could you turn up the volume maybe have [00:53:01] could you turn up the volume maybe have that are you guys getting volleyball [00:53:04] that are you guys getting volleyball yeah I see [00:53:13] alright I'll narrate this Oh so I'll be [00:53:17] alright I'll narrate this Oh so I'll be using artificial neural network to drive [00:53:19] using artificial neural network to drive this vehicle that was built at carnegie [00:53:21] this vehicle that was built at carnegie mellon university many years ago and [00:53:24] mellon university many years ago and what happens is during training it [00:53:27] what happens is during training it watches the human drive the vehicle and [00:53:30] watches the human drive the vehicle and I think ten times a second it digitizes [00:53:34] I think ten times a second it digitizes the image in front of the vehicle and so [00:53:37] the image in front of the vehicle and so that's a picture taken by a front-facing [00:53:40] that's a picture taken by a front-facing camera and what it does is in order to [00:53:44] camera and what it does is in order to collect labelled data the car while the [00:53:46] collect labelled data the car while the human is driving it records both the [00:53:49] human is driving it records both the image such as the scene here as well as [00:53:52] image such as the scene here as well as the steering direction that was chosen [00:53:53] the steering direction that was chosen by human so at the bottom here is the [00:53:56] by human so at the bottom here is the image turned into 
grayscale and lower [00:53:58] resolution, and on top, let me pause this for a second, this is the driving direction. The font's kind of blurry, but this text says driving direction. So this is the Y label, [00:54:11] the label Y that the human driver chose, and the position of this white bar, of this white blob, shows how the human is choosing to steer the car. So in this image the white blob is a little bit to the left of center, so the human is, you know, steering just a little bit to the left. [00:54:33] This second line here is the output of the neural network, and initially the neural network doesn't know how to drive, so it's just outputting this white smear everywhere, as if to say, you know, I don't know, do I drive left, right, center? I don't know, so it puts this gray blur everywhere. [00:54:46] And as the algorithm learns, using the back propagation learning algorithm, or gradient descent, which you'll
learn [00:54:56] about this Wednesday, actually, you see that the neural network's output becomes less and less of this white smear, this white blur, and starts to become sharper and to mimic more accurately the human-selected driving direction. [00:55:16] So this, um, there's an example of supervised learning, because the human driver demonstrates inputs X and outputs Y: maybe, if you see this in front of the car, steer like that. So that's X and Y. [00:55:31] And after the learning algorithm has learned, you can then, well, he pushes a button and takes his hands off the steering wheel, and then it's using this neural network to drive itself, right: digitizing the image in front of the car, taking this image and passing it through the trained neural network, letting the neural network select the steering direction, and then using a little motor to turn the wheel.
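To make the training loop just described concrete, here is a minimal sketch of supervised learning with gradient descent. This is an illustration only, not ALVINN's actual network: a plain linear model fit by batch gradient descent, with the features and "steering" labels invented for the example.

```python
import numpy as np

# Toy supervised-learning setup: each row of X stands in for one input
# (in ALVINN's case, a digitized camera image), and y for the label
# (the steering direction the human chose). All values are invented.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # 100 examples, 3 features
true_w = np.array([0.5, -1.0, 2.0])            # the mapping to be learned
y = X @ true_w + 0.01 * rng.normal(size=100)   # labels with a little noise

# Batch gradient descent on mean squared error: repeatedly nudge the
# weights w downhill, so predictions sharpen toward the human's labels.
w = np.zeros(3)
lr = 0.1
for _ in range(500):
    grad = (2.0 / len(X)) * X.T @ (X @ w - y)  # gradient of the MSE
    w -= lr * grad

print(np.round(w, 2))  # close to true_w: the X -> y mapping was learned
```

The same recipe, with a neural network in place of the linear model and images in place of the toy features, is essentially what the video's training phase is doing.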
[00:56:02] This is a slightly more advanced version which has trained two separate models, one for, I think, two-lane roads and one for four-lane roads. So the second and third lines: this is for a two-lane road, this is for a four-lane road, [00:56:18] and the arbitrator is another algorithm that tries to decide whether the two-lane or the four-lane road model is the more appropriate one for a particular given situation. And so as ALVINN is driving, excuse me, on a one-lane road, or, it's driving from a one-lane road here toward an intersection, [00:56:51] the algorithm realizes it should just switch over from, I think, the one-lane road network to the two-lane road network, and on it goes. [00:57:18] Okay, oh, all right, fine, we'll just see the final dramatic moment as it switches from a one-lane road to a two-lane road. [00:57:40] All right, and I think, you know, so this is just using supervised learning to
take as input [00:57:44] what's in front of your car to decide on a steering direction. This is not how state-of-the-art self-driving cars are built today, but, you know, you can do some things in some limited contexts, and I think within several weeks you'll actually be able to build something that is more sophisticated than this. [00:58:05] Um, so after supervised learning, we will in this class spend a bit of time talking about machine learning strategy. You know, I think on the class syllabus we annotate this as learning theory, but what that means is, um, I want to give you the tools to go and apply learning algorithms effectively. And I think I've been fortunate, [00:58:30] over the years, to have constantly visited lots of great tech companies, more than the ones that I've been publicly associated with, right,
but often just as a friend. [00:58:45] I visit various tech companies, ones whose products I'm sure are installed on your cell phone, and, you know, talk to the machine learning teams and see what they're doing and see if I can help them out. And what I see is that there's a [00:58:59] huge difference in the effectiveness with which two different teams can apply the exact same learning algorithm. And I think what I've seen, sadly, is that sometimes there will be a team, even in [00:59:14] some of the best tech companies, right, the elite AI companies, in multiple of them, where you go talk to a team and they'll tell you about something they've been working on for six months, [00:59:25] and then you can quickly take a look at the data and see that the algorithm isn't quite working, and sometimes you can look at what they're doing and go, yeah,
you know, [00:59:36] I could have told you six months ago that this approach was never going to work, right? And what I find is that the most skilled machine learning practitioners are very strategic, by which I mean skilled at deciding what to do: when you work on a machine learning project, you know, you have a lot of decisions to make, right? [00:59:54] Do you collect more data? Do you try a different learning algorithm? Do you rent faster GPUs to train your learning algorithm for longer? Or, if you collect more data, what type of data do you collect? Or, among all of these architecture choices, neural networks, support vector machines, logistic regression, which one do you pick? [01:00:10] There are a lot of decisions you need to make when building these learning algorithms. So one thing that's quite unique to the way we teach this class is that we want to help you become more systematic in driving machine learning as a systematic [01:00:27] engineering discipline, so that when one
[01:00:29] day you work on a machine learning project, you can efficiently figure out what to do next. [01:00:34] And sometimes I make an analogy to how you do software engineering. You know, many years ago I had a friend who would debug code by compiling it, and then this friend would look at all these syntax errors, right, that the C++ compiler outputs, and they thought that the best way to eliminate the errors was to delete all the lines of code with syntax errors. [01:01:02] That was their first serious strategy, and that did not go well, right? It took me a while to persuade them to stop doing that. But, but, so, it turns out that when you run a learning algorithm, you know, it almost never works the first time; that's just life. And the way you go about debugging the learning algorithm will have a huge impact on your efficiency, on how quickly you can build effective learning systems.
And I think [01:01:27] until now, too much of this process of making your learning algorithms work well has been a black magic kind of process, where, you know, there's the decades-of-experience expert, so when you run something and don't know why it's not working, he looks at what you're doing and says, oh yeah, do that, and then, because he's so experienced, it works. [01:01:47] But I think, um, what we're trying to do with the discipline of machine learning is to evolve it from a black magic, tribal knowledge, experience-based thing to a systematic engineering process, right? [01:01:58] And so later this quarter, as we talk about machine learning strategy, or talk about learning theory, I'll try to give you tools on how to go about strategizing, so you can be very efficient in how you yourself, or a team you lead, can build an effective learning system, because I don't want you to be one of those people that, you know,
[01:02:20] to be one of those people that you know waste six months on some direction that [01:02:22] waste six months on some direction that maybe could have relatively quickly [01:02:25] maybe could have relatively quickly figured out what's not promising well [01:02:27] figured out what's not promising well maybe one loss analogy if you if you use [01:02:30] maybe one loss analogy if you if you use the optimizing code right making code [01:02:32] the optimizing code right making code run faster not tell me if you learn that [01:02:36] run faster not tell me if you learn that less experience software engineers will [01:02:39] less experience software engineers will just dive in and optimize the code to [01:02:41] just dive in and optimize the code to try to make it run faster right let's [01:02:42] try to make it run faster right let's take the C++ and code in the 70 or [01:02:44] take the C++ and code in the 70 or something but more experienced people [01:02:46] something but more experienced people will run the profiler to try to figure [01:02:48] will run the profiler to try to figure out what part of your code is actually [01:02:50] out what part of your code is actually the whole night and then just focus on [01:02:51] the whole night and then just focus on changing on that so one things hope to [01:02:54] changing on that so one things hope to do this quarter is convey to you some of [01:02:58] do this quarter is convey to you some of these more systemic engineering [01:02:59] these more systemic engineering principles [01:03:00] principles oh and actually this is a actually I've [01:03:05] oh and actually this is a actually I've been down I've been writing this up [01:03:08] been down I've been writing this up actually so how many of you have heard a [01:03:10] actually so how many of you have heard a machine there on in your name oh just a [01:03:12] machine there on in your name oh just a few of you interesting so actually - so [01:03:15] few of you 
Interesting! So actually, [01:03:15] if any of you are interested: just in my spare time I've been writing a book to try to codify systematic engineering principles for machine learning, and so if you want a, you know, free draft copy of the book, sign up for the mailing list here. I tend to just write stuff and put it on the internet for free, so if you want a free draft copy of the book, [01:03:43] you know, go to this website, enter your email address, and the website will send you a free copy of the book. We'll talk a little bit about these engineering principles as well, okay? All right. [01:03:55] So, first subject, supervised learning; second subject, learning theory; and the third [01:04:02] major subject we'll talk about is deep learning. And so, you know, there are a lot of tools in machine learning, and many of them are worth learning about, and I use many different tools in machine learning for many different applications. There's one
subset of [01:04:16] machine learning that's really hot right now, because it's just advancing very rapidly, which is deep learning, and so we'll spend a bit of time talking about deep learning so that you understand the basics of how to train a neural network as well. But I think that, um, whereas CS229 [01:04:32] covers a much broader set of algorithms, which are all useful, CS230 more narrowly covers just deep learning. [01:04:42] So other than deep learning, slash, after deep learning, that is, neural networks, the fourth of the five major topics we'll cover will be unsupervised learning. [01:05:06] So you saw me draw a picture like this just now, right? And this would be a classification problem, like the tumor malignant-or-benign problem. This is a classification problem, and that was a supervised learning problem, because you have to
learn the function mapping from X to Y. [01:05:25] Unsupervised learning would be if I give you a data set like this with no labels, so you're just given inputs X and no Y, and you're asked to find something interesting in this data, to figure out, you know, interesting structure in this data. [01:05:41] And in this data set it looks like there are two clusters, and an unsupervised learning algorithm which you'll learn about, called k-means clustering, will discover this structure in the data. [01:05:53] Other examples of unsupervised learning: you know, Google News is actually a very interesting website; sometimes I use it to look up, right, the latest news. This is an old example, but Google News every day [01:06:04] crawls, or reads, many, many thousands or tens of thousands of news articles on the Internet and groups them together. For example, there's a set of articles on the BP oil well spill, and it has taken
a lot of the articles written [01:06:22] by different reporters and grouped them together, so you can, you know, figure out what's going on with the BP Macondo oil well, right: this is a CNN article about the oil well spill, there's a Guardian article about the oil well spill. This is an example of a clustering algorithm, where it's taking these different news sources and figuring out that these are all stories, kind of, about the same thing. [01:06:50] Other examples of clustering, just taking data and figuring out what groups belong together: a lot of the work on genetic data. This is a visualization of genetic microarray data, where, given data like this, you [01:07:07] can group individuals into different types of individuals with different characteristics. Or clustering algorithms, grouping this type of data together, are used to organize computing clusters, you know, to figure out which machines' workloads are more related to each other, and to organize
communities probably so to take [01:07:26] organize communities probably so to take a social network like LinkedIn or [01:07:29] a social network like LinkedIn or Facebook or other social networks and [01:07:31] Facebook or other social networks and figure out which are the groups of [01:07:33] figure out which are the groups of friends on which are the cohesive [01:07:34] friends on which are the cohesive communities within a social network or [01:07:37] communities within a social network or market segmentation actually many [01:07:39] market segmentation actually many companies I've worked with look at the [01:07:41] companies I've worked with look at the customer database and cluster the users [01:07:43] customer database and cluster the users together so you can say that looks like [01:07:45] together so you can say that looks like where four types of users you know looks [01:07:47] where four types of users you know looks like that there are the young [01:07:50] like that there are the young professionals looking to develop [01:07:52] professionals looking to develop themselves they're the you know soccer [01:07:55] themselves they're the you know soccer moms and soccer dads that the discount [01:07:57] moms and soccer dads that the discount in this case who can then market to the [01:07:59] in this case who can then market to the different market segments separately and [01:08:02] different market segments separately and and actually many years ago my friend [01:08:04] and actually many years ago my friend Andrew Moore was using this type of data [01:08:08] Andrew Moore was using this type of data for astronomical data analysis group [01:08:10] for astronomical data analysis group together galaxies question Oh is almost [01:08:18] together galaxies question Oh is almost worse than the clustering knows not so [01:08:20] worse than the clustering knows not so as well as well as learning brought me [01:08:22] as well as well as learning brought me is the 
[01:08:23] So you have just X, and you find interesting things about it, right. So for example, actually, here, shoot, this won't work without audio; we'll do this later in the class, I guess, maybe I'll save it and do it later. [01:08:39] The cocktail party problem is another unsupervised learning problem. I'd need the audio for this to explain it, though; let me think how to explain this. [01:08:52] You know, for the cocktail party problem, I'll try to do the demo when we can get audio working on this laptop. It's a problem where, if you have a noisy room [01:09:01] and you stick multiple microphones in the room, they record overlapping voices, so there are no labels: there's just an array of multiple microphones in a room with lots of people talking, and how can you have the algorithm separate out the people's voices? So that's an unsupervised learning problem, because [01:09:18] there are no labels; you just stick
microphones in the room and have them [01:09:22] record different people's voices, the voices of multiple people talking at the same time, and then you have the algorithm try to separate out people's voices. And one of the problem set exercises you'll do later is: if we have, you [01:09:33] know, five people talking, so each microphone records five people's overlapping voices, right, because each microphone hears five people at the same time, how can you have an algorithm separate out these voices so you can get clean recordings of just one voice at a time? [01:09:50] So that's called the cocktail party problem, and the algorithm you use to do this is called ICA, independent components analysis, and that's something you'll implement in one of the later homework exercises. [01:10:02] And there are other examples of unsupervised learning as well: the Internet has tons of unlabeled text data that you can just pull down.
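As a hedged sketch of the source-separation idea (not necessarily the exact formulation used in the homework): a minimal deflationary FastICA-style iteration in NumPy, with two synthetic "voices" and a mixing matrix invented for the example.

```python
import numpy as np

# Two synthetic "voices" (non-Gaussian sources), stand-ins for speech.
t = np.linspace(0, 8, 2000)
s1 = np.sign(np.sin(3 * t))            # square-wave "voice"
s2 = np.sin(5 * t)                     # sine-wave "voice"
S = np.c_[s1, s2]

# Each microphone hears a different mixture of the two voices.
A = np.array([[1.0, 0.5],
              [0.4, 1.0]])             # mixing matrix, unknown to the algorithm
X = S @ A.T                            # the two microphone recordings

# Whiten the recordings: zero mean, identity covariance.
X = X - X.mean(axis=0)
eigval, eigvec = np.linalg.eigh(np.cov(X, rowvar=False))
Xw = X @ eigvec @ np.diag(eigval ** -0.5) @ eigvec.T

# FastICA-style fixed-point iteration with a tanh nonlinearity:
# each row of W converges to one independent component's direction.
W = np.eye(2)
for _ in range(200):
    for i in range(2):
        g = np.tanh(Xw @ W[i])
        w_new = (Xw * g[:, None]).mean(axis=0) - (1 - g ** 2).mean() * W[i]
        for j in range(i):             # deflation: stay orthogonal to earlier rows
            w_new -= (w_new @ W[j]) * W[j]
        W[i] = w_new / np.linalg.norm(w_new)

# Columns approximate the original voices, up to sign and ordering.
recovered = Xw @ W.T
```

No labels are used anywhere: the algorithm recovers the individual voices purely from the statistical structure (independence, non-Gaussianity) of the mixed recordings.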
[01:10:10] down data from the internet there no labels necessarily but can you learn [01:10:13] labels necessarily but can you learn interesting things about language figure [01:10:15] interesting things about language figure out what figure out I don't know one of [01:10:17] out what figure out I don't know one of the best cited results recently was [01:10:20] the best cited results recently was learning analogies like you know man is [01:10:22] learning analogies like you know man is the woman as king of the Queen right or [01:10:26] the woman as king of the Queen right or what Tokyo mister Japan as Washington [01:10:30] what Tokyo mister Japan as Washington DC's the United States right to learn [01:10:32] DC's the United States right to learn and energies like that some say you can [01:10:34] and energies like that some say you can learn analogies like that from unlabeled [01:10:36] learn analogies like that from unlabeled data just from text on the internet so [01:10:37] data just from text on the internet so there's also unsupervised learning okay [01:10:40] there's also unsupervised learning okay um so after on sooo eyes learning oh and [01:10:46] um so after on sooo eyes learning oh and I'm surprised earning so you know [01:10:47] I'm surprised earning so you know machine learning is very useful today [01:10:49] machine learning is very useful today turns out that most of the recent wave [01:10:52] turns out that most of the recent wave of economic value created by machine [01:10:55] of economic value created by machine learning is through supervised learning [01:10:56] learning is through supervised learning but there are important use cases for [01:10:59] but there are important use cases for unsupervised learning as well so I use [01:11:01] unsupervised learning as well so I use them in my work occasionally and there's [01:11:04] them in my work occasionally and there's also beating edge for a lot of exciting [01:11:06] also beating edge for a lot of 
[01:11:07] And then the final topic, the fifth of the five topics we cover (so we'll talk about supervised learning, machine learning strategy, deep learning, unsupervised learning), the fifth one is reinforcement learning. Which is, um, let's say I give you the keys to this Stanford autonomous helicopter. This helicopter is actually sitting in my office; I'm trying to figure out how to get rid of it. And I ask you to write a program to make it fly, right? So how do you do that? [01:11:32] So this is a video of a helicopter flying. The audio is just a lot of helicopter noise, so that's not important, but as we zoom out the video you can see it's upside down in the sky, right? Yeah, that's kind of cool; I was the cameraman that day. But so you can use learning algorithms to get, you know, robots to do pretty interesting things like this, and it turns out that a good way to do this is through reinforcement learning.
[01:12:05] So what's reinforcement learning? Um, it turns out that no one knows the optimal way to fly a helicopter, right? If you fly a helicopter, you have two control sticks that you're moving, but no one knows the optimal way to move the control sticks so that the helicopter flies itself. So what you do is let the helicopter do whatever it wants; it's a bit like training a dog. You can't teach a dog the optimal way to behave. Actually, how many of you have had a pet dog or a pet cat before? It's fascinating. [01:12:36] Okay, so I had a pet dog when I was a kid, and my family made it my job to train the dog. So how do you train a dog? You let the dog do whatever it wants, and then whenever it behaves well you go, "Oh, good dog," and when it misbehaves you go, "Bad dog." And then over time the dog learns to do more of the good-dog things and fewer of the bad-dog things.
[01:12:58] And so reinforcement learning is a bit like that, right? I don't know the optimal way to fly a helicopter, so you let the helicopter do whatever it wants, and then whenever it flies well, you know, whenever it hovers or flies accurately without getting blown around too much, you go, "Oh, good helicopter," and when it crashes you go, "Bad helicopter." And it's the job of the reinforcement learning algorithm to figure out how to control it over time so as to get more of the good-helicopter things and fewer of the bad-helicopter things. [01:13:29] Um, and I think, well, just one more video. All right. And so again, given a robot like this, I actually don't know how to program a robot like this; it has all of these joints, right? So how do you get a robot like this to climb over obstacles? Well, this is actually a robot dog, so you can actually say "good dog" and "bad dog."
by [01:13:56] you can actually say good dog dog but by giving those signals called a reward [01:13:58] giving those signals called a reward signal you can have a learning algorithm [01:14:01] signal you can have a learning algorithm figure out by itself how's the optimize [01:14:03] figure out by itself how's the optimize the reward therefore climb over these [01:14:07] the reward therefore climb over these types of obstacles and I think recently [01:14:10] types of obstacles and I think recently the most famous application is a very [01:14:12] the most famous application is a very for student learning happened for game [01:14:14] for student learning happened for game playing playing Atari games or playing a [01:14:16] playing playing Atari games or playing a game of gold can alphago I think that uh [01:14:20] game of gold can alphago I think that uh IIIi I think that uh game play has made [01:14:23] IIIi I think that uh game play has made for some remarkable stunts a remarkable [01:14:26] for some remarkable stunts a remarkable PR but I'm also equally excited or maybe [01:14:29] PR but I'm also equally excited or maybe even more excited about the integrals [01:14:31] even more excited about the integrals their reinforcement or anything is [01:14:33] their reinforcement or anything is making it's a robotics applications so I [01:14:35] making it's a robotics applications so I think I think yeah reinforcement has [01:14:38] think I think yeah reinforcement has been proven to be fantastic for playing [01:14:40] been proven to be fantastic for playing games is also getting making real [01:14:42] games is also getting making real traction in optimizing robots and [01:14:45] traction in optimizing robots and optimizing logistic system things like [01:14:48] optimizing logistic system things like that so you learn about all these things [01:14:53] that so you learn about all these things last thing for today I hope that you [01:14:56] last thing for today I hope that 
[01:14:56] Last thing for today: I hope that you will start to talk to people in the class, to make friends, find project partners, and form study groups. And if you have any questions, you know, log on to Piazza, ask your questions, and help others answer their questions. So let's break for today, and I look forward to seeing you on Wednesday.

================================================================================
LECTURE 002
================================================================================
Stanford CS229: Machine Learning - Linear Regression and Gradient Descent | Lecture 2 (Autumn 2018)
Source: https://www.youtube.com/watch?v=4b4MUYve_U8
---
Transcript

[00:00:03] Morning, and welcome back. So what we'll see today in class is the first in-depth discussion of a learning algorithm: linear regression. In particular, over the next hour and a bit you'll see linear regression, batch and stochastic gradient descent, which is an algorithm for fitting linear regression models, and then the normal equations, which is a very efficient way to fit linear models. And we're going to define notation and a few concepts
today that will lay the foundation for a lot of the work that we'll see the rest of this quarter. [00:00:48] So, to motivate linear regression, which is maybe one of the simplest learning algorithms: you remember the ALVINN video, the autonomous driving video that I showed in class on Monday? That self-driving car video was a supervised learning problem, and the term supervised learning meant that you were given inputs X, which was a picture of what's in front of the car, and the algorithm had to map that to an output Y, which was the steering direction. And that was a regression problem, because the output Y that you want is a continuous value, as opposed to a classification problem, where Y is discrete. We'll talk about classification next Monday; today it's supervised learning, regression. So I think the simplest, maybe the simplest
possible learning algorithm for a supervised learning regression problem is linear regression. And to motivate that, rather than using the self-driving car example, which is quite complicated, we'll build up a supervised learning algorithm using a simpler example. [00:01:53] So let's say you want to predict, or estimate, the prices of houses. The way you'd build a learning algorithm is to start by collecting a data set of houses and their prices. This is a data set that we collected off Craigslist a little while back; this is data from Portland, Oregon. So there's the size of a house in square feet, and there's the price of the house in thousands of dollars. So there's a house that is 2,104 square feet whose asking price was $400,000, a house with that size and that price, and so on. [00:02:44] Okay, and maybe more conventionally, if you plot this data, with the size there and the price there, you see some data set
like that. And what we'll end up doing today is fit a straight line to this data, and we'll go through how to do that. [00:03:02] So in supervised learning, the process of supervised learning is that you have a training set, such as the data set that I drew on the left, and you feed this to a learning algorithm, and the job of the learning algorithm is to output a function that makes predictions about housing prices. By convention, I'm going to call this function that it outputs a hypothesis. And the job of the hypothesis is, you know, it can take as input the size of a new house, the size of a different house that you haven't seen yet, and it will output the estimated price. [00:03:53] Okay, so the job of the learning algorithm is to take as input a training set and output a hypothesis, and the job of the hypothesis is to take as input any size of a house and try to tell you what it thinks
should be the price of that house. [00:04:08] Now, when designing a learning algorithm, and, you know, even though linear regression is something you may have seen in a linear algebra class or some other class before, the way you go about structuring a machine learning algorithm is important. The design choices of, you know, what is the workflow, what is the data set, what does the hypothesis represent: these are the key decisions you have to make in pretty much every supervised learning, every machine learning algorithm's design. So as we go through linear regression, I'll try to describe the concepts clearly, because they'll lay the foundation for the rest of the algorithms, sometimes much more complicated ones, that you'll see later this quarter. [00:04:45] So when designing a learning algorithm, the first thing we'll need to ask is: how do you represent the hypothesis? And in linear regression,
for the purpose of this lecture, we're going to say that the hypothesis is going to take as input the size x and output the estimated price as a linear function of x: h(x) = θ0 + θ1·x. [00:05:17] And then the mathematicians in the room will say, technically that isn't a linear function, it's an affine function, because of the θ0 term. You know, in machine learning we sometimes just call this a linear function, but technically it's an affine function; it doesn't really matter. [00:05:33] So, more generally: in this example we have just one input feature x. More generally, if you have multiple input features, if you have more data, more information about these houses, such as the number of bedrooms (excuse my handwriting; okay, that word is "bedrooms")... I guess my father-in-law lives a little bit outside Portland, and he's actually really into real estate, so this is actually a real data set from Portland. So, more generally, if you know
the size as well as the number of bedrooms of these houses, then you may have two input features, where x1 is the size and x2 is the number of bedrooms. [00:06:26] I'm using the pound sign, #bedrooms, to denote the number of bedrooms. And you might estimate the price of a house as h(x) = θ0 + θ1·x1 + θ2·x2, where x1 is the size of the house and x2 is the number of bedrooms, okay? [00:06:57] So, in order to simplify the notation, in order to make that notation a little bit more compact, I'm also going to introduce this other notation, where we write the hypothesis as a sum from j = 0 to 2 of θj·xj, where, for conciseness, we define x0 to be equal to 1, okay? See, if you define x0 to be a dummy feature that always takes on the value 1, then you can write the hypothesis h(x) this way: the sum from j = 0 to 2
of just θj·xj. It's the same as the equation that you saw at the upper right. [00:07:56] And so here θ becomes a three-dimensional parameter vector, θ0, θ1, θ2, with the index starting from 0, and the features become a three-dimensional feature vector, x0, x1, x2, where x0 is always 1, x1 is the size of the house, and x2 is the number of bedrooms of the house. [00:08:22] So, to introduce a bit more terminology: θ is called the parameters of the learning algorithm, and the job of the learning algorithm is to choose parameters θ that allow you to make good predictions about the prices of houses, right? [00:08:44] And just to lay out some more notation that we're going to use throughout this quarter, I'm going to use, as a standard, m to denote the number of training examples. So m is going to be the number of rows in the table above.
Each house in your training set, each row, is one training example. [00:09:18] You've already seen me use x to denote the inputs, and often the inputs I'll call features. You know, I think as an emerging discipline grows up, notation kind of emerges depending on what different scientists used the first time they wrote a paper. So, you know, the fact that we call these things hypotheses, frankly, I don't think that's a great name, but I think someone many decades ago wrote a few papers calling it a hypothesis, and then others followed, and we kind of got stuck with some of this terminology. But x is what's called the input features, sometimes the input attributes, and y is the output, right? And sometimes we call this the target variable. [00:10:07] And so (x, y) is one training example, [00:10:18] and I'm going to use this notation, x superscript i comma y superscript i in parentheses, written (x^(i), y^(i)), to denote the i-th training example, okay? So the superscript
in parentheses, i, that's not exponentiation. This notation (x^(i), y^(i)) is just a way of writing an index into the table of training examples above. [00:10:54] So, for example, if the first training example is a house of size 2104, then x^(1)_1 would be equal to 2104, right, because this is the size of the first house in the training set. And x^(2)_1, feature one of the second example, would be 1416 in our example. So the superscript in parentheses is just the index into the different training examples, where i runs from 1 through m, the number of training examples you have. [00:11:34] And then one last bit of notation: I'm going to use n to denote the number of features you have for the supervised learning problem. So in this example, n is equal to 2, because we have two features: the size of the house and the number of bedrooms.
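[Editor's note: in code, that table and the indexing notation map onto one array row per example. The sizes 2104 and 1416 come from the lecture's table; the bedroom counts below are placeholders I've invented, since only the sizes were read out.]

```python
import numpy as np

# One row per training example: [size in sq ft, #bedrooms].
# Sizes 2104 and 1416 are from the lecture; bedroom counts are made up.
X = np.array([[2104, 3],
              [1416, 2]])
m, n = X.shape          # m training examples, n features

# x^(i)_j in the lecture's notation is X[i-1, j-1] with 0-indexed arrays:
x_1_1 = X[0, 0]         # x^(1)_1, size of the first house  -> 2104
x_2_1 = X[1, 0]         # x^(2)_1, size of the second house -> 1416
print(m, n, x_1_1, x_2_1)
```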
[00:11:56] have two features which is the size the house and the number of bedrooms so two [00:11:58] house and the number of bedrooms so two features which is why you can take this [00:12:02] features which is why you can take this write and write this as a sum from J [00:12:08] write and write this as a sum from J equals 0 to n and so here X and theta [00:12:16] equals 0 to n and so here X and theta are n plus 1 dimensional because we [00:12:18] are n plus 1 dimensional because we added the extra X 0 and theta 0 ok so so [00:12:24] added the extra X 0 and theta 0 ok so so if you have two features then these are [00:12:26] if you have two features then these are three dimensional vectors and more [00:12:28] three dimensional vectors and more generally if you have n features you end [00:12:30] generally if you have n features you end up with X and theta being n plus 1 [00:12:33] up with X and theta being n plus 1 dimensional features all right and you [00:12:37] dimensional features all right and you know you see this notation multiple [00:12:40] know you see this notation multiple times in multiple algorithms throughout [00:12:41] times in multiple algorithms throughout this quarter so if you you know don't [00:12:44] this quarter so if you you know don't manage to memorize all these symbols [00:12:45] manage to memorize all these symbols right now don't worry about it you see [00:12:47] right now don't worry about it you see them over and over and over come [00:12:48] them over and over and over come familiar alright so um given the data [00:12:53] familiar alright so um given the data set and given that this is the way you [00:12:56] set and given that this is the way you define the hypothesis how do you choose [00:12:59] define the hypothesis how do you choose the parameters right so you're the [00:13:01] the parameters right so you're the learning algorithms job is to choose [00:13:02] learning algorithms job is to choose values for parameters theta so that it 
[00:12:53] All right, so, given the data set, and given that this is the way you define the hypothesis, how do you choose the parameters? The learning algorithm's job is to choose values for the parameters θ so that it can output a hypothesis. So how do you choose the parameters θ? Well, what we'll do is choose θ such that h(x) is close to y for the training examples. [00:13:38] And I think the final bit of notation: I've been writing h(x) as a function of the features of the house, as a function of the size and the number of bedrooms of the house. Sometimes, to emphasize that h depends both on the parameters θ and on the input features x, I'm going to write h subscript θ of x, h_θ(x), to emphasize that the hypothesis depends both on the parameters and on, you know, the input features x, right? But sometimes, for notational convenience, I'll just write this as h(x). Sometimes I include the θ there; they mean the same thing, it's just an abbreviation in notation. [00:14:19] But so, in order to learn a set of parameters, what we'll want to do is choose the parameters θ so that, at
least for the houses whose prices you know, the learning algorithm outputs prices that are close to what you know were the correct prices, the asking prices, for that set of houses. [00:14:44] And so, more formally, in the linear regression algorithm, also called ordinary least squares, we will want to minimize (I'm going to build out this equation one piece at a time, okay?) the squared difference between what the hypothesis outputs, h_θ(x), and y: (h_θ(x) - y)^2. So let's say we want to minimize the squared difference between the prediction, which is h(x), and y, which is the correct price. And so what we want to do is choose values of θ that minimize that. [00:15:32] To fill this out: you have m training examples, so I'm going to sum, from i = 1 through m, that squared difference. So this is the sum
over i = 1 through, let's say, the 50 examples you have, of the squared difference between what your algorithm predicts and what the true price of the house is. And then finally, by convention, we put a one-half constant in front, because when we take derivatives to minimize this later, the 1/2 will make some of the math a little bit simpler. Adding a 1/2 and minimizing that formula gives you the same answer as minimizing without it, but we often put the 1/2 there since it makes the math a little simpler later. Okay. [00:16:18] And so, in linear regression, I'm going to define the cost function

J(θ) = (1/2) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²

and we'll find parameters θ that minimize the cost function J(θ). Okay. And a question I've often gotten is: why squared error? Why not absolute error, or this error to the power of 4?
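As a quick illustrative sketch (my own code, not from the lecture; the function name `cost` and the toy data are assumptions), the cost function J(θ) just defined can be computed like this:

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = 1/2 * sum_i (h_theta(x_i) - y_i)^2.

    X is an (m, n+1) design matrix whose first column is all ones
    (the x_0 = 1 intercept feature); y holds the m correct prices.
    """
    residuals = X @ theta - y           # h_theta(x_i) - y_i for every example
    return 0.5 * np.sum(residuals ** 2)

# Two toy training examples lying exactly on the line y = 2x:
X = np.array([[1.0, 1.0],
              [1.0, 2.0]])
y = np.array([2.0, 4.0])
print(cost(np.array([0.0, 2.0]), X, y))   # 0.0  (perfect fit)
print(cost(np.array([0.0, 0.0]), X, y))   # 10.0 (= 1/2 * (2^2 + 4^2))
```

Minimizing this quantity over θ is exactly the problem gradient descent will solve in a moment.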
We'll talk more about that when we talk about a generalization of linear regression, generalized linear models, which we should do next week. You'll see that linear regression is a special case of a bigger family of algorithms called generalized linear models, and that using squared error corresponds to a Gaussian; that will justify a little bit more why squared error, rather than absolute error or error to the power of 4. More on that next week. So let me just check and see if there are any questions. [00:17:29] Okay, cool. All right, so next, let's see how you can implement an algorithm to find a value of θ that minimizes J(θ), that minimizes the cost function J(θ). We're going to use an algorithm called gradient descent. [00:18:12] And so with gradient descent, we are going to start with some value of θ, and it
could be, you know, θ equals the vector of all zeros; that would be a reasonable default. We could also initialize randomly; it doesn't really matter. But θ is this three-dimensional vector, and I'm writing 0 with an arrow on top to denote the vector of all zeros: there's a zero, zero, zero everywhere. Right, so: start with some initial value of θ, and we're going to keep changing θ to reduce J(θ). Okay. [00:19:28] But let me show you a visualization of gradient descent first, and then we'll write out all the math. So, all right, let's say you want to minimize some function J(θ), and it's important to get the axes right in this diagram: in this diagram, the horizontal axes are θ_0 and θ_1, and what you want to do is find values for θ_0 and θ_1. In our example there's really θ_0, θ_1 and θ
2, but since that's three-dimensional I can't plot it, so I'm just using θ_0 and θ_1. But what you want to do is find values of θ_0 and θ_1 that minimize the height of the surface J(θ), so maybe this looks like a pretty good point, or something. Okay. And so in gradient descent, you know, you start off at some point on this surface, and you do that by initializing θ_0 and θ_1 either randomly or to the value of all zeros or something; it doesn't matter too much. And what you do is imagine you are standing on this little hill, standing at that point, at that little cross. What you do in gradient descent is turn around, turn all 360 degrees, and look around you, and see: if you were to take a tiny little step, you know, a tiny
little baby step, in what direction should you take that little step to go downhill as fast as possible? Because you're trying to go downhill, which is to go to the lowest possible elevation, the lowest possible point of J(θ). So what gradient descent will do is stand at that point, look all around you, and say: well, what direction should I take a little step in, to go down as quickly as possible? Because you want to minimize J(θ); you want to reduce the value of J(θ); you want to go to the lowest possible elevation on this surface. And so gradient descent will take that little baby step, right, and then repeat. Now you're a little bit lower on the surface, so you can take a look all around you and say: oh, looks like that little direction is the direction of the steepest gradient downhill. So you take another little step, take
another step, and so on, until you get to, hopefully, a local optimum. Now, one property of gradient descent is that, depending on where you initialize the parameters, you can get to different local optima, different points. Right, so previously we had started at that little point ×, but imagine you had started just a few steps over to the right, at that new ×. If you run gradient descent from that new point, then that would be the first step, there the second step, and so on, and you would have gotten to a different local optimum, a different local minimum. Okay. It turns out that when you run gradient descent on linear regression, there will not be local optima; we'll talk about that in a little bit. Okay, so let's formalize the gradient descent algorithm. [00:23:03] Each step of gradient descent is
implemented as follows. So remember, in this example the training set is fixed, right? You've collected the dataset of housing prices from Portland, Oregon, so it just sits there in your computer's memory, and so the cost function J is a fixed function of the parameters θ. The only thing you're going to do is tweak or modify the parameters θ. One step of gradient descent can be implemented as follows: we'll say that θ_j gets updated as... let me just write this out, with a bit more notation. I'm going to use :=, and let me use this notation to denote assignment. What this means is we're going to take the value on the right and assign it to the variable on the left. Right, so in other words, in the notation we'll use, a := a + 1 means increment the value of a by
one, whereas, you know, a = b: if I write a = b, I'm asserting a statement of fact; I'm asserting that the value of a is equal to the value of b. And hopefully I won't ever write a = a + 1, because that's never true. All right. So in each step of gradient descent, for each value of j (you're going to do this for j = 0, 1, up to n, where n is the number of features), you take θ_j and update it according to

θ_j := θ_j − α · (∂/∂θ_j) J(θ)

where α is called the learning rate, and this formula is the partial derivative of the cost function J(θ) with respect to the parameter θ_j. Okay. And this is partial derivative notation, for those of you who, I know, haven't seen calculus for a while, or haven't seen some of the prerequisites for a while. We'll go
over some more of this in a little bit greater detail in discussion section, but I'll do this quickly now. [00:25:46] If you took a calculus class a while back, you may remember that the derivative of a function defines the direction of steepest descent; it defines the direction that allows you to go downhill as steeply as possible on the hill. And the question was: how do you determine the learning rate? Let me get back to that; it's a good question. For now, you know, there's the theory and there's the practice, and in practice you set it to 0.01. Let me say a bit more about that later, but if you actually scale all the features to between zero and one, or minus one and plus one, or something like that, then you could try a few values and see what lets you minimize the function best. If the features are scaled to plus or minus one, I usually start with 0.01
and try increasing and decreasing it; I'll say a little more about that later. All right, cool. [00:26:56] So let me just quickly show how the derivative calculation is done. You know, I'm going to do a few more equations in this lecture, but all of these definitions and derivations are written out in full detail in the lecture notes posted on the course website. So sometimes I'll do more math in class, when we want you to see the steps of the derivation, and sometimes, to save time in class, we'll gloss over the mathematical details and leave you to read over the full details in the lecture notes on the course website. So: the partial derivative with respect to θ_j of J(θ), that's the partial derivative with respect to θ_j of (1/2)(h_θ(x) − y)². And I'm going to do a slightly simpler version, assuming we have just one training
example. Right, the actual definition of J(θ) has a sum over i from 1 to m, over all the training examples; I'm just forgetting that sum for now. So if you have only one training example: from calculus, if you take the derivative of a square, you know, the 2 comes down, and that cancels out with the half. So: 2 times 1/2 times the thing inside, right, and then, by the chain rule of derivatives, that's times the partial derivative of (h_θ(x) − y). So if you take the derivative of a square, the 2 comes down, and then you take the derivative of what's inside and multiply by that, right. And so the 2 and the 1/2 cancel out, so this leaves you with (h_θ(x) − y) times the partial derivative with respect to θ_j of θ_0 x_0 + θ_1 x_1 + ⋯ + θ_n x_n − y, where I just took the definition of h_θ(x) and expanded it out to that sum, because h_θ(x)
is just equal to that sum. So if you look at the partial derivative of each of these terms with respect to θ_j, the partial derivative of every one of these terms with respect to θ_j is going to be 0, except for the term corresponding to j. Because, you know, if j were equal to 1, say, then this term θ_0 x_0 doesn't depend on θ_1; this term, this term, all of them do not depend on θ_1. The only term that depends on θ_1 is the term θ_1 x_1 over there, and the partial derivative of that term with respect to θ_1 would be just x_1. And so when you take the partial derivative of this big sum with respect to θ_j in general, not just j = 1, the only term that even depends on θ_j is the term θ_j x_j, and so the partial derivatives of all the other terms end up being zero, and the partial derivative of this
term with respect to θ_j is equal to x_j. Okay. And so this ends up being (h_θ(x) − y) · x_j. Okay, and again, if you haven't played with calculus for a while, if you don't quite remember what a partial derivative is or don't quite get what I just said, don't worry too much about it; we'll go over it a bit more in section, and then also read through the lecture notes, which go over this in more detail, and more slowly, than we might do in class. [00:31:01] So, plugging this in: let's see, we've just calculated that this partial derivative is equal to that, and so plugging it back into that formula, one step of gradient descent is the following: we will let θ_j be updated according to θ_j := θ_j − α · (h_θ(x) − y) · x_j. Okay. Now I'm just going to add a few more things to this equation.
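To sanity-check the chain-rule result above, here is a small sketch (my own code; the names and numbers are made up) that compares the closed-form single-example gradient (h_θ(x) − y) · x_j against a numerical finite-difference estimate:

```python
import numpy as np

def analytic_grad(theta, x, y):
    """For one example, dJ/dtheta_j = (h_theta(x) - y) * x_j."""
    return (x @ theta - y) * x

def numerical_grad(theta, x, y, eps=1e-6):
    """Centered finite differences on J(theta) = 1/2 (h_theta(x) - y)^2."""
    J = lambda t: 0.5 * (x @ t - y) ** 2
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (J(theta + step) - J(theta - step)) / (2 * eps)
    return grad

x = np.array([1.0, 2.0, 3.0])        # x_0 = 1 is the intercept feature
y = 5.0
theta = np.array([0.5, -1.0, 2.0])   # h_theta(x) = 4.5, so h - y = -0.5
print(analytic_grad(theta, x, y))    # [-0.5 -1.  -1.5]
print(numerical_grad(theta, x, y))   # agrees to about 1e-6
```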
I did this for one training example; I kind of used a definition of the cost function J(θ) defined using just one single training example. But you actually have m training examples, and so the correct formula for the derivative is actually this thing summed over all m training examples, because the derivative of a sum is the sum of the derivatives. So if you redo this derivation, summing with the correct definition of J(θ), which sums over all m training examples, you end up with a sum over i = 1 through m of that, where, remember, x^(i) is the i-th training example's input features, and y^(i) is the target label, the price, in the i-th training example. And so this is the actual correct formula for the partial derivative of the cost function J(θ) with respect to θ_j when it's defined using all of the training examples.
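Putting the full m-example update together, a minimal batch gradient descent sketch might look like this (my own illustration; the function name, data, and hyperparameters are assumptions, not from the lecture):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.05, iters=2000):
    """Repeat (here for a fixed number of iterations): update every
    theta_j simultaneously by theta_j := theta_j - alpha * dJ/dtheta_j,
    where the gradient sums over all m training examples."""
    theta = np.zeros(X.shape[1])        # initialize to the zero vector
    for _ in range(iters):
        grad = X.T @ (X @ theta - y)    # vectorized sum_i (h(x_i) - y_i) x_i
        theta = theta - alpha * grad
    return theta

# Noiseless data generated from y = 1 + 2x, so gradient descent
# should recover theta close to (1, 2).
x1 = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x1), x1])   # prepend x_0 = 1
y = 1.0 + 2.0 * x1
print(gradient_descent(X, y))   # approximately [1. 2.]
```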
And so the gradient descent algorithm is to repeat until convergence, carrying out this update, and in each iteration of gradient descent you do this update for j = 0, 1, up to n, where n is the number of features; n was 2 in our example. Okay. And if you do this, then, as I'll show you in the animation, hopefully you find a pretty good value of the parameters θ. [00:33:56] So it turns out that when you plot the cost function J(θ) for a linear regression model, unlike the earlier diagram I'd shown, which had local optima, if J(θ) is defined the way we just defined it for linear regression, as this sum of squared terms, then J(θ) turns out to be a quadratic function, the sum of these squared terms.
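That sum-of-squares structure can be checked numerically: writing J(θ) = (1/2)‖Xθ − y‖², the matrix of second derivatives is XᵀX, which is positive semidefinite for any design matrix X, so the surface has no spurious local optima. A quick check (my own sketch, not from the lecture):

```python
import numpy as np

# For J(theta) = 1/2 ||X theta - y||^2 the Hessian is X^T X.
# Its eigenvalues are always >= 0, which is why the cost surface
# is bowl-shaped, with no local optima besides the global one.
rng = np.random.default_rng(0)
X = rng.normal(size=(49, 3))            # 49 random examples, 3 features
eigenvalues = np.linalg.eigvalsh(X.T @ X)
print(eigenvalues)
print(bool(np.all(eigenvalues >= -1e-9)))   # True, for any X you try
```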
so J(θ) will always look like a big bowl, like this, and so J(θ) does not have local optima; or rather, the only local optimum is also the global optimum. The other way to look at a function like this is to look at the contours of the plot: you get the contours by taking the big bowl and cutting horizontal slices, and plotting where the edges of those horizontal slices fall. So the contours of a big bowl, or more formally of this quadratic function, will be ellipses like these, these ovals. And so if you run gradient descent on this algorithm, let's say I initialize my parameters at that little × shown over here. Usually you'd initialize to, say, the vector of all zeros, but it doesn't matter too much, so let's
initialize over there. Then with one step of gradient descent the algorithm will take that step downhill, and then with the second step it will take that step downhill. Oh, and by the way, fun fact: if you think about the contours of a function, it turns out that the direction of steepest descent is always at ninety degrees, always orthogonal, to the contour direction; I seem to remember that from high school or something, I think. All right. And so as you take steps downhill, because there's only one global minimum, this algorithm will eventually converge to it. And so, the question just now about the choice of the learning rate α: if you set α to be very, very large, to be too large, then it can overshoot, right; the steps you take can be too large and you can run past the minimum. If you set it to be too small, then you need a lot of iterations and it will be slow. And
[00:36:24] So what happens in practice is usually you try a few values and see what value of the learning rate allows you to most efficiently, you know, drive down the value of J of theta. And if you see J of theta increasing rather than decreasing — if you see the cost function increasing rather than decreasing — then that's a very strong sign that the learning rate is too large. [00:36:50] And so actually, what I often do is try out multiple values of the learning rate alpha, and usually try them on an exponential scale: so try 0.01, 0.02, 0.04, 0.08 — kind of a doubling scale, or a tripling scale — and try a few values and see what value allows you to drive down the cost function fastest. [00:37:14] So I just want to visualize this in one other way, which is with the data. So this is the actual dataset; there are actually 49 points in the dataset.
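The learning-rate sweep described here can be sketched in a few lines. This is a minimal illustration, not code from the lecture: the five-point dataset is made up, and `batch_gd` is a hypothetical helper implementing the batch gradient-descent update on J(theta) = 1/2 * sum of squared errors.

```python
import numpy as np

# Made-up stand-in for the housing data (NOT the actual 49-point dataset
# from the slides): column of ones for the intercept, plus the house size.
X = np.c_[np.ones(5), np.array([1.0, 2.0, 3.0, 4.0, 5.0])]
y = np.array([1.5, 2.1, 2.9, 4.2, 5.1])   # prices

def J(theta):
    """Cost: one half the sum of squared errors."""
    r = X @ theta - y
    return 0.5 * r @ r

def batch_gd(alpha, steps=50):
    """Run batch gradient descent, recording J(theta) after every step."""
    theta = np.zeros(2)                      # initialize theta_0 = theta_1 = 0
    costs = [J(theta)]
    for _ in range(steps):
        grad = X.T @ (X @ theta - y)         # derivative sums over all m examples
        theta -= alpha * grad                # subtract: step downhill
        costs.append(J(theta))
    return costs

# Try alphas on a doubling scale and watch whether J(theta) goes down.
for alpha in [0.01, 0.02, 0.04, 0.08]:
    costs = batch_gd(alpha)
    trend = "decreasing" if costs[-1] < costs[0] else "INCREASING -- alpha too large"
    print(f"alpha={alpha}: J went {costs[0]:.2f} -> {costs[-1]:.2f} ({trend})")
```

On this toy problem the smaller alphas drive J down while the larger ones overshoot and make J blow up — exactly the diagnostic described in the lecture.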
[00:37:26] So m, the number of training examples, is 49. And so if you initialize the parameters to 0, that means initializing your hypothesis — initializing your straight-line fit to the data — to be that horizontal line, right? So if you initialize theta 0 equals 0, theta 1 equals 0, then your hypothesis is, you know, for any input size of house, the estimated price is 0. And so your hypothesis starts off as the horizontal line there: whatever the input x, the output y is 0. [00:38:02] And what you're doing as you run gradient descent is you're changing the parameters theta, right? So the parameters went from this value, to this value, to this value, and so on. And so the other way of visualizing gradient descent is: if gradient descent starts off with this hypothesis, then with each iteration of gradient descent you are trying to find different
[00:38:26] values of the parameters theta that allow the straight line to fit the data better. So after one iteration of gradient descent, this is the new hypothesis: you now have different values of theta 0 and theta 1 that fit the data a little bit better. After two iterations you end up with that hypothesis. And with each iteration, gradient descent is trying to minimize J of theta — trying to minimize one half of the sum of squared errors of the hypothesis's predictions on the different examples, right? Well, three iterations of gradient descent, four iterations, and so on, and then a bunch more iterations, and eventually it converges to that hypothesis, which is a pretty decent straight-line fit to the data. Okay — so, question? [00:39:30] Oh, sure, let me just repeat the question: why are you subtracting alpha times the gradient rather than adding alpha times the gradient? Um, let me
[00:39:42] raise the screen. So let me suggest you work through one example. It turns out that if you add alpha times the gradient, you'll be going uphill rather than going downhill. And maybe one way to see that would be, um, you know, take a quadratic function, right? [00:40:01] If you're here, the gradient is in the positive direction, and you want to reduce — so this would be theta, and this would be J of theta, yes — so you want theta to decrease. The gradient is positive and you want theta to decrease, so you want to subtract a multiple of the gradient. Um, I think maybe the best way to see that would be to work through an example yourself: set J of theta equals theta squared, and say theta equals one. So here, at this quadratic function, the derivative is positive, so you want to subtract that value from theta, and that takes you downhill. All right. [00:40:34] Great, so you've now seen your first learning algorithm.
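To make the suggested exercise concrete — this is just the arithmetic for the example from the board, J(theta) = theta squared with theta starting at 1, using a made-up learning rate of 0.1:

```python
def J(theta):
    """The quadratic from the example: J(theta) = theta^2, minimized at 0."""
    return theta ** 2

def dJ(theta):
    """Its derivative: dJ/dtheta = 2*theta (positive when theta > 0)."""
    return 2 * theta

theta, alpha = 1.0, 0.1

# Subtracting alpha times the gradient moves theta toward the minimum...
down = theta - alpha * dJ(theta)   # 1.0 - 0.1*2 = 0.8
# ...while adding it moves theta away from the minimum, uphill.
up = theta + alpha * dJ(theta)     # 1.0 + 0.1*2 = 1.2

print(J(down), J(theta), J(up))    # cost goes down on the left, up on the right
```

The subtracted step lands at theta = 0.8 where the cost is lower; the added step lands at theta = 1.2 where the cost is higher — which is why the update rule subtracts.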
[00:40:45] And, you know, gradient descent and linear regression is definitely still one of the most widely used learning algorithms in the world today, and if you implement this — if you implement this today, right — you could use it for some actually pretty decent purposes. Right. [00:41:02] Now, I want to give this algorithm one other name. So our gradient descent algorithm here calculates this derivative by summing over your entire training set m, and so sometimes this version of gradient descent has another name, which is batch gradient descent. And the term batch — you know, and again, I think in machine learning, as a whole community, we just make up names for stuff, and sometimes the names aren't great — but the term batch gradient descent refers to the fact that you look at the entire training set, all 49 examples in the example I just had on PowerPoint. You know, you think of all 49 examples as one batch of data.
data as [00:41:56] and we're gonna process all the data as a batch so hence the name batch gradient [00:41:59] a batch so hence the name batch gradient descent [00:41:59] descent do you disadvantage a bachelor in [00:42:02] do you disadvantage a bachelor in descent is that if you have a giant data [00:42:04] descent is that if you have a giant data set if you have and and you're in era of [00:42:07] set if you have and and you're in era of big data we're really moving to large [00:42:09] big data we're really moving to large and larger data set there and serve use [00:42:11] and larger data set there and serve use you know train machine learning models [00:42:13] you know train machine learning models of like hundreds of millions of examples [00:42:15] of like hundreds of millions of examples and and if you are trying to if you have [00:42:18] and and if you are trying to if you have if you download the US Census database [00:42:21] if you download the US Census database if your data on the United States Census [00:42:23] if your data on the United States Census that's a very large data set and you [00:42:25] that's a very large data set and you want to predict housing prices from [00:42:27] want to predict housing prices from across the United States that that that [00:42:29] across the United States that that that may have a data set with many many [00:42:31] may have a data set with many many millions of examples and the [00:42:33] millions of examples and the disadvantage a batch gradient descent is [00:42:36] disadvantage a batch gradient descent is that in order to make one update to your [00:42:40] that in order to make one update to your parameters in order to take even a [00:42:41] parameters in order to take even a single step of gradient descent you need [00:42:44] single step of gradient descent you need to calculate this sum and if M is say a [00:42:48] to calculate this sum and if M is say a million or ten million or 100 million [00:42:50] million or 
[00:42:53] you need to scan through your entire database — scan your entire dataset — and calculate this for, you know, 100 million examples and sum it up. And so every single step of gradient descent becomes very slow, because you're scanning over — you're reading over, right — like 100 million training examples before you can even, you know, make one tiny little step of gradient descent. Okay. [00:43:17] By the way, I think — I don't know — I feel like in today's era of big data, people start to lose intuitions about what's big and what's not, and I think even by today's standards, like, a hundred million examples is still very big. I only rarely use a hundred million examples — although maybe in a few years we'll look back on a hundred million examples and say that was really small, but at least today. [00:43:36] So the main disadvantage of batch gradient descent is that every single step of gradient descent requires that you read through, you know, your entire dataset.
[00:43:47] Maybe terabytes of data — maybe tens or hundreds of terabytes of data — before you can even update the parameters just once. And if gradient descent needs, you know, hundreds of iterations to converge, then you'd be scanning through your entire dataset hundreds of times. Oh, and sometimes we train our algorithms for thousands or tens of thousands of iterations, and so this gets expensive. [00:44:15] So there's an alternative to batch gradient descent, and let me just write out the algorithm here, then we can talk about it — which is going to repeatedly do this. [00:44:52] So this algorithm, which is called stochastic gradient descent: instead of scanning through all million examples before you update the parameters theta even a little bit, in stochastic gradient descent, in the inner loop of the algorithm, you loop through the examples one at a time, taking a gradient descent step using the derivative of just one single example.
[00:45:22] Of just that one example. Oh, excuse me — right: so let i go from 1 to m, and update theta j for every j. So you update this for j equals 1 through n — update theta j using this derivative — but now the derivative is taken just with respect to the one training example i, and you update this for every j. [00:46:06] And so let me just draw a picture of what this algorithm is doing. If this is the contour, like the one you saw just now — so the axes are theta 0 and theta 1, and the height of the surface, or, you know, the contours, denote J of theta — with stochastic gradient descent, what you do is you initialize the parameters somewhere, and then you look at your first training example: hey, let's just look at one house and see if we can predict that house's price better, and you modify the parameters to increase the accuracy with which you predict the price of that one house.
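A minimal sketch of that inner loop — updating every theta_j using the derivative on a single example i at a time. The data here is synthetic (a made-up "price = 0.5 + 1.0 * size plus noise" model, not the 49-house dataset from the slides), and the loop structure is the stochastic gradient descent update just described:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical housing-style data: m examples, feature = size, target = price.
m = 200
sizes = rng.uniform(1.0, 5.0, m)
prices = 0.5 + 1.0 * sizes + rng.normal(0.0, 0.1, m)
X = np.c_[np.ones(m), sizes]        # x_0 = 1 (intercept term), x_1 = size

theta = np.zeros(2)
alpha = 0.01

# Stochastic gradient descent: in the inner loop, take a step using the
# derivative of the squared error on a single example i — not a sum over m.
for epoch in range(10):
    for i in rng.permutation(m):            # visit examples in random order
        h = X[i] @ theta                    # hypothesis on example i only
        theta -= alpha * (h - prices[i]) * X[i]  # updates every theta_j at once

print(theta)   # noisy path, but it heads toward roughly [0.5, 1.0]
```

Each update touches one example, so the parameters move after every house rather than after a full pass over the data.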
[00:46:40] And because you've fitted just the one house, you know, maybe you end up improving the parameters a little bit, but not quite going in the most direct direction downhill. And then you go look at the second house and say, hey, let's try to fit that house better, and then update the parameters, and look at a third house, a fourth house. [00:47:00] And so as you run stochastic gradient descent, it takes a slightly noisy, slightly random path, but on average it's headed toward the global minimum, okay? So as you run stochastic gradient descent — stochastic gradient descent will actually never quite converge. Batch gradient descent, you saw, kind of went to the global minimum and stopped, right?
[00:47:35] But with stochastic gradient descent, even as you run it, the parameters will oscillate and won't ever quite converge, because you're always running around looking at different houses, trying to do better on just that one house, on that one house, on that one house. But when you have a very large dataset, stochastic gradient descent allows your implementation — allows your algorithm — to make much faster progress. And so when you have very large datasets, stochastic gradient descent is used much more in practice than batch gradient descent. [00:48:11] [Student: is it possible to start with stochastic gradient descent and then switch over to batch gradient descent?] Yes, it is. Oh, and — something we won't talk much about in this class, although you'll see it later — there's also mini-batch gradient descent, where you use, say, 100 examples at a time rather than one example at a time, and that's another algorithm that's actually used more often in practice.
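A sketch of that mini-batch variant, on the same kind of made-up synthetic data as before (illustrative code, not from the lecture): instead of one example per update, average the gradient over roughly 100 examples at a time.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data in the spirit of the lecture's housing example.
m = 1000
X = np.c_[np.ones(m), rng.uniform(1.0, 5.0, m)]
y = 0.5 + 1.0 * X[:, 1] + rng.normal(0.0, 0.1, m)

theta = np.zeros(2)
alpha, batch_size = 0.05, 100     # ~100 examples at a time, as in the lecture

for epoch in range(50):
    order = rng.permutation(m)
    for start in range(0, m, batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        # Average the gradient over the mini-batch — a middle ground between
        # one example (stochastic GD) and all m examples (batch GD).
        grad = Xb.T @ (Xb @ theta - yb) / batch_size
        theta -= alpha * grad

print(theta)   # heads toward roughly [0.5, 1.0]
```

Each update is cheaper than a full batch pass but far less noisy than a single-example update, which is why this middle ground is so common in practice.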
[00:48:47] In practice, you know, when your dataset is large, we rarely ever switch over to batch gradient descent, because batch gradient descent is just so slow, right? So — I don't know, I'm thinking through concrete examples of problems I've worked on, and I think that's right for a lot of modern machine learning, where you have very, very large datasets. Right, so, you know — if you're building a speech recognition system, you might have like a terabyte of data, right? And it's so expensive to scan through a terabyte of data — just reading it from disk, right, is so expensive — that you would probably never even run one iteration of batch gradient descent. [00:49:23] And it turns out the one huge saving grace of stochastic gradient descent is: let's say you run stochastic gradient descent, right, and, you know, you end up with this parameter, and that's the parameter you use for your machine learning system, rather than the global optimum.
[00:49:44] It turns out that parameter is actually not that bad, right? You'll probably make perfectly fine predictions even if you don't get to, like, the global minimum. So what you said, I think, is a fine thing to do — no harm trying it — although in practice, in practice, we don't bother. I think in practice we use stochastic gradient descent, and the thing that actually is more common is to slowly decrease the learning rate: so just keep using stochastic gradient descent, but reduce the learning rate over time so it takes smaller and smaller steps. [00:50:14] If you do that, then what happens is the size of the oscillations will decrease, and so you end up oscillating or bouncing around in a smaller region. So wherever you end up may not be the global minimum, but at least it'll be closer to it. Yeah, so the decreasing learning rate is used much more often. Cool.
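One way to sketch the decreasing-learning-rate idea. The 1/(1 + t/1000) style schedule here is a common choice but an assumption on my part — the lecture doesn't specify a schedule — and the data is again made up:

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up noisy data so the oscillations are visible.
m = 500
X = np.c_[np.ones(m), rng.uniform(1.0, 5.0, m)]
y = 0.5 + 1.0 * X[:, 1] + rng.normal(0.0, 0.5, m)

theta = np.zeros(2)
alpha0 = 0.02
step = 0

for epoch in range(30):
    for i in rng.permutation(m):
        step += 1
        # Shrink the learning rate over time (one of many possible schedules),
        # so the steps — and hence the oscillations — get smaller and smaller.
        alpha = alpha0 / (1.0 + step / 1000.0)
        h = X[i] @ theta
        theta -= alpha * (h - y[i]) * X[i]

print(theta)   # settles near [0.5, 1.0] instead of bouncing around it
```

With a constant learning rate the parameters would keep bouncing in a region whose size scales with alpha; shrinking alpha shrinks that region, which is the behavior described above.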
[00:50:40] Oh, sure — [student question about how to tell when to stop] — what I do is plot J of theta over time. So J of theta is the cost function you're trying to drive down, so monitor J of theta as, you know, it's going down over time, and if it looks like it's stopped going down, then you can say, oh, it looks like it's about stopped going down — it's not training anymore. [00:51:02] Oh, and you know, one nice thing about linear regression is that it has no local optima, and so you run into these convergence-debugging kinds of issues less often. When you're training highly nonlinear things like neural networks — which we'll talk about later in CS229 as well — oh, these issues become more acute. Okay, great. So, um — [00:51:33] [student question about the size of the learning rate] — oh, the learning rate here? It's usually much bigger than that. Yeah, yeah — because if your learning rate was 1 over m times what you'd use for batch gradient descent, then stochastic gradient descent would end up being as slow as batch gradient descent.
[00:51:49] So it's usually much bigger. Okay, so, um — so that's stochastic gradient descent. Oh, and I'll tell you what I do: if you have a relatively small dataset — you know, if you have, I don't know, like hundreds of examples, maybe thousands of examples — where it's computationally efficient to do batch gradient descent, if batch gradient descent doesn't cost too much, I would almost always just use batch gradient descent, because it's one less thing to fiddle with, right? It's just one less thing to have to worry about — the parameters oscillating. [00:52:20] But if your dataset is too large, so that batch gradient descent becomes prohibitively slow, then almost everyone would use, you know, stochastic gradient descent there, right — or some form of stochastic gradient descent. [00:52:47] All right. So gradient descent — both batch gradient descent and stochastic gradient descent — is an iterative algorithm.
[00:52:58] That means you have to take multiple steps to get, you know, near — hopefully — the global optimum. It turns out there's another algorithm. Oh, and for many of the other algorithms we'll talk about in this class, including generalized linear models and neural networks and a few other algorithms, you will have to use gradient descent, and so we'll see gradient descent, you know, as we develop multiple different algorithms later this quarter. [00:53:27] It turns out that for the special case of linear regression — and I mean linear regression, not the other algorithms we'll talk about next Monday, not the other ones we'll develop in this course — if the algorithm you're using is linear regression, exactly linear regression, it turns out there's a way to solve for the optimal value of the parameters theta, to just jump in one step to the global optimum, without needing to use an iterative algorithm.
[00:53:50] Right. And this algorithm I'm gonna present next is called the normal equation. It works only for linear regression — it doesn't work for any of the other algorithms we'll talk about later in this course. [00:54:10] But let me quickly show you the derivation of that. And what I want to do is give you a flavor of how to derive the normal equation, and where you end up — you know, what I hope to do — is end up with a formula that lets you say theta equals some stuff, where you just set theta equal to that, and in one step, with a few matrix multiplications, you end up with the optimal value of theta — the value that lands you right at the global optimum, right? Just like that, just in one step, okay? [00:54:45] Um, and if you've taken, you know, advanced linear algebra classes before, you may have seen this formula for linear regression.
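The one-step formula being referred to is the normal equation, theta = (X^T X)^{-1} X^T y. A quick numerical sketch — the three-column design matrix (intercept plus two made-up house features, so n = 2 and theta is in R^{n+1} = R^3) and the true coefficients are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical design matrix: intercept column plus two house features.
m = 49
X = np.c_[np.ones(m), rng.uniform(1.0, 5.0, m), rng.integers(1, 6, m)]
y = X @ np.array([0.5, 1.0, 0.25]) + rng.normal(0.0, 0.1, m)

# Normal equation: solve (X^T X) theta = X^T y. This jumps straight to the
# global optimum of J(theta) in one step — no iterations at all.
# (numpy's lstsq solves the same system in a numerically preferred way.)
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)   # close to the made-up true coefficients [0.5, 1.0, 0.25]
```

At the solution, the gradient X^T(X theta - y) is (numerically) zero — the same condition gradient descent only approaches iteratively.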
Yashiro clauses do is what some of the natural [00:54:58] clauses do is what some of the natural classes do is cover the board with you [00:55:00] classes do is cover the board with you know pages and pages and matrix [00:55:02] know pages and pages and matrix derivatives what I want to do is [00:55:04] derivatives what I want to do is describe to you a matrix derivative [00:55:08] describe to you a matrix derivative notation that allows you to derive the [00:55:10] notation that allows you to derive the normal equation in roughly four lines of [00:55:13] normal equation in roughly four lines of linear algebra rather than so pages and [00:55:15] linear algebra rather than so pages and pages in linear algebra and in the work [00:55:18] pages in linear algebra and in the work I've done in machine learning [00:55:19] I've done in machine learning you know sometimes notation really [00:55:21] you know sometimes notation really matters right if you're the right [00:55:22] matters right if you're the right notation you can solve some problems [00:55:24] notation you can solve some problems much more easily and what I want to do [00:55:26] much more easily and what I want to do is define this matrix linear algebra [00:55:31] is define this matrix linear algebra notation and then I don't want to do all [00:55:34] notation and then I don't want to do all the steps of the derivation I'm gonna [00:55:35] the steps of the derivation I'm gonna give you a give you a sense of the [00:55:37] give you a give you a sense of the flavor of what it looks like and then [00:55:39] flavor of what it looks like and then I'll ask you to get a lot of details [00:55:42] I'll ask you to get a lot of details yourself in the in the lecture notes [00:55:46] yourself in the in the lecture notes will work out everything in more detail [00:55:48] will work out everything in more detail than I want to do algebra in class oh [00:55:49] than I want to do algebra in class oh and um in problem set one 
[00:55:52] And in problem set one, you get to practice using this yourself, to derive some additional things — I've found this notation really convenient for deriving learning algorithms. [00:56:02] Okay, so I'm going to use the following notation. J is a function mapping from the parameters to the real numbers, and I'm going to define the derivative of J(θ) with respect to θ, where, remember, θ is a three-dimensional vector — it's in R^3, or rather R^(n+1): if you have two features of the house, if n = 2, then θ is three-dimensional, (n+1)-dimensional. So θ is a vector, and I'm going to define the derivative with respect to θ of J(θ) as follows. This is going to be itself a 3-by-1 vector:

∇_θ J(θ) = [ ∂J/∂θ_0 ; ∂J/∂θ_1 ; ∂J/∂θ_2 ]

[00:57:01] So I hope this notation is clear: this is a three-dimensional vector with three components — that's the first component of the vector, that's the second, and the third. It's the partial derivative of J with respect to each of the three elements of θ. [00:57:33] More generally, for the notation we'll use, maybe an example. Let's say that A is a matrix — say A is the 2-by-2 matrix

A = [ a11  a12 ; a21  a22 ]

Then you might have some function of the matrix A that returns a real number, so f maps from R^(2×2) to the real numbers. And so, for example, if f(A) = a11 + a12^2, then f([5 6; 7 8]) would be equal to, I guess, 5 + 6^2 = 41. [00:58:51] So as we go through this derivation, we'll be working a little bit with functions that map from matrices to real numbers, and this is just one made-up example of a function that takes a matrix and maps the values of that matrix to a real number.
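As a quick sanity check of that made-up example, here it is in numpy (note the 0-based indexing, so a11 is `A[0, 0]`):

```python
import numpy as np

# The lecture's made-up function: f maps a 2x2 matrix to a real number,
# f(A) = a11 + a12^2 (1-indexed), i.e. A[0, 0] + A[0, 1]**2 in numpy.
def f(A):
    return A[0, 0] + A[0, 1] ** 2

A = np.array([[5.0, 6.0],
              [7.0, 8.0]])
print(f(A))  # 5 + 6^2 = 41.0
```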
[00:59:05] And when you have a matrix function like this, I'm going to define the derivative with respect to A of f(A) to be itself a matrix — the derivative of f(A) with respect to the matrix A will itself be a matrix with the same dimensions as A, and the elements of it are the derivatives with respect to the individual elements. I'll just write it like this:

( ∇_A f(A) )_ij = ∂f/∂A_ij

[01:00:01] Okay, so if A is a 2-by-2 matrix, then the derivative of f(A) with respect to A is itself a 2-by-2 matrix, and you compute this 2-by-2 matrix just by looking at f and taking derivatives with respect to the different elements, and plugging them into the different elements of this matrix. And so in this example, I guess the derivative of f(A) with respect to A would be

∇_A f(A) = [ 1  2·a12 ; 0  0 ]

[01:00:39] I got these four numbers by taking the definition of f and taking the derivative with respect to a11 and plugging that in here, taking the derivative with respect to a12 and plugging that in here, and taking the derivatives with respect to the remaining elements and plugging them in here. So that's the definition of a matrix derivative. [01:01:11] [Student question] Oh yes, we use the same definition for a vector — an n-by-1 matrix. And in fact that definition and the definition for the derivative of J with respect to θ are consistent: if you apply this definition to a column vector, treating the column vector as an n-by-1 — or rather (n+1)-by-1 — matrix, then it specializes to what we described here. [01:01:48] All right, so let's see. I want to leave the details to the lecture notes, because there are more lines of algebra, but I want to give you an overview of what the derivation of the normal equation looks like.
[01:02:13] So, armed with this definition of the derivative with respect to a matrix, the broad outline of what we're going to do is: we're going to take J(θ) — that's the cost function — and take the derivative with respect to θ, since θ is a vector. And, well, how do you minimize a function? You take the derivative with respect to θ, set it equal to zero, and then solve for the value of θ that makes the derivative zero — at a maximum or minimum of a function, the derivative is equal to zero. [01:02:52] So how you derive the normal equation is: J(θ) maps from a vector to a real number, so we'll take the derivative with respect to θ, set that derivative equal to zero, and solve for θ, and then we end up with a formula for θ that lets you immediately go to the global minimum of the cost function J(θ). And all of the build-up, all of this notation, is there to answer: what does this mean, and is there an easy way to compute the derivative of J(θ)? [01:03:29] Okay, so to help you understand the lecture notes — when, hopefully, you take a look at them — just a couple of other definitions. If A is a square matrix — say A is an n-by-n matrix, so the number of rows equals the number of columns — I'm going to denote the trace of A to be the sum of the diagonal entries:

tr A = Σ_i A_ii

[01:04:05] This is pronounced "the trace of A", and you can also write it with the trace operator, like the trace function applied to A, but by convention we often write trace of A without the parentheses. So trace just means sum of diagonal entries. [01:04:29] And some facts about the trace of a matrix: the trace of A is equal to the trace of A
transpose, [01:04:36] because when you transpose a matrix you're just flipping it along the 45-degree axis, so the diagonal entries actually stay the same when you transpose the matrix; hence tr A = tr A^T. [01:04:49] Then there are some other useful properties of the trace operator. Here's one that I don't want to prove, but that you could go home and prove yourself with a little bit of work — maybe not too much. If you define f(A) = tr AB, where B is some fixed matrix — so what f(A) does is multiply A and B and then take the sum of the diagonal entries — then it turns out that the derivative with respect to A of f(A) is equal to B transpose:

∇_A tr AB = B^T

[01:05:38] And you could prove this yourself: for any matrix B, if f(A) is defined this way, the derivative is equal to B^T. [01:05:45] The trace function, or the trace operator, has other interesting properties. The trace of AB is equal to the trace of BA — you could prove this from first principles; it's a little bit of work if you expand out the definitions of A and B. And the trace of A times B times C is equal to the trace of C times A times B; this is a cyclic permutation property — if you multiply several matrices together, you can always take one from the end and move it to the front, and the trace will remain the same:

tr AB = tr BA,   tr ABC = tr CAB

[01:06:31] And another one, which is a little bit harder to prove, concerns the derivative of tr A A^T C. Okay, so I think, just as in your ordinary calculus, we know the derivative of x^2 is 2x, right — we all figured that out once, and we just use it, without having to re-derive it every time —
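All of these trace facts are easy to confirm numerically on random matrices; a small sketch (the random 3-by-3 matrices are my choice, purely for checking):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3))
B = rng.normal(size=(3, 3))
C = rng.normal(size=(3, 3))

# tr A = tr A^T
assert np.isclose(np.trace(A), np.trace(A.T))
# tr AB = tr BA
assert np.isclose(np.trace(A @ B), np.trace(B @ A))
# cyclic permutation: tr ABC = tr CAB
assert np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B))

# grad_A tr(AB) = B^T, checked by finite differences on f(A) = tr(AB)
eps = 1e-6
G = np.zeros_like(A)
for i in range(3):
    for j in range(3):
        E = np.zeros_like(A)
        E[i, j] = eps
        G[i, j] = (np.trace((A + E) @ B) - np.trace((A - E) @ B)) / (2 * eps)
assert np.allclose(G, B.T, atol=1e-4)

print("all trace identities check out")
```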
[01:07:01] this is a little bit like that. The derivative of the trace of A A^T C is

∇_A tr A A^T C = CA + C^T A

Think of this as analogous to d/da (a^2 c) = 2ac; this is like the matrix version of that. [01:07:53] All right. So finally, what I'd like to do is take J(θ) and express it in this matrix-vector notation, so we can take the derivatives with respect to θ, set those equal to zero, and just solve for the value of θ. So let me just write out the definition of J(θ):

J(θ) = (1/2) Σ_{i=1}^{m} ( h_θ(x^(i)) − y^(i) )^2

[01:08:42] And it turns out that if you define the matrix capital X as follows — I'm going to take the matrix capital X and take the training examples we have and stack them up in rows, so we have m training examples —
[01:09:05] so the x's, which we call vectors — I'm taking transposes, so you just stack up the m examples' transposes as m rows here:

X = [ x^(1)^T ; x^(2)^T ; … ; x^(m)^T ]

Let me call this capital X the design matrix. [01:09:18] And it turns out that if you define X this way, then Xθ is this matrix times θ, and the way matrix-vector multiplication works — θ is now a column vector, [θ_0; θ_1; θ_2] — is that you multiply this column vector with each of these rows in turn, and so this ends up being

Xθ = [ x^(1)^T θ ; x^(2)^T θ ; … ; x^(m)^T θ ]

which is of course just the vector of all of the predictions of the algorithm. [01:10:28] And now let me also define a vector y, taking all of the labels from your training examples and stacking them up into a big column vector — let me define y that way.
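A tiny numpy illustration of the design matrix idea, with made-up housing-style numbers (the feature values and θ are invented for illustration): stacking the examples as rows means one matrix-vector product computes every prediction at once.

```python
import numpy as np

# Three training examples with n = 2 features; x0 = 1 is the intercept term,
# so each x^(i) lives in R^(n+1) = R^3. Numbers are made up.
x1 = np.array([1.0, 2104.0, 3.0])   # [1, size, #bedrooms]
x2 = np.array([1.0, 1416.0, 2.0])
x3 = np.array([1.0, 1534.0, 3.0])

# Design matrix: stack the examples' transposes as rows, giving an m x (n+1) matrix.
X = np.stack([x1, x2, x3])

theta = np.array([10.0, 0.1, 5.0])

# X @ theta computes [x1^T theta, x2^T theta, x3^T theta]:
# the vector of all m predictions in one matrix-vector product.
preds = X @ theta
loop_preds = np.array([x @ theta for x in (x1, x2, x3)])
print(np.allclose(preds, loop_preds))  # True
```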
[01:10:53] It turns out that J(θ) can then be written as

J(θ) = (1/2) (Xθ − y)^T (Xθ − y)

[01:11:20] Okay, and let me just outline the proof, but I won't do this in great detail. Xθ − y — this is Xθ, this is y — is going to be this vector:

Xθ − y = [ h(x^(1)) − y^(1) ; … ; h(x^(m)) − y^(m) ]

This is just all the errors your learning algorithm is making on the examples — the differences between the predictions and the actual labels. [01:11:55] And if you remember, z^T z = Σ_i z_i^2 — a vector transposed times itself is the sum of squares of its elements. And so this vector transposed times itself is the sum of squares of its elements, which is why the cost function J(θ), which is computed by taking the sum of squares of all of these errors, is what you get by taking this vector Xθ − y and multiplying it, transposed, by itself: you end up with the sum of squares of those error terms. Okay. [01:12:39] And if some of the steps don't quite make sense, really don't worry about it; all of this is written out more slowly and carefully in the lecture notes. But I wanted you to have a sense of the big picture of the derivation before you go through the details in the lecture notes yourself.
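The claim that the matrix-vector form equals the element-wise sum of squared errors is straightforward to verify; a sketch on random data (the sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 20, 3
X = rng.normal(size=(m, n))     # design matrix
y = rng.normal(size=m)
theta = rng.normal(size=n)

# J(theta) written element-wise: 1/2 * sum_i (h(x^(i)) - y^(i))^2
J_loop = 0.5 * sum((X[i] @ theta - y[i]) ** 2 for i in range(m))

# J(theta) in matrix-vector form: 1/2 * (X theta - y)^T (X theta - y),
# using z^T z = sum_i z_i^2
e = X @ theta - y
J_vec = 0.5 * e @ e

print(np.isclose(J_loop, J_vec))  # True
```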
[01:13:08] So finally, what we want to do is take the derivative with respect to θ of J(θ) and set that to zero. And so this is going to be equal to

∇_θ (1/2) (Xθ − y)^T (Xθ − y)

[01:13:33] I'm going to do the steps really quickly — the steps require some of the little properties of traces and matrix derivatives that I wrote down briefly just now — so I'm going to do these very quickly without going into the details. This is equal to

(1/2) ∇_θ (θ^T X^T − y^T)(Xθ − y)

— take transposes of these things, so this becomes θ^T X^T minus y^T — and then, kind of like expanding out a quadratic function — (a − b)(c − d) is ac − ad and so on — I just write this out:

(1/2) ∇_θ ( θ^T X^T X θ − θ^T X^T y − y^T X θ + y^T y )

[01:14:29] And what I just did here is similar to how (ax − b)(ax − b) = a^2 x^2 − axb − bax + b^2; it's just expanding out the quadratic. [01:15:14] And then the final step — is that right? oh yes, thank you — and then the final step is, for each of these four terms, first, second, third, and fourth, to take the derivative with respect to θ, and if you use some of the formulas I was alluding to over there, you find — I don't want to show the derivation — that the derivative turns out to be

∇_θ J(θ) = (1/2) ( X^T X θ + X^T X θ − X^T y − X^T y )

and so this simplifies to

∇_θ J(θ) = X^T X θ − X^T y

[01:16:09] And so, as described, we're going to set this derivative to zero. How you go from this step to that step uses the matrix derivatives explained in more detail in the lecture notes.
[01:16:23] And so the final step is: having set this to zero, this implies that

X^T X θ = X^T y

So these are called the normal equations, and the optimal value for θ is

θ = (X^T X)^(-1) X^T y

Okay. [01:16:49] And if you implement this, then you can, in basically one step, get the value of θ that corresponds to the global minimum. [01:17:08] And again, a common question I get is: well, what if X^T X is non-invertible? What that usually means is that you have redundant features — your features are linearly dependent. If you use something called the pseudo-inverse, you kind of get the right answer in that case, although I think the even more right answer is that having linearly dependent features probably means you have the same feature repeated twice, and I would usually go and figure out which features are actually repeated and leading to this problem. Okay.
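Putting it together, a minimal numpy sketch of the normal equations, including the singular case the question raises (a column duplicated to make the features linearly dependent). `np.linalg.solve` and `np.linalg.pinv` are standard numpy calls; the data is made up:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 30, 3
X = rng.normal(size=(m, n))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.01 * rng.normal(size=m)

# Normal equation: theta = (X^T X)^(-1) X^T y.
# Solving the linear system is preferred over forming the inverse explicitly.
theta = np.linalg.solve(X.T @ X, X.T @ y)

# With redundant (linearly dependent) features, X^T X is singular;
# the pseudo-inverse still returns a sensible least-squares answer.
X_dup = np.hstack([X, X[:, :1]])          # same feature repeated twice
theta_dup = np.linalg.pinv(X_dup) @ y     # pinv handles the singular case

# Both solutions project y onto the same column space, so predictions agree.
print(np.allclose(X @ theta, X_dup @ theta_dup, atol=1e-6))  # True
```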
[01:17:34] All right, any last questions? So that's the normal equations — I hope you read through the detailed derivations in the lecture notes. Any last questions before we wrap up? Great. [01:17:53] [Student question] Oh yeah — how do you choose the learning rate? This is quite empirical, I think; most people would try different values and just pick one. All right, I think let's break; if people have more questions, as the TAs come up we can take a few questions, but let's call it a day. Thanks, everyone.

================================================================================
LECTURE 003
================================================================================
Locally Weighted & Logistic Regression | Stanford CS229: Machine Learning - Lecture 3 (Autumn 2018)
Source: https://www.youtube.com/watch?v=het9HFqo1TQ
---
Transcript

[00:00:03] What I'd like to do today is continue our discussion of supervised learning. So last Wednesday you saw the linear regression algorithm, including both how to pose the problem and gradient descent, and then the normal equations. What I'd like to do today is talk about locally weighted regression, which is
a way to modify linear regression to make it fit very nonlinear functions — so you're not fitting just straight lines. And then we'll talk about a probabilistic interpretation of linear regression, and that will lead us into the first classification algorithm you'll see in this class, which is called logistic regression; and we'll talk about an algorithm called Newton's method for logistic regression. [00:00:49] And so the dependency of ideas in this class is that locally weighted regression will depend on what you learned in linear regression. And then what I'm going to do is just cover the key ideas of locally weighted regression and let you play with some of the ideas yourself in problem set one, which we'll release later this week. And then I'll give a probabilistic interpretation of linear regression; logistic regression will depend on that, and Newton's method is for logistic regression. [00:01:19] To recap the notation you saw
on Wednesday: we use the notation (x^(i), y^(i)) to denote a single training example, where x^(i) was n+1 dimensional. So if you had two features — the size of a house and the number of bedrooms — then x^(i) would be two plus one, i.e. three-dimensional, because we had introduced a new, sort of fake feature x0 which was always set to the value of 1. And then y^(i), in the case of regression, is always a real number; m was the number of training examples, and n was the number of features. And this was the hypothesis — h(x) = theta^T x, a linear function of the features x, including this feature x0 which is always set to 1 — and J was the cost function: you minimize J as a function of theta to find the parameters theta for your straight-line fit to the data. Okay, so that's what you saw last Wednesday.
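The recap above translates directly into a short sketch — illustrative NumPy code, not from the lecture; the function names are my own. It shows the fake feature x0 = 1, the hypothesis h(x) = theta^T x, and the least-squares cost J(theta):

```python
import numpy as np

def add_intercept(X):
    """Prepend the 'fake' feature x0 = 1 to every example."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def hypothesis(theta, X):
    """h_theta(x) = theta^T x, applied to every row of X."""
    return X @ theta

def cost(theta, X, y):
    """J(theta) = 1/2 * sum_i (h_theta(x^(i)) - y^(i))^2."""
    residual = hypothesis(theta, X) - y
    return 0.5 * residual @ residual

# Toy data: two features (size of house, number of bedrooms) -> price,
# so x^(i) is 2 + 1 = 3-dimensional after adding x0 = 1.
X = add_intercept(np.array([[2104.0, 3.0], [1600.0, 3.0]]))
y = np.array([400.0, 330.0])
theta = np.zeros(3)       # n + 1 = 3 parameters
print(cost(theta, X, y))  # with theta = 0, J = 1/2 * (400^2 + 330^2) = 134450.0
```

Minimizing this J over theta (by gradient descent or the normal equations, as in the previous lecture) gives the straight-line fit.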
[00:02:28] Um, now, if you have a data set that looks like that — where this is the size of a house and this is the price of a house — what you saw last Wednesday was an algorithm to fit a straight line to this data, so the hypothesis was of the form theta0 * x0 + theta1 * x1, right? But with this data set, maybe the data actually looks a little bit like that, and so one question you have to address when fitting models to data is: what are the features you want? Do you want to fit a straight line to this problem, or do you want to fit a hypothesis of the form theta1 * x + theta2 * x^2, since maybe this is a quadratic function? But the problem with a quadratic function is that it eventually starts, you know, curving back down — this starts curving back down. So maybe you don't want to fit a quadratic function; instead maybe you want, um, to
[00:03:35] fit something like that, if housing prices sort of curve down a little bit but you don't want them to eventually curve back down the way a quadratic function would. [00:03:46] And if you want to do this, the way you would implement it is: you define the first feature x1 = x and the second feature x2 = x^2, or you define x1 = x and x2 = sqrt(x). And by defining a new feature x2 — which can be the square of x or the square root of x — the machinery of linear regression that you saw on Wednesday applies to fit these types of functions to the data. [00:04:16] Later this quarter you'll hear about feature selection algorithms, which are a type of algorithm for automatically deciding: do you want x^2 as a feature, or sqrt(x) as a feature, or maybe you want log(x) as a feature?
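As a concrete sketch of this feature trick (my own illustration, with made-up toy data — not the lecture's code): build the design matrix with x2 = x^2 or x2 = sqrt(x), and the ordinary least-squares machinery is reused unchanged:

```python
import numpy as np

def design_matrix(x, second_feature):
    """Columns [x0 = 1, x1 = x, x2 = f(x)]: only the features change;
    the linear-regression machinery stays the same."""
    return np.column_stack([np.ones_like(x), x, second_feature(x)])

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.sqrt(x)  # pretend prices happen to follow sqrt(x) exactly

# Fit theta by least squares for two candidate feature sets.
X_quad = design_matrix(x, np.square)   # [1, x, x^2]
X_sqrt = design_matrix(x, np.sqrt)     # [1, x, sqrt(x)]
theta_quad, *_ = np.linalg.lstsq(X_quad, y, rcond=None)
theta_sqrt, *_ = np.linalg.lstsq(X_sqrt, y, rcond=None)

# The sqrt features fit this toy data exactly: theta = [0, 0, 1].
print(theta_sqrt)
```

Which feature set "does the best job" is exactly the question feature selection tries to answer automatically.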
Right — but which of these features does the best job of fitting the data that you have, if it's not fit well by a perfectly straight line? [00:04:41] What I'd like to do today — well, you'll hear about feature selection later this quarter; what I want to show you today is a different way of addressing this problem of when the data isn't well fit by a straight line. And in particular, let me share with you an idea called locally weighted regression, or locally weighted linear regression. [00:05:00] So let me use a slightly different example to illustrate this, which is: you know, if you have a data set that looks like that, it's pretty clear what the shape of this data is, but how do you fit a curve that kind of looks like that, right? And it's actually quite difficult to find features — is it sqrt(x), log(x), x^3, the third root of x, x to the 2/3? What is the set of features that lets you do this?
[00:05:35] Well, we'll sidestep all those problems with an algorithm called locally weighted regression. [00:05:53] And so let me introduce a bit of machine learning terminology: in machine learning we sometimes distinguish between parametric learning algorithms and non-parametric learning algorithms. In a parametric learning algorithm, you fit some fixed set of parameters, such as the theta_i, to data. And so linear regression, as you saw last Wednesday, is a parametric learning algorithm, because there's a fixed set of parameters — the theta_i — that you fit to the data, and then you're done, right? Locally weighted regression will be our first exposure to a non-parametric learning algorithm, and what that means is that the amount of data, or parameters, you need to keep grows — and in this case it grows linearly — with the size of the training set. [00:07:19] Okay, so with a parametric learning algorithm, no matter how big your training
set is, you fit the parameters theta_i; then you could erase the training set from your computer's memory and make predictions just using the parameters theta. Whereas in a non-parametric learning algorithm, which we'll see in a second, the amount of stuff you need to keep around in computer memory — the amount of stuff you need to store — grows linearly as a function of the training set size. And so this type of algorithm may not be great if you have a really, really massive dataset, because you'd need to keep all of the data around in computer memory, or on disk, just to make predictions, okay? [00:07:56] But we'll see an example of this, and one of the effects is that it will be able to fit that data that I drew up there quite well, without you needing to fiddle manually with features. [00:08:08] Again, you'll get to practice implementing locally weighted regression
and have that all work, so I'm going to go over the ideas relatively quickly and then let you gain practice in the problem set. [00:08:22] All right, so let me redraw that data set, something like this. So say you have a data set like this. Now, for linear regression, if you want to evaluate the hypothesis at a certain value of the input — to make a prediction at a certain value of x — what you do for linear regression is you fit theta to minimize this cost function, and then you return theta^T x, right? So you fit the straight line, and then if you want to make a prediction at this value x, you return theta^T x. [00:09:27] For locally weighted regression, [00:09:41] you do something slightly different, which is: if this is the value of x at which you want to make a prediction, what you do is look in a little
local neighborhood at the training examples close to that point x where you want to make a prediction. And I'll describe this informally for now, but we'll formalize it in math in a second. Focusing mainly on these examples — you know, looking a little bit at the further examples, but really focusing mainly on these — you try to fit a straight line like that, concentrating on the training examples close to where you want to make a prediction. And by close I mean the values are similar on the x-axis — the x values are similar. And then to actually make a prediction, you use this green line you just fit to make a prediction at that value of x. [00:10:38] Now, if you want to make a prediction at a different point — let's say, you know, the user now says, hey, make a prediction for this point — then what you would do is, again, focus on this local area, kind of look at those points. And when I say focus, I'm saying,
you know, put most of the weight on these points — you kind of take a glance at the points further away, but most of the attention is on these — fit the straight line to that, and then use that straight line to make a prediction, okay? [00:11:06] And so, to formalize this: in locally weighted regression, you will fit theta to minimize a modified cost function, sum over i of w^(i) * (y^(i) - theta^T x^(i))^2, where w^(i) is a weighting function. And a good — well, the default choice, a common choice — of w^(i) will be this: w^(i) = exp(-(x^(i) - x)^2 / 2). I'm going to add something to this equation a little bit later, but w^(i) is a weighting function, and notice that this formula has a defining property: if x^(i) - x is small, then the weight will be close to one. Because x is the location where you want to make a prediction, and x^(i) is the input x of your i-th training example, w^(i) is a weighting function whose value is between 0 and 1 and that tells you how
much you should pay attention to the values of (x^(i), y^(i)) when fitting, say, this green line or that red line. And so if x^(i) - x is small — that's a training example close to where you want to make the prediction x — then this is about e to the 0, right? e to the negative of something small, and e^0 is close to 1. And conversely, if x^(i) - x is large, then w^(i) is close to 0. So if x^(i) is very far away — say you're fitting this green line, and this is your example (x^(i), y^(i)) all the way out there — then, relative to this x, that example's weight is very close to 0, okay? [00:13:39] And so if you look at the cost function, the main modification we've made is that we've added this weighting term.
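The defining property just described is easy to see numerically. This is an illustrative sketch (my own, not the lecture's code) of the weight function, written with the bandwidth tau that gets named a bit later in the lecture; with tau = 1 it reduces to the form on the board:

```python
import numpy as np

def weight(x_i, x, tau=1.0):
    """w^(i) = exp(-(x_i - x)^2 / (2 * tau^2)): close to 1 when the
    training input x_i is near the query point x, close to 0 when far."""
    return np.exp(-(x_i - x) ** 2 / (2 * tau ** 2))

x_query = 5.0
print(weight(5.0, x_query))   # x_i - x = 0  -> e^0 = 1
print(weight(5.1, x_query))   # nearby       -> close to 1
print(weight(50.0, x_query))  # far away     -> essentially 0
```

So nearby examples contribute almost their full squared error to the cost, and faraway ones contribute almost nothing.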
And so what locally weighted regression does is the following: if an example x^(i) is far from where you want to make a prediction, you multiply that error term by 0 — or by a constant very close to zero — whereas if it's close to where you want to make the prediction, you multiply the error term by 1. And so the net effect — since, you know, terms multiplied by zero disappear, right — is that this sums essentially only over the squared-error terms for the examples that are close to the value of x where you want to make a prediction. [00:14:37] And that's why, when you fit theta to minimize this, you end up paying attention only to the examples close to where you want to make the prediction, and fitting a line like the green line over there, okay? [00:14:58] So let me draw a couple more pictures to illustrate this.
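Putting the pieces together, here is a minimal sketch of one locally weighted prediction (my own illustration, assuming the closed-form weighted least-squares solve; the lecture leaves the implementation to the problem set). At each query point, the modified cost is minimized via the weighted normal equations, and the prediction is theta^T x_query:

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=1.0):
    """Locally weighted linear regression at a single query point.

    Minimizes sum_i w^(i) * (y^(i) - theta^T x^(i))^2 in closed form
    via the weighted normal equations, then returns theta^T x_query.
    X must carry a leading column of ones (the x0 = 1 feature)."""
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return theta @ x_query

# Toy 1-D data lying exactly on y = 2x; every local fit recovers the line.
x = np.linspace(0.0, 10.0, 21)
X = np.column_stack([np.ones_like(x), x])
y = 2.0 * x
print(lwr_predict(X, y, np.array([1.0, 4.0]), tau=0.5))  # ~8.0
```

Note that, unlike parametric linear regression, theta is refit from the training data for every new query point — which is exactly the non-parametric memory cost discussed above.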
[00:15:05] Suppose you have a slightly smaller data set, just to make this easier to illustrate. So that's your training set — those are the examples x^(1), x^(2), x^(3), x^(4) — and if you want to make a prediction here, right at that point x, then this curve here — the shape of this curve is actually like this: it's the shape of a Gaussian bell curve. But this has nothing to do with a Gaussian density, right — this thing does not integrate to 1. Sometimes people ask me, is this using a Gaussian density? The answer is no; this is just a function that is shaped a lot like a Gaussian, but, you know, Gaussian densities — probability density functions — have to integrate to one, and this doesn't. So this has nothing to do with a Gaussian probability density. [00:15:50] Question — oh, so how do you choose the width? Well, let me get back to that. And so for this example, this height here says to give this example a weight equal to the height of that thing; give this
example a weight equal to this height, that one the height there, and so on, right? Which is why, if you actually have an example way out there, you know, it's given a weight that's essentially zero — which is why the algorithm is weighting only the nearby examples when trying to fit a straight line, for making predictions close to this x, okay? [00:16:31] Um, now, one last thing I want to mention, which is the question just now: how do you choose the width of this Gaussian-shaped function — how fat or how thin should it be? And this decides how big a neighborhood you should look in, in order to decide what's the neighborhood of points that you use to fit your local straight line. [00:17:00] And so, for a Gaussian-shaped function like this, I'm going to call this the bandwidth parameter tau. And this is a parameter — or hyperparameter — of the algorithm, and depending on the choice of tau,
you can choose a fatter or thinner bell-shaped curve, which causes you to look in a bigger or a narrower window in order to decide, you know, how many nearby examples to use in order to fit the straight line, okay? [00:17:36] And it turns out — and I want to leave you to discover this yourself in the problem set — if you've taken a little bit of machine learning elsewhere, you may have heard the terms overfitting and underfitting. It turns out that the choice of the bandwidth tau has an effect on overfitting and underfitting; if you don't know what those terms mean, don't worry about it — we'll define them later this quarter. But what you get to do in the problem set is play with tau yourself and see why, if tau is too broad, you end up over-smoothing the data, and if tau is too thin, you end up fitting a very jagged fit to the data. And if any of these things don't make sense yet, don't worry about it — they'll make sense after you play with it in the problem set.
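One way to see the bandwidth's effect numerically — an illustrative sketch of my own, not the problem-set exercise — is to count how many training points receive non-negligible weight under a fat versus a thin bell curve:

```python
import numpy as np

def neighborhood_size(x_train, x_query, tau, cutoff=0.1):
    """Number of training points whose weight exceeds `cutoff`: a rough
    measure of how wide a window the bell-shaped curve looks at."""
    w = np.exp(-(x_train - x_query) ** 2 / (2 * tau ** 2))
    return int(np.sum(w > cutoff))

x_train = np.linspace(0.0, 10.0, 101)  # grid with spacing 0.1
print(neighborhood_size(x_train, 5.0, tau=2.0))  # broad tau: wide window
print(neighborhood_size(x_train, 5.0, tau=0.2))  # thin tau: narrow window
```

A broad tau averages over many points (risking an over-smoothed fit); a thin tau fits each prediction to only a handful of points (risking a jagged fit).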
[00:18:30] Okay — yeah, since you'll play with varying tau in the problem set, you'll see the net impact for yourself. Okay, thank you. [00:18:44] [Student asks what happens if you need to make a prediction at a value of x outside the range of the training data.] It turns out that you can still use this algorithm; it's just that its results may not be very good. Yeah — [00:19:09] locally weighted linear regression is usually not great at extrapolation, but then most learning algorithms are not great at extrapolation. So all the formulas still work — it's still implementable — but, um, yeah, you know, you can also try it in your problem set and see what happens. [00:19:36] [Student asks whether tau could be variable.] Yes, it is — and there are quite complicated ways to choose tau based on how many points there are in the local region and so on. Yes, there's a huge literature on different
weighting formulas, actually — for example, instead of the Gaussian bump, sometimes people use a triangle-shaped function, so that the weight actually goes to zero on both sides outside a small window. So there are many versions of this algorithm. [00:19:56] So, I tend to use locally weighted linear regression when you have a relatively low-dimensional dataset — when the number of features is not too big, right, so when n is quite small, like two or three or something — and you have a lot of data, and you don't want to think about what features to use. So that's the scenario: if you actually have a data set that looks like the ones I've been drawing, you know, locally weighted regression is a pretty good algorithm. [00:20:25] [Student asks whether keeping all the training data around makes this expensive.] Oh sure — yes, it would be, I guess, but it's relative. If you have, you know, two-, three-, four-dimensional data and hundreds of
examples, or many thousands of examples, it turns out the computation needed to fit the minimization is similar to the normal equations, and so it involves solving a linear system of equations of dimension equal to the number of training examples you have. So if that's, you know, like a thousand or a few thousand, that's not too bad. If you have millions of examples, then there are also more scalable algorithms, like k-d trees and much more complicated algorithms, to do this when you have millions or tens of millions of examples. Yeah.
[00:21:13] Okay, so again, you'll get a better sense of this algorithm when you play with it in the problem set.
[00:21:24] Now, the second topic. So I'm going to put aside locally weighted regression — we won't talk about those ideas anymore today — but what I want to do today is — last Wednesday I had said — I had promised last Wednesday that
today I'll give a justification for why we use the squared error. Right — why the squared error, why not, you know, the fourth power, or the absolute value? And so what I want to show you today is the probabilistic interpretation of linear regression, and this probabilistic interpretation will put us in good standing as we go on to logistic regression today, and then generalized linear models later this week. I'll keep the notation up there, so we can continue to refer to it.
[00:22:13] So, right — so why least squares, why squared error? I'm going to present a set of assumptions under which least squares — using the squared error — falls out very naturally. Which is: let's say, for housing price prediction, let's assume that there's a true price of every house, y(i), which is x(i) transpose theta plus epsilon(i), where epsilon(i) is an error term that includes unmodeled effects, you know, and just random noise. So let's
assume that the way, you know, housing prices truly work is that every house's price is a linear function of the size of the house and the number of bedrooms, plus an error term that captures unmodeled effects — such as, maybe one day that seller is in an unusually good mood, or an unusually bad mood, and so that makes the price go higher or lower; we just don't model that — as well as random noise. Right — or maybe, you know, [something] that isn't one of the features, but other things have an impact on housing prices.
[00:23:41] And we're going to assume that epsilon(i) is distributed Gaussian with mean zero and variance sigma squared. So I'm going to use this notation — the way you read this notation is: epsilon(i), then this tilde, you pronounce it "is distributed as," and then script N for N(0, sigma squared). This is a normal distribution, also called the Gaussian distribution — same
thing —
[00:24:14] the normal distribution with mean zero and variance sigma squared, okay. And what this means is that the probability density of epsilon(i) is the Gaussian density: one over root two pi sigma, e to the negative epsilon(i) squared over two sigma squared. Okay. Oh, and unlike the bell-shaped curve I used earlier for locally weighted linear regression, this thing does integrate to one, right — this function integrates to 1 — and so this is a Gaussian density, this is a probability density function. And this is the familiar, you know, Gaussian bell-shaped curve with mean 0 and variance sigma squared, where sigma kind of controls the width of this Gaussian. Okay. And if you haven't seen Gaussians for a while, we'll go over some of the probability prereqs as well in the class's
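(The point just made — that, unlike the unnormalized bell-shaped weight used for locally weighted regression, this density integrates to one — can be checked numerically; a sketch, with names of my own choosing.)

```python
import numpy as np

def gaussian_density(eps, sigma=1.0):
    # p(eps) = 1 / (sqrt(2*pi) * sigma) * exp(-eps^2 / (2 * sigma^2))
    return np.exp(-eps ** 2 / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)

# Riemann-sum approximation of the integral over a wide interval;
# the area is ~1 for any sigma (sigma only changes the curve's width).
eps = np.linspace(-10.0, 10.0, 200001)
area = np.sum(gaussian_density(eps, sigma=1.5)) * (eps[1] - eps[0])
```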
Friday discussion sections.
[00:25:23] So in other words, we assume that the way housing prices are determined is that first there's a true price, theta transpose x, and then, you know, some random force of nature — right, the mood of the seller, or, I don't know, other factors, right — perturbs it from this true value, theta transpose x(i). And the huge assumption we're going to make is that the epsilon(i)'s, these error terms, are IID — and IID, in the statistics sense, stands for
independently and identically distributed. And what that means is that the error term for one house is independent of the error term for a different house — which is actually not a true assumption, right, because, you know, if one house's price on one street is unusually high, probably the price of a different house on the same street will also be unusually high. But this assumption that these epsilon(i) are IID — independently and identically distributed — is one of those assumptions that, you know, is probably not absolutely true, but may be good enough that if you make this assumption, you get a pretty good model.
[00:26:33] And so let's see — under this set of assumptions, this implies that the density, or the probability, of y(i) given x(i) and theta is going to be this — and I'll take this and write it another way. In other words, given x and theta, what's the density — what's the probability — of a particular house's price? Well, it's going to be Gaussian, with mean given by theta transpose x(i) — or theta transpose x — and the variance given by sigma squared. Okay. And so, because the way that the price of a house is determined is by taking theta transpose x as the, you know, quote, true price of the house, and then adding noise — adding error of variance sigma
squared — to it, the assumptions on the left imply that, given x and theta, the density of y, you know, has this distribution — which is really: this is the random variable y, and that's the mean, and that's the variance, of the Gaussian density. Okay.
[00:28:15] Now, um, two pieces of notation — I have one more that you should get familiar with. The reason I wrote the semicolon here is that the way you read this equation is: the semicolon should be read as "parameterized by." And so, because, you know, the alternative way to write this would be to say p of y(i) given x(i) comma theta — but if you were to write this notation that way, this would be conditioning on theta. But theta is not a random variable, so you shouldn't condition on theta, which is why I'm going to write a semicolon. And so the way you read this is: the probability of y(i) given x(i) and — excuse me — parameterized by theta, is equal to
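(The board equations described above, reconstructed in LaTeX — with the semicolon read as "parameterized by.")

```latex
% Model: y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}, \quad \epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)
p\!\left(y^{(i)} \mid x^{(i)}; \theta\right)
  = \frac{1}{\sqrt{2\pi}\,\sigma}
    \exp\!\left(-\frac{\left(y^{(i)} - \theta^T x^{(i)}\right)^2}{2\sigma^2}\right)
% Equivalently:
\left(y^{(i)} \mid x^{(i)}; \theta\right) \sim \mathcal{N}\!\left(\theta^T x^{(i)},\, \sigma^2\right)
```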
that formula. Okay. If you don't understand this distinction — again, don't worry too much about it. In statistics there are multiple schools of statistics, called Bayesian statistics and frequentist statistics; this is a frequentist interpretation. For the purposes of machine learning, don't worry about it — but I find that being consistent [with the] terminology keeps some of our statistician friends from getting really upset, so, you know, I try to follow the statistics convention. It's probably an unnecessary flag, I guess, but for practical purposes it's not that important — if you get this notation wrong on your homework, don't worry about it, we won't penalize you, but I'll try to be consistent. But this just means that theta, in this view, is not a random variable — it's just that theta is a set of parameters that parameterizes this probability distribution. Okay. And the
way to read the second equation is — when you write these equations, you usually don't write them down with the parentheses, but the way to parse this equation is to say that this thing, as a random variable — the random variable y given x and parameterized by theta, this thing that I just drew in green parentheses — is a Gaussian with that distribution. Okay. All right — any questions about this?
[00:30:35] So it turns out that if you are willing to make those assumptions, then linear regression falls out almost naturally from the assumptions we just made. And in particular, under the assumptions we just made, the likelihood of the parameters theta — so this is pronounced "the likelihood of the parameters theta," L of theta — which is defined as the probability of the data, right — so this is the probability of all the values of y, of y(1) up to y(m), given all the x's and parameterized by theta — this is equal to
the product from i equals 1 through m of p of y(i) given x(i), parameterized by theta. Because we assume the errors are IID, right — the error terms are independently and identically distributed from each other — the probability of all of the observations, of all the values of y in our training set, is equal to the product of the probabilities, because of the independence assumption we made. And so, plugging in the definition of p of y given x parameterized by theta that we had up there, this is equal to the product [of the Gaussian densities].
[00:32:36] Okay, now, again, one more piece of terminology. You know, another common question is, if you say, "Hey, Andrew, what's the difference between likelihood and probability?" Right — and so the likelihood of the parameters is exactly the same thing as the probability of the data; but the reason we sometimes talk about likelihood and sometimes talk about probability is, we think of likelihood — so
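(The likelihood just defined — a product of per-example Gaussian densities under the IID assumption — sketched on a toy training set; all names here are my own, not from the lecture.)

```python
import numpy as np

def likelihood(theta, X, y, sigma=1.0):
    """L(theta) = prod_i p(y_i | x_i; theta), with Gaussian, IID error terms."""
    resid = y - X @ theta
    densities = np.exp(-resid ** 2 / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)
    return np.prod(densities)

# Toy training set: 3 examples, intercept column plus one feature
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
theta_true = np.array([0.5, 2.0])
y = X @ theta_true                          # noiseless, so theta_true fits exactly
L_at_truth = likelihood(theta_true, X, y)   # each residual is 0
L_off = likelihood(theta_true + 0.3, X, y)  # perturbed parameters score lower
```

In practice a product of m small densities underflows quickly, which is one more reason to work with the log likelihood instead.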
[00:33:03] this is some function, right — this thing is a function of the data as well as a function of the parameters theta. And if we view this number, whatever this number is — if you view this thing as a function of the parameters, holding the data fixed, then we call that the likelihood. So if you think of the training set — the data — as a fixed thing, and then vary the parameters theta, then I'm going to use the term likelihood; whereas if you view the parameters theta as fixed, and maybe vary the data, I'm going to say probability, right. So you'll hear me — well, I'll try to be consistent; I find I'm pretty good at being consistent, but not perfect — but I'm going to try to say "likelihood of the parameters" and "probability of the data," even though those evaluate to the same thing. It's just, you know, for this function — this function is a function of the data and the parameters — which one
are you viewing as fixed, and which one are you viewing as the variable? So when you view this as a function of theta, then I use the term likelihood. But so — so hopefully you'll hear me say "likelihood of the parameters" — hopefully you won't hear me say "likelihood of the data," right — and then, similarly, hopefully you'll hear me say "probability of the data," and not "probability of the parameters."
[00:34:18] Likelihood of the parameters, okay — so, probability of the data — got it. Sorry — yes — likelihood of the parameters, got it. Yes — sorry — yes, like that, right. Oh, no — so, no, theta is a set of parameters; it's not a random variable. So "likelihood of theta" doesn't mean theta is a random variable, right. By the way, the stuff about what's a random variable and what's not — the semicolon-versus-comma thing — we explain this in more detail in the lecture notes. To me this is, you know, partly a little bit of paying homage to the
religion of frequentist versus Bayesian statistics. From an applied machine learning, operational, what-do-you-write-in-code point of view, it doesn't matter that much. Yeah — but theta is not a random variable; we have the likelihood of the parameters, which is not a random variable.
[00:35:35] What's the rationale for choosing — oh, sure — why is epsilon(i) Gaussian? So it turns out, because of the central limit theorem from statistics, most error distributions are Gaussian, right. If something is an error that's made up of lots of little noise sources which are not too correlated, then by the central limit theorem it will be Gaussian. So if you think that the perturbations are the mood of the seller, what's the school district, you know, what's the weather like, access to transportation — and all of these sources are not too correlated, and you add them up — then the distribution will be
Gaussian. And — I think, yeah — so really, because of the central limit theorem, I think Gaussian has become the default noise distribution. But for things where the true noise distribution is very far from Gaussian, this model doesn't do as well; and in fact, when you see generalized linear models on Wednesday, you'll see how to generalize all of these algorithms to very different distributions, like Poisson and so on.
[00:36:41] All right. So — so we've seen the likelihood of the parameters theta. So I'm going to use lowercase l to denote the log likelihood, and the log likelihood is just the log of the likelihood. And so — and so, log of a product is equal to the sum of the logs, right, and so this is equal to [the expanded sum].
[00:37:49] Okay. And so one of the, you know, well-tested methods in statistics for estimating parameters is to use maximum likelihood estimation — which means: choose theta to maximize the likelihood, right. So you're given a dataset — how would you
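(The log-likelihood expansion being written on the board here, reconstructed in LaTeX.)

```latex
\ell(\theta) = \log L(\theta)
  = \log \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma}
      \exp\!\left(-\frac{\left(y^{(i)} - \theta^T x^{(i)}\right)^2}{2\sigma^2}\right)
  = m \log \frac{1}{\sqrt{2\pi}\,\sigma}
    \;-\; \frac{1}{\sigma^2} \cdot \frac{1}{2}
      \sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2
```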
[00:38:37] like to estimate theta? Well, one natural way to choose theta is to choose whatever value of theta has the highest likelihood — or, in other words, choose the value of theta so that that value of theta maximizes the probability of the data. And so, to simplify the algebra, rather than maximizing the likelihood, capital L, it's actually easier to maximize the log likelihood; but the log is a strictly monotonically increasing function, so the value of theta that maximizes the log likelihood should be the same as the value of theta that maximizes the likelihood. And if you look at the log likelihood we derived, we conclude that if you're using maximum likelihood estimation, what you'd like to do is choose the value of theta that maximizes this thing, right. But this first term is just a constant — theta doesn't even appear in this first term — and so
what you'd like to do is choose the value of theta that maximizes the second term. Notice there's a minus sign there, and so what you'd like to do is — i.e., you know — choose theta to minimize this term.
[00:40:01] Right — oh, and sigma squared is just a constant, right — no matter what sigma squared is, you know — so, so, if you want to minimize this term — excuse me, if you want to maximize this term, the negative of this thing, that's the same as minimizing this term. But this is just J of theta, the cost function you saw earlier for linear regression. Okay — so this little proof shows that choosing the value of theta to minimize the least-squares errors, like you saw last Wednesday, that's just finding the maximum likelihood estimate for the parameters theta, under the set of assumptions we made — that the error terms are Gaussian and IID. Okay. Oh, thank you.
[00:41:03] Oh — is there a situation where using something other than the least-squares cost function would be a
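(A small numerical check of the equivalence just proved — under the lecture's Gaussian, IID assumptions, the theta that maximizes the log likelihood coincides with the least-squares solution. The grid search is purely illustrative; the names are mine.)

```python
import numpy as np

def log_likelihood(theta, X, y, sigma=1.0):
    # l(theta) = m * log(1 / (sqrt(2*pi)*sigma)) - sum(resid^2) / (2*sigma^2)
    resid = y - X @ theta
    return (len(y) * np.log(1.0 / (np.sqrt(2.0 * np.pi) * sigma))
            - np.sum(resid ** 2) / (2.0 * sigma ** 2))

# Synthetic data matching the model: y = theta^T x + Gaussian IID noise
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.uniform(0.0, 5.0, 100)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0.0, 0.5, 100)

# Least-squares estimate via the normal equations: (X^T X)^{-1} X^T y
theta_ls = np.linalg.solve(X.T @ X, X.T @ y)

# Crude grid search for the maximum-likelihood estimate
grid = np.linspace(0.0, 3.0, 301)
theta_mle = max(
    (np.array([a, b]) for a in grid for b in grid),
    key=lambda th: log_likelihood(th, X, y),
)
```

Up to the grid resolution, the two estimates agree, which is the content of the derivation: maximizing l(theta) is minimizing J(theta).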
[00:41:12] better idea? No — so, I think this derivation shows that this is completely equivalent to least squares, right — that if you're willing to assume that the error terms are Gaussian and IID, and if you want to use maximum likelihood estimation, which is a very natural procedure in statistics, then, you know, then you should use least squares.
[00:41:42] If you knew for some reason [the errors] weren't IID, would you change the cost function? Yes — I know — I think that, you know, when building learning algorithms, often we make assumptions about the world that we just know are not a hundred percent true, because it leads to algorithms that are computationally efficient. And so if you knew that your training set was very, very non-IID, there are more sophisticated models you could build — but, yeah, but very often we wouldn't bother. I think, you know, more often than not we might not bother.
I can think of a few special cases where you would bother, but only if you think the assumptions are really, really bad, or if you don't have enough data or something. All right, I want to move on to make sure we get through the rest of things. Any questions? All right. [00:42:39] So, armed with this machinery: what we did here was, we set up a set of probabilistic assumptions. We made certain assumptions about P(y | x), where the key assumption was Gaussian errors that are IID, and then through maximum likelihood estimation we derived an algorithm which turns out to be exactly the least squares algorithm. What I'd like to do is take this framework and apply it to our first classification problem. And so the key steps are, you know: one, make an assumption about P(y | x), P(y | x) parametrized by theta; and second, figure out maximum likelihood estimation.
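For reference, the derivation being summarized here (from earlier in the lecture; the notation below, with m examples, hypothesis theta-transpose-x, and noise variance sigma squared, is a reconstruction rather than a quote) compresses to one line:

```latex
% Assume y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}, with \epsilon^{(i)} \sim \mathcal{N}(0,\sigma^2) iid.
\ell(\theta) = \sum_{i=1}^{m} \log\left[ \frac{1}{\sqrt{2\pi}\,\sigma}
    \exp\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right) \right]
  = m \log\frac{1}{\sqrt{2\pi}\,\sigma}
    \;-\; \frac{1}{\sigma^2} \underbrace{\frac{1}{2}\sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2}_{J(\theta)}
% So maximizing \ell(\theta) over \theta is exactly minimizing the least-squares cost J(\theta).
```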
It's nice to take this framework and apply it to a different type of problem, where the value of y is now either zero or one, since it's a classification problem, OK? [00:43:28] So, let's see. In our first classification problem, we're going to start with binary classification, so the value of y is either 0 or 1, and sometimes we call this binary classification because there are two classes. [00:44:09] So here's a data set, where, yes, this is x and this is y. Um, something that's not a good idea is to apply linear regression to this data set. Sometimes people will do it, and maybe they get away with it, but I wouldn't do it, and here's why. Which is, um, it's tempting to just fit a straight line to this data, and then take the straight line and threshold it at 0.5, and then say, oh, if this is above 0.5, round it off to 1; if it's below 0.5, round it off to 0. But it turns out that this is not a good idea for classification problems.
And here's why. For this data set, it's really obvious what the pattern is, right? Everything to the left of this point should be 0, and everything to the right of that point should be 1. But let's say we now change the data set to just add one more example there, and the pattern is still really obvious: everything to the left of this point should be 0, and everything to the right of that should be 1. But now fit a straight line to this data set with this extra point there. It's not even an outlier; it's really obvious that this point way out there should be labeled 1. But with this extra example, if you fit a straight line to the data, you end up with maybe something like that. And somehow, having this one extra example really didn't change anything, right? But somehow the straight line I fit moved from the blue line to the green line.
And if you now threshold it at 0.5, you end up with a very different decision boundary. And so linear regression is just not a good algorithm for classification. Some people use it, and sometimes, again, if they're lucky it's not too bad, but I personally never use linear regression for classification problems, right, because you just don't know if you'll end up with a really bad fit to the data like this. Um, [00:45:55] oh, and the other unnatural thing about using linear regression for a classification problem is that, you know, for a classification problem the values are 0 or 1, right? And so for it to output negative values, or values even greater than 1, seems strange. [00:46:18] So what I'd like to share with you now is really probably by far the most commonly used classification algorithm, called logistic regression.
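The failure mode just described can be reproduced numerically. This is a minimal sketch, with made-up one-dimensional data standing in for the board drawing: a least-squares line thresholded at 0.5 gives a sensible boundary, until one far-away but obviously-positive example drags it.

```python
import numpy as np

# Hypothetical stand-in for the board data: negatives on the left, positives on the right.
x = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

def threshold_boundary(x, y):
    """Fit a least-squares line to (x, y) and return the x where it crosses 0.5."""
    slope, intercept = np.polyfit(x, y, 1)
    return (0.5 - intercept) / slope

b_before = threshold_boundary(x, y)  # 4.5 by symmetry: a sensible boundary

# Add one extra, obviously-positive example far to the right.
x2, y2 = np.append(x, 30.0), np.append(y, 1.0)
b_after = threshold_boundary(x2, y2)  # the fitted line tilts and the boundary shifts right

print(b_before, b_after)
```

Even though the new point agrees with the existing pattern, the 0.5 crossing moves noticeably, which is exactly the blue-line-to-green-line shift described above.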
I always say the two learning algorithms I probably use the most often are linear regression and logistic regression. Yeah. And this is the algorithm. So as we design the logistic regression algorithm, one of the things we might naturally want is for the hypothesis to output values between 0 and 1, right? And this is mathematical notation for that: the value of h(x), or h subscript theta of x, lies in the interval [0, 1]. Right, the 0-to-1 square bracket is the set of all real numbers from 0 to 1. So this says we want the hypothesis to output values in between 0 and 1, in the set of all numbers from 0 to 1. And so we're going to choose the following form of the hypothesis. [00:47:40] So we will define the function g(z), which looks like this, and this is called the sigmoid, or the logistic, function. These are synonyms; they mean exactly the same thing. So we can call it the sigmoid function or the logistic function; it means exactly the same thing.
But I'm going to choose a function g(z), and this function is shaped as follows. If you plot this function, you find that it looks like this, where, if the horizontal axis is z, then this is g(z). And so it crosses the vertical axis at 0.5, and it, you know, starts off really close to 0, rises, and then asymptotes towards 1, OK? And so g(z) outputs values between 0 and 1. And what logistic regression does is, let's see: so previously, for linear regression, we had chosen this form for the hypothesis, right? We just made a choice that said that housing prices are a linear function of the features x. And what logistic regression says is: theta-transpose-x could be bigger than 1, it could be less than 0, which is not very natural; but it's going to take theta-transpose-x and pass it through this sigmoid function g, so this forces the output values to lie only between 0 and 1.
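The function being described is the standard logistic function, g(z) = 1 / (1 + e^(-z)); the formula itself is on the board rather than in the transcript, so it is restated here. A minimal sketch:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# g(0) = 0.5, g(z) approaches 0 as z -> -infinity and 1 as z -> +infinity,
# so g always outputs values strictly between 0 and 1.
print(sigmoid(np.array([-10.0, 0.0, 10.0])))
```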
OK. So, you know, when designing a learning algorithm, sometimes you just have to choose the form of the hypothesis: how are you going to represent the function h, or h subscript theta? And so we're making that choice here today. And if you're wondering, you know, there are lots of functions that we could have chosen, right? Why not this function, or why not that one? There are lots of functions with vaguely this shape that go between 0 and 1. So why are we choosing this one specifically? It turns out that there's a broader class of algorithms called generalized linear models, which you'll hear about on Wednesday, of which this is a special case. So we've seen linear regression; you'll see logistic regression in a second; and on Wednesday you'll see that both of these are examples of a much bigger set of algorithms derived using a broader set of principles. So, for now, just, you know, take my word for it that we want to use the logistic function.
It'll turn out, as you'll see on Wednesday, that there's a way to derive even this function from more basic principles, rather than just pulling it out of a hat. But that doesn't happen until then, so for now, let me just pull this out of a hat and say that's the one we want to use. [00:50:39] So let's make some assumptions about the distribution of y given x, parametrized by theta. So I'm going to assume that the data has the following distribution: the probability of y being 1 (again, from the breast cancer prediction example that we had in the first lecture, right, this would be the chance of a tumor being cancerous, of being malignant), the chance of y being 1 given the size of the tumor, that's the feature x, parametrized by theta, is equal to the output of your hypothesis. So, in other words, we're going to assume that what you want your learning algorithm to do is input the features and tell me: what's the chance that this tumor is malignant?
Right, what's the chance that y is equal to one? And by logic, I guess, because y can only be one or zero, the chance of y being equal to zero has got to be one minus that. Right? Because if a tumor has a 10% chance of being malignant, that means it must have a 90% chance of being benign, right, since these two probabilities must add up to one. [00:52:13] I'll say it again. [A student asks whether the two probabilities could be swapped.] Oh, yes, you can, but I think it's just sort of a convention. Yeah, sure, you could assume that P(y = 1) was this and P(y = 0) was that, but I think either way it's just which one you call the positive example and which you call the negative example. Um. And now, bearing in mind that y, by definition, because it's a binary classification problem, can only take on the two values 0 and 1, there's a nifty little algebraic way to take these two equations and write them as one equation.
And this will make some of the math a little bit easier. So I'm going to take these two equations, take these two assumptions, take these two facts, and compress them into one equation, which is this. Oh, and I dropped the theta subscript, just to simplify the notation; I'm going to be a little bit sloppy sometimes about whether or not I write the theta there. OK. But these two definitions of P(y | x; theta), bearing in mind that y is either 0 or 1, can be compressed into one equation like this. And then let's just check. If y is equal to one, then this becomes h(x) to the power of one, times this other thing to the power of zero, right? If y is equal to 1, then 1 minus y is 0, and, you know, anything to the power of 0 is just equal to 1. And so if y is equal to 1, you end up with P(y | x; theta) equal to h(x), which is just what we had there.
And conversely, if y is equal to 0, then this exponent will be 0 and that one will be 1, and so you end up with P(y | x; theta) equal to 1 minus h(x), which is just equal to that second equation, OK? Right. And so this is a nifty way to take these two equations and compress them into one line, because depending on whether y is zero or one, one of these two terms switches off, because it's exponentiated to the power of zero, and anything to the power of zero is just equal to one. So one of these terms is just, you know, equal to one and drops out, leaving the other term, selecting the appropriate equation depending on whether y is zero or one, OK? So with this little notational trick, we'll make the later derivations simpler. [00:55:31] So, all right.
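The compressed form on the board is p(y | x; theta) = h(x)^y (1 - h(x))^(1 - y), for y in {0, 1}. A tiny sketch checking that each value of y switches off the right factor, using the 10%-malignant example from a moment ago:

```python
def p_y_given_x(y, h):
    """Compressed Bernoulli form: h**y * (1 - h)**(1 - y), valid for y in {0, 1}."""
    return h ** y * (1.0 - h) ** (1 - y)

h = 0.1  # say the hypothesis outputs a 10% chance the tumor is malignant
print(p_y_given_x(1, h))  # the y = 1 branch picks out h(x)
print(p_y_given_x(0, h))  # the y = 0 branch picks out 1 - h(x)
```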
All right, so we're going to use maximum likelihood estimation again. So let's write down the likelihood of the parameters. So it's actually the probability of all the y's given all the x's, parametrized by theta, which is equal to this, which is now equal to the product from i = 1 through m of h(x^(i)) to the power of y^(i), times (1 - h(x^(i))) to the power of (1 - y^(i)), OK? Where all I did was take this definition of P(y | x; theta), you know, from after we did that little exponentiation trick, and write it in here. [00:56:50] And then, for maximum likelihood estimation, we'll want to find the value of theta that maximizes the likelihood, maximizes the likelihood of the parameters. And so, same as what we did for linear regression, to make the algebra a bit more simple, we're going to take the log of the likelihood, and so compute the log likelihood. And so, let's see, take the log of that.
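Written out (m training examples, with h_theta the sigmoid hypothesis), the likelihood just stated and the log likelihood that appears on the board are:

```latex
L(\theta) = p(\vec{y} \mid X; \theta)
          = \prod_{i=1}^{m} h_\theta(x^{(i)})^{\,y^{(i)}}
            \left(1 - h_\theta(x^{(i)})\right)^{1 - y^{(i)}}

\ell(\theta) = \log L(\theta)
             = \sum_{i=1}^{m} y^{(i)} \log h_\theta(x^{(i)})
             + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right)
```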
You end up with the log likelihood ell(theta) [written out on the board]. And so, in other words, what you then want to do is try to choose the value of theta to try to maximize ell(theta). [00:58:10] So, just to summarize where we are, right: if you're trying to predict malignancy versus benignness of tumors, you have a training set with (x^(i), y^(i)); you define the likelihood, and then the log likelihood; and then what you need to do is have an algorithm, such as gradient descent (gradient ascent, we'll talk about that in a sec) to try to find the value of theta that maximizes the log likelihood. And then, having chosen the value of theta, when a new patient walks into the doctor's office, you would, you know, take the features of the new tumor, and then use h subscript theta to estimate the chance, for this new tumor, for the new patient that walks in tomorrow, estimate the chance that this new thing is malignant. [00:58:54] OK.
So the algorithm we're going to use to choose theta, to try to maximize the log likelihood, is gradient ascent, or batch gradient ascent. And what that means is, we will update the parameters theta_j according to: theta_j plus the partial derivative, with respect to theta_j, of the log likelihood, OK? And the differences from what you saw for linear regression from last time are the following, just two differences, I guess. For linear regression, last week, I had written this down: theta_j gets updated as theta_j minus the partial derivative with respect to theta_j of J(theta), right? So you saw this on Wednesday. So the two differences between these are: well, first, instead of J(theta), you're now trying to optimize the log likelihood instead of the squared cost function; and the second change is, previously you were trying to minimize the squared error, that's why we had the minus, and today you're trying to maximize the log likelihood, which is why there's a plus sign.
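Side by side, the two update rules being contrasted (alpha is the learning rate, which the lecture adds back in a moment):

```latex
% Linear regression: gradient descent, minimizing the squared-error cost J
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)
% Logistic regression: gradient ascent, maximizing the log likelihood \ell
\theta_j := \theta_j + \alpha \frac{\partial}{\partial \theta_j} \ell(\theta)
```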
OK. And so gradient descent, you know, is trying to climb down a hill, whereas gradient ascent, where you have a concave function like this, is trying to, right, climb up the hill, rather than climb down into the valley. So that's why there's a plus symbol here instead of a minus: because we maximize the function rather than minimize the function. [01:00:44] So the last thing to really flesh out this algorithm, which is done in the lecture notes, but which I don't want to do to you today, is to plug the definition of h subscript theta into this equation, and then take this thing, so that's the log likelihood of theta, and then through, you know, calculus and algebra, you can take derivatives of this whole thing with respect to theta. This is done in detail in the lecture notes; I don't want to use class time for it, but go ahead and take the derivatives of this big formula with respect to the parameters theta, in order to figure out what that thing is.
Right, what is this thing that I just circled? And it turns out that if you do so, you will find that batch gradient ascent is the following: you update theta_j according to... actually, I'm sorry, I forgot the learning rate. Yeah, there's the learning rate, the alpha, the learning rate alpha, times this. Because this term here is the partial derivative with respect to theta_j, and the full calculus-and-so-on derivation is given in the lecture notes. [01:02:12] [Student: is there a chance of local maxima in this case?] No, there isn't. It turns out that this function, the log likelihood function ell(theta) for logistic regression, always looks like that. So this is a concave function, so there are no local optima: the only maximum is the global maximum.
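The circled derivative, worked out in the lecture notes, gives the update theta_j := theta_j + alpha * sum_i (y^(i) - h_theta(x^(i))) * x_j^(i). A minimal batch gradient ascent sketch on hypothetical toy data (the learning rate and iteration count are arbitrary choices, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_ascent(X, y, alpha=0.01, iters=10000):
    """Batch gradient ascent on the logistic log likelihood.

    X is the m-by-n design matrix with a leading column of ones (intercept).
    Each step applies theta_j += alpha * sum_i (y_i - h(x_i)) * x_ij,
    the gradient of the log likelihood as derived in the lecture notes.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = sigmoid(X @ theta)          # current predicted probabilities
        theta += alpha * X.T @ (y - h)  # climb the concave log likelihood
    return theta

# Toy 1-D data: intercept column plus one feature.
X = np.array([[1.0, v] for v in [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
theta = logistic_regression_ascent(X, y)
print(sigmoid(X @ theta))  # below 0.5 for the three negatives, above for the positives
```

Because ell(theta) is concave, this climbs to the single global maximum from any starting point, which is the answer given to the local-maxima question above.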
There's actually another reason why we chose the logistic function: if you choose the logistic function, rather than some other function with this shape, you're guaranteed that the likelihood function has only one global maximum, and that's actually a big positive. Actually, what you'll see on Wednesday is that there's a big class of algorithms, of which linear regression is one example and logistic regression is another example, and for all of the algorithms in this class, there are no local optima problems when you derive them this way. So you'll see that on Wednesday when we talk about this. [01:03:05] OK. So, actually, I think here's just one question for you to think about. This looks exactly the same as what we derived for linear regression, right? Actually, the difference is: for linear regression, I had a minus sign here, and I reversed these two terms. I think there's a sign flip, so if you put the minus sign there and reverse these two terms, taking the minus inside, this is actually exactly the same as what we had come up with for linear regression.
[01:03:31] actually exactly the same as what we had come up with for linear regression. So why is this different? Right, I started off saying don't use linear regression for classification problems, because of that problem — that a single example could really, you know... I'm sorry, I started off with an example showing that linear regression is really bad for classification, and then we did all this work and came back to the same algorithm. So what happened? [01:04:00] All right — cool, awesome. Right, so what happened is that the definition of h of theta is now different than before, but at the surface level the equation turns out to be the same, okay? And again, it turns out that for every algorithm in this course, as you'll see on Wednesday, you end up with the same thing. It's actually a general property of a much bigger class of algorithms called generalized linear models. [01:04:22] Although — yeah, there's an interesting
historical divergence here: because of the confusion between these two algorithms in the early history of machine learning, there was some debate, with academics saying "no, I invented that" — "no, I invented that" — when they're actually different algorithms. All right, any questions? [01:04:48] Oh, great question: is there an equivalent of the normal equations for logistic regression? The short answer is no. For linear regression, the normal equations give you a one-shot way to just find the best value of theta. For logistic regression there is no known closed-form equation that finds the best value of theta, which is why you always have to use an iterative optimization algorithm such as gradient ascent or — as we'll see in a second — Newton's method. [01:05:21] Cool. So that's a great lead-in to the last topic for today, which is Newton's method. [01:05:56] Um, you know, gradient ascent, right — it's a good algorithm; I use gradient
ascent all the time — but it takes a baby step, and another baby step, and another baby step; it takes a lot of iterations for gradient ascent to converge. [01:06:09] There's another algorithm called Newton's method which allows you to take much bigger jumps toward the best theta. So there are problems where you might need, let's say, a hundred iterations or a thousand iterations of gradient ascent, where if you run this algorithm called Newton's method you might need only ten iterations to get a very good value of theta. But each iteration will be more expensive — we'll talk about the pros and cons in a second. [01:06:33] But, um, let's describe this algorithm, which is sometimes much faster than gradient ascent for optimizing the value of theta, okay? So, um, what we'd like to do is — let me use a simplified one-dimensional problem to describe Newton's method. So I'm going to solve a slightly
[01:07:05] different problem with Newton's method, which is: say you have some function f, and you want to find theta such that f of theta is equal to zero, okay? So this is the problem that Newton's method solves. And the way we're going to use this later is: what you really want is to maximize ℓ of theta, right? Well, at the maximum the first derivative must be zero — i.e., you want a value where the derivative, ℓ prime of theta, is equal to zero, right? And ℓ prime is the derivative with respect to theta — ℓ prime is just another notation for the first derivative. [01:07:59] So whether you want to maximize the function or minimize the function, what that means is you want to find a point where the derivative is equal to zero. So the way we're going to use Newton's method is: we're going to set f of theta equal to the derivative, and then try to find the point where the derivative
is equal to zero, okay? But to explain Newton's method, I'm going to, you know, work on this other problem, where you have a function f and you just want to find the value of theta where f of theta is equal to zero; then we'll set f equal to ℓ prime of theta, and that's how we'll apply this to logistic regression. [01:08:33] So let me draw in pictures how this algorithm works. All right, so let's say that's the function f, and, you know, to make this drawable on a whiteboard I'm going to assume theta is just a real number for now — theta is just a single, you know, scalar, a real number. [01:09:07] So this is how Newton's method works. Oh, and the goal is to find this point, right — the goal is to find the value of theta where f of theta is equal to zero, okay? So let's say you start off at this point, right? At the first iteration — you know, normally you'd initialize theta to zero
or something — but let's say you start off at that point. [01:09:32] This is how one iteration of Newton's method will work. You start off with theta_0 — that's just the first value, the first iteration. What we're going to do is look at the function f and find the line that's tangent to f — so take the derivative of f, and find the line that's just tangent to f at that point. Take that red line there, which just touches the function f; we're going to use this straight-line approximation to f, and solve for where the straight line crosses the horizontal axis. [01:10:15] Okay, and then we're going to set this — and that's one iteration of Newton's method. So we're going to move from this value to this value. Then, in the second iteration of Newton's method, we're going to look at this point and again, you know, take a line that's
[01:10:33] just tangent to it, then solve for where this touches the horizontal axis, and then that's after two iterations of Newton's method, right? And then you repeat. Sometimes you can overshoot a little bit, but that's okay, right? And then it cycles back around — that's theta_3 — and then you take this, let's say, theta_4. [01:11:12] So you can tell that, um, Newton's method is actually a pretty fast algorithm: within just, what, one, two, three, four iterations, we've gotten really, really close to the point where f of theta is equal to zero. [01:11:29] So let's write out the math for how you do this. Let's see — let me just write out and derive, you know, how you go from theta_0 to theta_1. I'm going to use this horizontal distance — I'm going to denote it Delta; this triangle is the upper-case Greek letter Delta, right? This is lower-
case delta; that's upper-case Delta, right? And then the height here — well, that's just f of theta_0. This is the height; it's just f of theta_0. [01:12:15] And so, let's see — what we'd like to do is solve for the value of Delta, because one iteration of Newton's method is to set theta_1 to theta_0 minus Delta, right? So how do you solve for Delta? Well, from calculus we know that the slope of the function f is the rise over the run — the height over the width — and so we know that the derivative, f prime at the point theta_0, is equal to the height, f of theta_0, divided by the horizontal distance, right? So the derivative — meaning the slope of the red line — is, by the definition of the derivative, this ratio of the height to the width. And so Delta is equal to f of theta_0 over f prime of theta_0, and if you plug that in, then you find
that a single iteration of Newton's method is the following rule: theta_{t+1} gets updated as theta_t minus f of theta_t over f prime of theta_t, okay — where instead of 0 and 1 I've replaced the subscripts with t and t+1. [01:13:46] And finally, you know, the very first thing we did was to let f of theta be equal to ℓ prime of theta, right — because we want to find the place where the first derivative of ℓ is zero. Then this becomes: theta_{t+1} gets updated as theta_t minus ℓ prime of theta_t over ℓ double-prime of theta_t. So it's really the first derivative divided by the second derivative. [01:14:38] Newton's method is a very fast algorithm, and Newton's method enjoys a property called quadratic convergence — not a great name; don't worry too much about what it means. But the informal meaning is this: suppose on one iteration Newton's method has 0.01 error — on the x-axis, you're 0.01 away from the true minimum.
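The scalar update theta := theta − f(theta)/f′(theta) is easy to try out. Here is a minimal sketch; the function names and the example f(theta) = theta² − 2, whose positive root is √2, are my own choices for illustration, not from the lecture:

```python
def newton_1d(f, f_prime, theta, iters=6):
    """Solve f(theta) = 0 by iterating theta := theta - f(theta) / f'(theta)."""
    for _ in range(iters):
        theta = theta - f(theta) / f_prime(theta)
    return theta

# Example: find the root of f(theta) = theta^2 - 2, i.e. sqrt(2).
f = lambda t: t * t - 2.0
f_prime = lambda t: 2.0 * t
root = newton_1d(f, f_prime, theta=1.0)
```

Printing the error |theta − √2| after each iteration shows it falling from roughly 1e-1 to 1e-3 to 1e-6 to 1e-12 — the digit-doubling behavior the lecture describes.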
That is, 0.01 away from the true value where f is zero. After one iteration, the error could go to 0.0001, and after two iterations to roughly 0.00000001. [01:15:26] Roughly — under certain assumptions, that the function is smooth and not too far from quadratic — the number of significant digits to which you have converged to the minimum doubles on a single iteration. This is called quadratic convergence, and so when you get near the minimum, Newton's method converges extremely rapidly, right? After a single iteration it becomes much more accurate, and after another iteration it becomes way, way, way more accurate, which is why Newton's method requires relatively few iterations. [01:15:58] And, let's see — I have written out Newton's method for when theta is a real number. When theta is a vector, [01:16:13] the generalization of the rule I wrote above is the following: theta_{t+1} gets updated as theta_t plus H inverse times the gradient of ℓ, where H is the Hessian
[01:16:28] plus h that where X is the Hessian matrix so these details are written in [01:16:37] matrix so these details are written in lecture notes but to give you a sense it [01:16:40] lecture notes but to give you a sense it when theta is a vector this is a vector [01:16:43] when theta is a vector this is a vector of derivatives it says I guess this part [01:16:48] of derivatives it says I guess this part n plus 1 dimensional if nature is an RN [01:16:53] n plus 1 dimensional if nature is an RN plus 1 then this derivative respect to [01:16:57] plus 1 then this derivative respect to theta of the log-likelihood becomes a [01:16:59] theta of the log-likelihood becomes a vector of derivatives and the Hessian [01:17:01] vector of derivatives and the Hessian matrix this becomes in matrixes are n [01:17:04] matrix this becomes in matrixes are n plus 1 by n plus 1 so becomes a square [01:17:09] plus 1 by n plus 1 so becomes a square matrix with the dimension equal to the [01:17:11] matrix with the dimension equal to the parameter vector theta and the Hessian [01:17:14] parameter vector theta and the Hessian matrix is defined as the matrix of [01:17:16] matrix is defined as the matrix of partial derivatives right so and so the [01:17:26] partial derivatives right so and so the disadvantage of Newton's method is that [01:17:29] disadvantage of Newton's method is that in high dimensional problems if theta is [01:17:32] in high dimensional problems if theta is a vector that each step of Newton's [01:17:34] a vector that each step of Newton's method is much more expensive because [01:17:37] method is much more expensive because you're either solving a linear system [01:17:39] you're either solving a linear system craisins or having to convert to pretty [01:17:41] craisins or having to convert to pretty big matrix so if theta is 10 dimensional [01:17:44] big matrix so if theta is 10 dimensional you know this involves inverting a 10 by [01:17:47] you know this involves 
inverting a 10-by-10 matrix, which is fine; but if theta were 10,000- or 100,000-dimensional, then each iteration requires computing something like a 100,000-by-100,000 matrix and inverting it, which is very hard — it's very, very difficult to do that in very high-dimensional problems. [01:18:04] So, you know, some rules of thumb: if the number of parameters in your logistic regression is not too big — if you have 10 parameters or 50 parameters — I would almost certainly... I would very likely use Newton's method, and you'd probably get convergence in maybe ten iterations, or, you know, 15 iterations, or even fewer than ten. But with a very large number of parameters — if you have, you know, ten thousand parameters — then rather than dealing with a 10,000-by-10,000 matrix, or even bigger, a 50,000-by-50,000 matrix if you have 50,000 parameters, I would use gradient descent instead.
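To make the vector case concrete, here is a minimal NumPy sketch of Newton's method for the logistic-regression log-likelihood — my own illustration, not code from the course. It assumes `X` carries an intercept column of ones and `y` is in {0, 1}, and it solves the linear system H·d = gradient rather than explicitly inverting the Hessian, which is exactly the per-step cost being discussed:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(X, y, iters=8):
    """Maximize the logistic-regression log-likelihood with Newton's method.

    Gradient: X^T (y - h);  Hessian: -X^T diag(h(1-h)) X  (ell is concave).
    Update:   theta := theta - H^{-1} grad, done via a linear solve.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = sigmoid(X @ theta)
        grad = X.T @ (y - h)
        H = -(X.T * (h * (1.0 - h))) @ X   # X^T diag(w) X via broadcasting
        theta -= np.linalg.solve(H, grad)  # cheaper than forming H^{-1}
    return theta

# Tiny non-separable 1-D example (intercept column plus one feature).
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, -0.5],
              [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
theta = newton_logistic(X, y)
```

With n + 1 = 2 parameters the solve is trivial; the point of the rules of thumb above is that this solve is the step whose cost blows up as the parameter count grows.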
[01:18:42] Okay — but if the number of parameters is not too big, so that the computational cost per iteration is manageable, then Newton's method converges in a very small number of iterations and can be a much faster algorithm than gradient descent. [01:19:01] All right, so that's it for Newton's method. On Wednesday — in the remaining time on Wednesday — you'll hear about generalized linear models. I think, unfortunately, I promised to be in Washington, D.C. tonight, I guess through Wednesday, so you'll hear from — I think Anand will give the lecture on Wednesday, but I will be back next week. Unfortunate timing, but because of that I'll step out after this lecture. So thanks, everyone.

================================================================================
LECTURE 004
================================================================================
Lecture 4 - Perceptron & Generalized Linear Model | Stanford CS229: Machine Learning (Autumn 2018)
Source: https://www.youtube.com/watch?v=iZTeva0WSTQ
---
Transcript

[00:00:04] A couple of announcements before we get started. So, first of
all, ps1 is out — problem set 1. It is due on the 17th; that's two weeks from today, so you have exactly two weeks to work on it. You can take up to two or three late days — I think you can take up to three late days. There's a good amount of programming and a good amount of math you need to do. [00:00:35] The ps1 solutions need to be uploaded to Gradescope. You'll have to make two submissions: one submission will be a PDF file, for which you can either use a LaTeX template that we provide or handwrite it, but you're strongly encouraged to use the LaTeX template; and there is a separate coding assignment, for which you'll have to submit code as a separate Gradescope assignment. So you're going to see two assignments in Gradescope: one is for the written part, the other is for the programming part. [00:01:11] With that,
let's jump right into today's topics. [00:01:15] So today we're going to cover, briefly, the perceptron algorithm; then, you know, a good chunk of today is going to be the exponential family and generalized linear models; and we'll end with softmax regression for multi-class classification. [00:01:34] So: the perceptron. First of all, the perceptron algorithm, I should mention, is not something that is widely used in practice — we study it mostly for historical reasons, and also because it's nice and simple, it's easy to analyze, and we also have homework questions on it. [00:02:03] So, logistic regression: we saw that logistic regression uses the sigmoid function, [00:02:33] which essentially squeezes the entire real line — from minus infinity to infinity — to between zero and
one, and the zero and one kind of represent a probability, right? [00:02:53] You could also think of a variant of that, which would be the perceptron. So in the sigmoid function, at z equals 0, g of z is one half; as z tends to minus infinity, g tends to 0; and as z tends to plus infinity, g tends to 1. The perceptron algorithm uses a somewhat similar but different function: [00:03:49] g of z in this case is 1 if z is greater than or equal to 0, and 0 if z is less than 0. So you can think of this as the hard version of the sigmoid function, right? [00:04:16] And this leads to the hypothesis function here being h_theta of x equals g of theta transpose x — theta is the parameter, x is the input — and h_theta of x will be 0 or 1, depending on whether theta transpose x was less than 0 or greater than or equal to 0.
Similarly, in logistic regression we had h_theta of x equal to g of theta transpose x, where g is the sigmoid function. [00:05:10] Both of them have a common update rule, which on the surface looks the same: theta_j := theta_j + alpha (y^(i) − h_theta(x^(i))) x_j^(i). [00:05:37] So the update rules for the perceptron and logistic regression look the same, except that h_theta of x means different things in the two different scenarios. We also saw that it was similar for linear regression as well, and we're going to see why this is actually a more common theme. [00:06:01] So what's happening here? If you inspect this equation to get a better sense of what's happening in the perceptron algorithm: this quantity over here is a scalar, right? It's the difference between y^(i), which can be either 0 or 1, and h_theta of x^(i), which can also be either 0 or 1.
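Putting the hard-threshold hypothesis and this one-example-at-a-time update rule together, a perceptron sketch in plain Python might look like the following — a toy illustration with made-up data and function names of my own, using labels in {0, 1} as in the lecture:

```python
def predict(theta, x):
    """Hard-threshold hypothesis: g(theta^T x) = 1 if theta^T x >= 0, else 0."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1 if z >= 0 else 0

def perceptron_train(examples, alpha=1.0, epochs=10):
    """One example at a time: theta := theta + alpha * (y - h_theta(x)) * x."""
    n = len(examples[0][0])
    theta = [0.0] * n
    for _ in range(epochs):
        for x, y in examples:
            err = y - predict(theta, x)  # 0 if correct, +1 or -1 if wrong
            if err != 0:
                theta = [t + alpha * err * xi for t, xi in zip(theta, x)]
    return theta

# Toy separable data; each x has a leading 1 for the intercept term.
data = [([1.0, 2.0, 1.0], 1), ([1.0, 1.5, 2.0], 1),
        ([1.0, -1.0, -0.5], 0), ([1.0, -2.0, -1.0], 0)]
theta = perceptron_train(data)
```

Note that theta only moves on mistakes — exactly the (y − h) factor being discussed: zero when the prediction is right, ±1 when it is wrong.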
makes a prediction h_theta(x^(i)) for a given x^(i), [00:06:34] this quantity will be 0 if the algorithm got it right already, and it will be either +1 or -1 if it was wrong: [00:06:59] if the ground truth was 1 and the algorithm predicted 0, this evaluates to +1 (wrong, and y^(i) equals 1), [00:07:13] and similarly it is -1 if it was wrong and y^(i) equals 0. [00:07:27] So to see what's happening, it's useful to look at this picture. This is the input space, [00:07:46] and let's imagine there are two classes, boxes and, let's say, circles, and you want to learn an algorithm that can separate these two classes. [00:08:04] Imagine that what the algorithm has learned so far is a theta that represents this decision boundary, so this line represents theta^T x = 0, and
anything above it has theta^T x greater than zero, and anything below has theta^T x less than zero. [00:08:32] And let's say the algorithm is learning one example at a time, and a new example comes in; this time it happens to be a square, a box, but the algorithm has misclassified it. [00:08:55] Now, the vector equivalent of this line, the separating boundary, is a vector that's normal to the line, so this would be theta, and this is our new x. [00:09:11] So this x got misclassified; it's lying on the wrong side of the decision boundary. So what's going to happen here? [00:09:23] Let's call the boxes the 1 class and the circles the 0 class. So y^(i) - h_theta(x^(i)) will be +1, and what the algorithm does is set theta to theta + alpha * x. [00:09:41] So this is the old theta, this is x, and alpha is some small
learning rate. [00:09:48] So it adds (let me use a different color here) alpha times x to theta, and now let's call theta prime the new vector, the updated value, [00:10:07] and the separating hyperplane corresponding to it is whatever is normal to it. So it updated the decision boundary such that x is now included in the positive class. [00:10:25] The idea here is that we want theta to be similar to x, in general, where y is 1, and we want theta to be dissimilar to x when y equals 0. The reason is that when two vectors are similar, their dot product is positive, and when they are dissimilar, their dot product is negative. [00:10:52] What does that mean? If, let's say, this is x, and you have a theta that's kind of pointed away from it, their dot product would be negative, and if you have a
theta that looks like theta prime, then the dot product will be positive, because their angle is less than 90 degrees. [00:11:12] So this essentially means that as theta rotates, the decision boundary, which is perpendicular to theta, rotates with it, and you want to get all the positive x's on one side of the decision boundary. [00:11:24] And what's the most naive way, given x, of trying to make theta closer to x? The simple thing is to just add a component of x in that direction. [00:11:41] This is a very common technique used in lots of algorithms: if you add a vector to another vector, you make the second one closer to the first one, essentially. [00:11:50] So this is the perceptron algorithm: you go example by example in an online manner, and if the example is already correctly classified you do nothing (you get a 0 over here), and if it is misclassified you nudge theta by the example itself.
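The online loop just described can be sketched in a few lines of Python. This is only a minimal illustration of the update theta := theta + alpha * (y - h_theta(x)) * x; the toy data and the names (`perceptron_step`, `alpha`) are my own, not from the lecture:

```python
import numpy as np

def perceptron_step(theta, x, y, alpha=0.1):
    """One online perceptron update: theta := theta + alpha * (y - h_theta(x)) * x."""
    h = 1.0 if theta @ x > 0 else 0.0    # hard-threshold prediction, h in {0, 1}
    return theta + alpha * (y - h) * x   # no change when the prediction was correct

# Toy linearly separable data (assumed for illustration): class 1 above the line x1 + x2 = 0
X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0], [-0.5, -2.0]])
Y = np.array([1.0, 1.0, 0.0, 0.0])

theta = np.zeros(2)
for _ in range(20):                      # a few online passes over the data
    for x, y in zip(X, Y):
        theta = perceptron_step(theta, x, y)

print((X @ theta > 0).astype(float))     # matches Y on this separable toy set
```

On non-separable data (like the example below) this loop would cycle forever, which is exactly the stopping-criterion issue discussed in the Q&A.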
That is, you either add a small component of the example (the vector itself) to theta, or you subtract it, depending on the class of the example. That's about it. Any questions about the perceptron? [00:12:22] Cool, so let's move on to the next topic: exponential families. So an exponential family is essentially a class of... [a student asks about the perceptron] [00:12:42] Yeah, it's not used in practice. For one, it does not have a probabilistic interpretation of what's happening; you kind of have a geometric feel of what's happening with the hyperplane, but it doesn't have a probabilistic interpretation. [00:13:00] Also, the perceptron was pretty famous in, I think, the nineteen fifties or sixties, when people thought this was a good model of how the brain works, and I think it was Marvin Minsky who wrote a paper saying, you know, the perceptron is kind of limited, because it could never classify points arranged like this.
There's no possible separating boundary that can do something as simple as that, and people kind of lost interest in it. [00:13:31] But in fact, what we'll see is that logistic regression is like a softer version of the perceptron itself, in a way. [a student asks about convergence and when to stop] [00:13:47] Yeah, it's up to you; it's a design choice that you make. What you could do is anneal your learning rate with every step: every time you see a new example, decrease your learning rate, until you stop changing theta by a lot. [00:14:05] You're not guaranteed that you'll be able to get every example right; for example here, no matter how long you train, you're never going to find a separating boundary. [00:14:16] So it's up to you when you want to stop training; a common thing is to just decrease the learning rate with every time step until you stop making changes. [00:14:27] All right, let's move on to exponential
families is is a [00:14:33] families so exponential families is is a class of probability distributions which [00:14:37] class of probability distributions which are somewhat nice mathematically right [00:14:39] are somewhat nice mathematically right they're also very closely related to [00:14:42] they're also very closely related to GLM's which we will be going over next [00:14:46] GLM's which we will be going over next right but first we kind of take a deeper [00:14:48] right but first we kind of take a deeper look at exponential families and and and [00:14:52] look at exponential families and and and what they're about so an exponential [00:14:54] what they're about so an exponential family is one whose PDF so whose PDF can [00:15:13] family is one whose PDF so whose PDF can be written in the form my PDF I mean [00:15:16] be written in the form my PDF I mean probability density function with a [00:15:18] probability density function with a discrete distribution then it would be [00:15:20] discrete distribution then it would be the probability mass function and this [00:15:23] the probability mass function and this PDF can be written in the form [00:15:47] right this looks pretty scary let's [00:15:50] right this looks pretty scary let's let's kind of break it down into you [00:15:52] let's kind of break it down into you know what what they actually mean [00:15:54] know what what they actually mean so why over here is the data right and [00:16:00] so why over here is the data right and there is a reason why we call it why [00:16:02] there is a reason why we call it why because yeah a bit larger sure [00:16:28] this is better so why is the data and [00:16:32] this is better so why is the data and the reason there's a reason what we call [00:16:33] the reason there's a reason what we call it Y and not X and and that's because [00:16:36] it Y and not X and and that's because we're going to use exponential families [00:16:38] we're going to use exponential families 
to model the output of your data, in a supervised learning setting. [00:16:45] We're going to see x when we move on to GLMs; until then, we're just going to deal with y's. [00:16:49] So y is the data. Eta is called the natural parameter. [00:17:06] T(y) is called the sufficient statistic; if you have a statistics background and you've come across the term sufficient statistic before, it's the exact same thing, but you don't need to know much about it, because for all the distributions we're going to see today, and in this class, T(y) will be equal to just y. So you can just replace T(y) with y for all the examples today and in the rest of the class. [00:17:42] b(y) is called the base measure, and finally, a(eta) is called the log partition function, and you're going to be seeing a lot of
[00:18:01] this function, the log partition function. Right, so again: y is the data that this probability distribution is trying to model; eta is the parameter of the distribution; [00:18:14] T(y), which will mostly be just y for us (technically, writing T(y) is more correct), like b(y), has to be a function of only y; these functions cannot involve eta. [00:18:35] b(y) is called the base measure, and a(eta), which has to be a function of only eta and constants (no y can be part of a(eta)), is called the log partition function. [00:18:50] The reason why this is called the log partition function is pretty easy to see, because the PDF can be written as

    p(y; eta) = b(y) * exp( eta^T * T(y) ) / exp( a(eta) )

These two are exactly the same; you just take the a(eta) term out. [00:19:38] Oh yeah, you're right, this should be positive, thank you. [00:19:53] So you can think of exp(a(eta)) as a normalizing
constant of the distribution, such that the whole thing integrates to 1, [00:20:01] and therefore the log of that normalizer will be a(eta). So a(eta) is called the log of the partition function; the partition function is a technical term for the normalizing constant of a probability distribution. [00:20:11] Now, you can plug in any definition of b, a and T. [a student asks about dimensions] Sure. So y is your y, and for most of our examples it's going to be a scalar. [00:20:34] Eta can be a vector, but we will be focusing on the case where it's a scalar, except maybe in softmax. T(y) has to match: the dimensions of eta and T(y) have to match, and b(y) and a(eta) are scalars. [00:21:02] So for any choice of a, b and T that you put in (that's completely your choice), as long as the expression integrates to 1, you have a family in the exponential family. [00:21:15] What does that mean? For a specific choice of, say,
some choice of a, b and T, this expression will actually be equal to, say, the PDF of the Gaussian, [00:21:30] in which case, for that choice of T, a and b, you got the Gaussian distribution: a family of Gaussian distributions, such that for any value of the parameter you get a member of the Gaussian family. [00:21:49] And to show that a distribution is in the exponential family, the most straightforward way to do it is to write out the PDF of the distribution in the form that you know, [00:21:58] do some algebraic massaging to bring it into this form, and then do a pattern match and conclude that it's a member of the exponential family. So let's do it for a couple of examples. [00:22:33] So a Bernoulli distribution is one you use to model binary data, and it has a parameter, let's call it phi, which is, you know, the probability of the event happening or
not. [00:23:05] Now, what is the PDF of a Bernoulli distribution? One way to write it is

    p(y; phi) = phi^y * (1 - phi)^(1 - y)

Makes sense? This pattern is like a way of writing a programmatic if/else in math: [00:23:29] whenever y is 1, the second term cancels out, so the answer would be phi, and whenever y is 0, the first term cancels out and the answer is 1 - phi. So this is just a mathematical way to represent an if/else that you would do in programming. [00:23:45] So this is the PDF of the Bernoulli, and our goal is to take this form and massage it into that form, and see what the individual T, b and a turn out to be. [00:24:00] So whenever you see a distribution in this form, a common technique is to wrap it with a log and an exp, because these two cancel out, so this is actually exactly equal to

    exp( log( phi^y * (1 - phi)^(1 - y) ) )

[00:24:36] and if you do some more algebra on this, we
will see that this turns out to be

    exp( log(phi / (1 - phi)) * y + log(1 - phi) )

[00:24:59] It's pretty straightforward to go from here to here; I'll let you verify it yourselves. [00:25:05] But once we have it in this form, it's easy to start doing some pattern matching from this expression to that expression. [00:25:16] So what we see here is: the base measure b(y), if you match this with that, will be just 1, because there's no b(y) term here. [00:25:37] log(phi / (1 - phi)) would be eta, y would be T(y), and -log(1 - phi) would be a(eta); you can see that they kind of match in pattern. [00:25:55] So b(y) would be 1, T(y) is just y, as expected, and eta is equal to log(phi / (1 - phi)). [00:26:20] An equivalent statement is to invert this operation and say phi is equal to 1 / (1 + e^(-eta)). [00:26:33] I'm just flipping the operation: there it went from phi to eta, and it's the
equivalent in the other direction; now it goes from eta to phi. [00:26:44] And a(eta) is going to be... so here we have it as a function of phi, but we got an expression for phi in terms of eta, so you can plug that expression in, with the change of a minus sign. Let me work out the substitution: a is -log(1 - phi) (that's just pattern matching from above), and the reason is that we want an expression in terms of eta; here we got it in terms of phi, but we need to plug in eta. [00:27:35] And this will just be

    a(eta) = log(1 + e^eta)

[00:27:48] So there we go. This kind of verifies that the Bernoulli distribution is a member of the exponential family. Any questions here? [00:27:58] So note that this may look familiar: the expression for phi looks somewhat like the sigmoid function, and this is actually no accident.
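As a sanity check on the pattern match above, you can verify numerically that eta is the logit of phi, that inverting it gives the sigmoid, and that b(y) * exp(eta * y - a(eta)) reproduces phi^y * (1 - phi)^(1 - y). A small sketch (the variable names are my own):

```python
import math

phi = 0.3                                  # canonical Bernoulli parameter
eta = math.log(phi / (1 - phi))            # natural parameter: the logit of phi
a = math.log(1 + math.exp(eta))            # log partition function a(eta)

# Inverting the link recovers phi; this is exactly the sigmoid of eta
phi_back = 1 / (1 + math.exp(-eta))
print(abs(phi_back - phi) < 1e-12)         # True

# With b(y) = 1 and T(y) = y, the exponential-family form matches the Bernoulli PMF
for y in (0, 1):
    p_expfam = math.exp(eta * y - a)
    p_bern = phi**y * (1 - phi)**(1 - y)
    print(abs(p_expfam - p_bern) < 1e-12)  # True for both y = 0 and y = 1
```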
We will see how it actually relates to the sigmoid and to logistic regression in a minute. [00:28:27] So, another example: a Gaussian with fixed variance. [00:28:42] A Gaussian distribution has two parameters, the mean and the variance; for our purposes, we're going to assume a constant variance. [00:28:57] You can also consider the version where the variance is also a variable, but for our course we're only interested in Gaussians with fixed variance, and we are going to assume the variance is equal to one. [00:29:19] So this makes the PDF of the Gaussian look like this: p(y) parametrized by mu. Note that when we start writing it out, we start with the parameters that we are commonly used to (they're also called the canonical parameters), and then we set up a link between the canonical parameters and the natural parameters; that's part of the massaging exercise that we do. [00:29:44] So we're going
to start with the canonical parameters:

    p(y; mu) = (1 / sqrt(2*pi)) * exp( -(y - mu)^2 / 2 )

[00:30:06] So this is the Gaussian PDF with variance equal to 1, [00:30:10] and this can be rewritten as (again, I'm skipping a few algebra steps; straightforward, no tricks there)

    p(y; mu) = (1 / sqrt(2*pi)) * e^(-y^2 / 2) * exp( mu * y - mu^2 / 2 )

[00:30:46] Again, we go through the same exercise and pattern match: (1 / sqrt(2*pi)) * e^(-y^2 / 2) is b(y), mu is eta, y is T(y), and mu^2 / 2 would be a(eta). [00:31:07] So we have a b(y); note that this is a function of only y, there's no eta here. T(y) is just y, and in this case the natural parameter eta is mu, and the log partition function is equal to mu^2 / 2. [00:31:41] And we repeat the same exercise we did here: we start with a log partition function that is parametrized by the canonical parameter, and we use the link between the canonical and the natural parameters, inverted; in this case the inverse link is just the identity, eta = mu.
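The same kind of numeric sanity check works for the Gaussian rewrite above: with b(y) = (1 / sqrt(2*pi)) * e^(-y^2 / 2), T(y) = y and a(eta) = eta^2 / 2, the exponential-family form should reproduce the N(mu, 1) density. A small sketch (the function names are my own):

```python
import math

def gaussian_pdf(y, mu):
    """N(mu, 1) density in its usual form."""
    return (1 / math.sqrt(2 * math.pi)) * math.exp(-(y - mu) ** 2 / 2)

def expfam_pdf(y, eta):
    """Same density as b(y) * exp(eta * y - a(eta)) with a(eta) = eta^2 / 2."""
    b = (1 / math.sqrt(2 * math.pi)) * math.exp(-y ** 2 / 2)
    return b * math.exp(eta * y - eta ** 2 / 2)

mu = 1.7  # for fixed variance 1, the natural parameter equals the mean: eta = mu
for y in (-1.0, 0.0, 2.5):
    print(abs(gaussian_pdf(y, mu) - expfam_pdf(y, mu)) < 1e-12)  # True for each y
```

The two forms agree term by term once you expand -(y - mu)^2 / 2 = -y^2/2 + mu*y - mu^2/2.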
[00:32:05] So a(eta) is a function of only eta; again, here a(eta) was a function of only eta, and T(y) is a function of only y, and b(y) is a function of only y as well. Any questions on this? [a student asks about unknown variance] Yeah, if the variance is unknown, you can still write it as an exponential family, in which case eta will now be a vector, not a scalar; it will have an eta_1 and an eta_2, and you will also have a mapping between each of the canonical parameters and each of the natural parameters. You can do it; it's pretty straightforward. [00:32:51] Right, so these are exponential families. [00:32:59] The reason why we use the exponential family is because it has some nice mathematical properties. [00:33:15] So one property is: if we perform maximum likelihood on the exponential family, when the exponential
the exponential family is parameterized in the natural [00:33:29] family is parameterized in the natural parameters then the optimization problem [00:33:33] parameters then the optimization problem is concave so MLE with respect to ETA is [00:33:42] is concave so MLE with respect to ETA is concave similarly if you flip the sign [00:33:47] concave similarly if you flip the sign and use the the what's called the [00:33:49] and use the the what's called the negative log likelihood so take the log [00:33:51] negative log likelihood so take the log of the expression negated and in in this [00:33:53] of the expression negated and in in this case the negative log likelihood is like [00:33:55] case the negative log likelihood is like the cost function equivalent of doing [00:33:57] the cost function equivalent of doing maximum likelihood you're just flipping [00:33:59] maximum likelihood you're just flipping a sign instead of maximizing you [00:34:00] a sign instead of maximizing you minimize the negative log likelihood so [00:34:02] minimize the negative log likelihood so the and and you know the NLL is there [00:34:05] the and and you know the NLL is there for convex the expectation of why [00:34:25] what does this mean each of the [00:34:31] what does this mean each of the distribution we start with a of a to [00:34:34] distribution we start with a of a to differentiate this with respect to Etta [00:34:37] differentiate this with respect to Etta the lock partition function with respect [00:34:38] the lock partition function with respect to a toss and you get another function [00:34:42] to a toss and you get another function with respect to beta and that function [00:34:44] with respect to beta and that function will is the mean of the distribution as [00:34:47] will is the mean of the distribution as parameterize by a turn and similarly the [00:34:52] parameterize by a turn and similarly the variance of y it's just the second [00:34:59] variance of y it's just the 
second derivative this was the first derivative [00:35:00] derivative this was the first derivative this is the second derivative so the [00:35:12] this is the second derivative so the reason why this is nice is because in [00:35:14] reason why this is nice is because in general for probability distributions to [00:35:16] general for probability distributions to calculate the mean and the variance you [00:35:18] calculate the mean and the variance you generally need to integrate something [00:35:19] generally need to integrate something but over here you just need to [00:35:21] but over here you just need to differentiate which is a lot easier [00:35:22] differentiate which is a lot easier operation and and you will be proving [00:35:32] operation and and you will be proving these properties in your first homework [00:35:39] you provided hint search should be [00:35:42] you provided hint search should be right so now we're going to move on to [00:35:47] right so now we're going to move on to generalized linear models this this is [00:35:51] generalized linear models this this is all we want to talk about exponential [00:35:52] all we want to talk about exponential families any questions yeah exactly so [00:36:06] families any questions yeah exactly so if you're if you're if you're if it's a [00:36:10] if you're if you're if you're if it's a multivariate Gaussian then this data [00:36:12] multivariate Gaussian then this data would be a vector and this would be the [00:36:15] would be a vector and this would be the Hessian [00:36:22] all right let's move on to GLM's [00:36:35] so the GLM is is somewhat like a natural [00:36:40] so the GLM is is somewhat like a natural extension of the exponential families to [00:36:42] extension of the exponential families to include include covariates or include [00:36:47] include include covariates or include your input features in some way right [00:36:49] your input features in some way right so over here we are only dealing with 
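This differentiate-instead-of-integrate property is easy to check numerically. A minimal sketch (my own illustration, not course code) for the Bernoulli family, whose log-partition function is A(eta) = log(1 + e^eta):

```python
import math

# My own numeric check of the property that for an exponential family,
# E[y] = A'(eta) and Var(y) = A''(eta).
# Bernoulli log-partition: A(eta) = log(1 + e^eta).

def log_partition(eta):
    return math.log(1.0 + math.exp(eta))

def d1(f, x, h=1e-5):
    """Central first difference."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

def d2(f, x, h=1e-4):
    """Central second difference."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

eta = 0.7
phi = 1.0 / (1.0 + math.exp(-eta))   # canonical parameter = the mean

mean_via_A = d1(log_partition, eta)
var_via_A = d2(log_partition, eta)

assert abs(mean_via_A - phi) < 1e-6               # A'(eta) = E[y] = phi
assert abs(var_via_A - phi * (1.0 - phi)) < 1e-4  # A''(eta) = Var(y) = phi(1-phi)
```

The same check works for any member of the family: the first derivative of the log-partition function recovers the mean, the second recovers the variance.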
[00:36:52] In the exponential families we're only dealing with the y, which in our case will map to the outputs, but we can actually build a lot of powerful models by choosing an appropriate member of the exponential family and plugging it into a linear model. So, the assumptions we're going to make for GLMs — these are the assumptions, or design choices, that take us from exponential families to generalized linear models.

[00:37:43] The most important assumption is that y given x, parameterized by theta, is a member of an exponential family. By "exponential family" I mean that form — in a particular scenario it could take on any one of these distributions. We only talked about the Bernoulli and the Gaussian, but there are other distributions that are part of the exponential family too. For example — I forgot to mention this — if you have real-valued data, you use a Gaussian; if you have binary data, a Bernoulli; if you have counts — by "real-valued" I mean it can take any value between minus infinity and infinity, and by "count" I mean just the non-negative integers — you can use a Poisson. If you have positive real values, like the volume of some object, or the time to an event where you're only predicting into the future, you can use a gamma or an exponential.

[00:39:31] By the way, there is the exponential family, and there is also a distribution called the exponential distribution — two distinct things. The exponential distribution happens to be a member of the exponential family, but they're not the same thing. You can also have probability distributions over probability distributions, like the beta and the Dirichlet; those mostly show up in Bayesian machine learning or Bayesian statistics.

[00:40:10] So, depending on the kind of data you have: if you're trying to do regression, your y is going to be, say, a Gaussian; if you're doing binary classification, the exponential family member would be a Bernoulli. Depending on the problem, you can choose any member of the exponential family, as parameterized by theta. That's the first assumption: y conditioned on x is a member of the exponential family.

[00:40:48] The second design choice is that eta equals theta transpose x. This is where your x comes into the picture: theta is in R^n and x is also in R^n.
[00:41:12] Now, this n has nothing to do with anything in the exponential family; it's purely the dimension of your data — the x's, your inputs — and it does not show up anywhere else. We make a design choice that eta will be theta transpose x.

[00:41:40] A third assumption concerns test time: given a new x, we want to make an output. Given an x, we get an exponential family distribution, and the mean of that distribution is the prediction we make for that x. This may sound a little abstract, but we're going to make it more clear. What it essentially means is that the hypothesis function is just h of x equals the expected value of y given x. This is our hypothesis function, and we'll see that if you plug in a Gaussian as the exponential family, the hypothesis will be the same Gaussian hypothesis we saw in linear regression; if you plug in a Bernoulli, it turns out to be the same hypothesis we saw in logistic regression; and so on.

[00:42:56] One way to visualize this: there is a model and there is a distribution. The model we assume to be linear — given x, there is a learnable parameter theta, and theta transpose x gives you a parameter. That's the model. And here is the distribution: it is a member of the exponential family, and the parameter for this distribution is the output of the linear model. This is the picture you want to have in your mind. Depending on the data — whether it's a classification problem, a regression problem, or a time-to-event problem — you choose an appropriate b, a, and T based on the distribution of your choice. From this whole thing you can get the expectation of y given theta, which is the same as the expectation of y given theta transpose x, and that is essentially our hypothesis function.

[00:45:12] [Student question] That's exactly right — the question is: are we training theta to predict the parameter of the exponential family distribution whose mean is the prediction we're going to make for y? That's correct. So this is what we do at test time. During train time, how do we train this model? In this model, the parameters we learn by gradient descent are the thetas; we are not learning any of the parameters in the exponential family.
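The model-then-distribution picture can be sketched in a few lines of Python. This is a hedged illustration (the `predict` helper and the family names are my own, not course code): the linear model produces the natural parameter eta = theta^T x, and the hypothesis is the mean of the chosen family evaluated at that eta.

```python
import math

def predict(theta, x, family):
    """GLM test-time prediction: h_theta(x) = E[y | x] = mean of the family at eta."""
    eta = sum(t * xi for t, xi in zip(theta, x))   # linear model -> natural parameter
    if family == "gaussian":        # linear regression: mean is eta itself
        return eta
    if family == "bernoulli":       # logistic regression: mean is sigmoid(eta)
        return 1.0 / (1.0 + math.exp(-eta))
    if family == "poisson":         # Poisson regression: mean is exp(eta)
        return math.exp(eta)
    raise ValueError(f"unsupported family: {family}")

# The same theta and x yield different hypotheses per family choice.
theta, x = [0.5, -0.25], [1.0, 2.0]
print(predict(theta, x, "gaussian"))    # 0.0
print(predict(theta, x, "bernoulli"))   # 0.5
print(predict(theta, x, "poisson"))     # 1.0
```

Only the last step — the mean function — depends on the distribution; the linear model in front is identical for every GLM.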
[00:46:03] We are not learning mu or sigma squared or eta — we are learning theta. Theta is part of the model, not part of the distribution, and the output of the model becomes the distribution's parameter. It's unfortunate that we use the word "parameter" for both, but it's important to understand what is being learned during the training phase and what is not. The natural parameter is the output of a function; it's not a variable we do gradient descent on.

[00:46:33] So during learning, what we do is maximum likelihood, maximized with respect to theta: you do gradient ascent on the log probability of y, where the natural parameter has been reparameterized with a linear model, and you take gradients with respect to theta. That's the big picture of what's happening with GLMs and how they are an extension of exponential families: you reparameterize the natural parameter with a linear model and you get a GLM.

[00:47:39] Let's look in some more detail at what happens at train time. Another incidental benefit of using GLMs: at train time we want to do maximum likelihood on the log probability with respect to theta. At first it may appear that we need to do some more algebra — figure out the expression for p as a function of theta transpose x, take derivatives, and come up with a gradient update rule — but it turns out that no matter what kind of GLM you are doing, no matter which distribution you choose, the learning update rule is the same: theta_j := theta_j + alpha (y^(i) - h_theta(x^(i))) x_j^(i). You've seen this so many times by now, so you can straightaway apply this learning rule without ever having to do any more algebra to figure out what the gradients or the loss are. You go straight to the update rule: plug in the appropriate h_theta(x) depending on your choice of distribution, initialize theta to some random values, and start learning. Any questions on this?

[00:50:34] [Student question] If you want to do batch gradient descent, you just sum over all your examples. [Student question] Yeah, Newton's method is probably the most common one you would use with GLMs, and that comes with the assumption that the dimensionality of your data is not extremely high — as long as the number of features is less than a few thousand, you can do Newton's method. Any other questions? Cool.
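A toy sketch (my own, not course code) of this point: the per-example rule theta_j := theta_j + alpha (y - h_theta(x)) x_j is written once, and only the mean function changes with the family — here it trains a tiny logistic model, and swapping the sigmoid for `math.exp` would give Poisson regression with no other change.

```python
import math

# The shared GLM update rule, implemented once; only mean_fn varies by family.

def sgd_step(theta, x, y, alpha, mean_fn):
    """One stochastic gradient-ascent step on the GLM log likelihood."""
    eta = sum(t * xi for t, xi in zip(theta, x))   # natural parameter
    error = y - mean_fn(eta)                       # y - h_theta(x)
    return [t + alpha * error * xi for t, xi in zip(theta, x)]

sigmoid = lambda eta: 1.0 / (1.0 + math.exp(-eta))  # Bernoulli mean function

# Two linearly separable toy points (first feature is an intercept term).
data = [([1.0, 2.0], 1), ([1.0, -2.0], 0)]
theta = [0.0, 0.0]
for _ in range(500):                # logistic regression via the shared rule
    for x, y in data:
        theta = sgd_step(theta, x, y, alpha=0.1, mean_fn=sigmoid)

# Swapping mean_fn=math.exp here would train a Poisson regression instead.
```

After training, the fitted model assigns high probability to the positive point and low probability to the negative one, without any family-specific gradient derivation.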
[00:51:25] So this is the same update rule for any specific type of GLM, based on your choice of distribution — whether you're doing classification, regression, or, say, Poisson regression, the update rule is the same; you just plug in a different h_theta(x) and you get your learning rule.

[00:51:59] Now some more terminology. Eta is what we call the natural parameter. The function that links the natural parameter to the mean of the distribution — let's call the mean mu — has a name: it's called the canonical response function. Similarly, you can go from mu back to eta with its inverse, which is called the canonical link function. We also already saw that g of eta is equal to the gradient of the log-partition function with respect to eta — that's a side note about g.

[00:53:39] It's also helpful to make explicit the distinction between the three different kinds of parameterizations we have. We have three parameterizations: the model parameters, that's theta; the natural parameter, that's eta; and the canonical parameters — phi for the Bernoulli, mu and sigma squared for the Gaussian, lambda for the Poisson. These are three different ways we can parameterize either the exponential family or the GLM, and whenever we learn a GLM it is only the model parameters that we learn — the theta in the linear model. The connection between theta and eta is linear: theta transpose x gives you the natural parameter; that's the design choice we are making — we choose to reparameterize eta by a linear model, linear in your data. Between the natural and canonical parameters you have g to go one way and g inverse to come back, where g is also the derivative of the log-partition function.

[00:55:35] It's important to realize this, because it can get pretty confusing the first time you see it — there are so many parameters being swapped around and reparameterized. There are three spaces in which we parameterize our generalized linear models: the model parameters, the ones we learn; the output of the model, which is the natural parameter for the exponential family; and, after some algebraic manipulation, the canonical parameters for the distribution we choose depending on the task — whether it's classification or regression.
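For the Bernoulli family these maps are concrete enough to check; a sketch (helper names my own): g, the canonical response function, is the sigmoid taking the natural parameter to the mean, and its inverse, the logit, is the canonical link.

```python
import math

# Canonical response g and canonical link g_inv for the Bernoulli family.

def g(eta):
    """Canonical response: natural parameter -> mean of the Bernoulli."""
    return 1.0 / (1.0 + math.exp(-eta))

def g_inv(mu):
    """Canonical link (logit): mean -> natural parameter."""
    return math.log(mu / (1.0 - mu))

theta, x = [0.5, -1.0], [1.0, 2.0]
eta = sum(t * xi for t, xi in zip(theta, x))  # model params -> natural: theta^T x
mu = g(eta)                                   # natural -> canonical (the mean)

assert abs(g_inv(mu) - eta) < 1e-9            # the two maps are inverses
```

This traces exactly the three parameterizations from the lecture: theta (learned), eta = theta^T x (natural), and mu = g(eta) (canonical).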
[00:56:22] Any questions on this?

[00:56:32] So now you can actually see what happens when you're doing logistic regression. h_theta(x) is the expected value of y conditioned on x, and this is equal to phi, because here the choice of distribution is a Bernoulli and the mean of a Bernoulli distribution is just phi, in the canonical parameter space. And if we write that in terms of theta, it is 1 / (1 + e^(-theta transpose x)) — the logistic function. When we introduced logistic regression, we just pulled the logistic function out of thin air and said, hey, this is something that can squash minus infinity to infinity into the interval between 0 and 1, seems like a good choice. But now we see it is a natural outcome: it just pops out of this more elegant generalized linear model. If you choose the Bernoulli as the distribution of your output, logistic regression pops out naturally.

[00:58:29] Any questions? [Student question] Yeah, the choice of distribution really depends on the task you have. If your task is regression, where you want to output real-valued numbers like the price of a house, you choose a distribution over the real numbers, like a Gaussian. If your task is classification, where your output is binary 0 or 1, you choose a distribution that models binary data. So the task in a way influences which distribution you pick, and most of the time that choice is pretty obvious: if you want to model the number of visitors to a website, which is a count, you want a Poisson distribution, because the Poisson is a distribution over the non-negative integers.
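This "pops out naturally" claim can be verified numerically; a sketch (my own) checking that the Bernoulli pmf matches the exponential-family form p(y; eta) = b(y) exp(eta y - A(eta)) with eta = log(phi / (1 - phi)), so that inverting eta recovers exactly the logistic function.

```python
import math

# Bernoulli in exponential-family form: b(y) = 1,
# eta = log(phi / (1 - phi)), A(eta) = log(1 + e^eta).

phi = 0.3
eta = math.log(phi / (1.0 - phi))            # natural parameter
A = math.log(1.0 + math.exp(eta))            # log-partition function

for y in (0, 1):
    pmf = phi**y * (1.0 - phi)**(1 - y)      # standard Bernoulli pmf
    exp_family = math.exp(eta * y - A)       # exponential-family form
    assert abs(pmf - exp_family) < 1e-12     # the two forms agree

# Inverting eta = log(phi/(1-phi)) gives phi = 1/(1 + e^{-eta}): the sigmoid.
assert abs(1.0 / (1.0 + math.exp(-eta)) - phi) < 1e-12
```

Nothing about the sigmoid is assumed here — it falls out of solving the natural-parameter relation for phi, which is the lecture's point.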
decide you know [00:59:32] integers so the task decide you know pretty much tells you what distribution [00:59:34] pretty much tells you what distribution you want to choose and then you you do [00:59:37] you want to choose and then you you do the you know you do this you know all [00:59:41] the you know you do this you know all you you go through this machinery of [00:59:43] you you go through this machinery of figuring out what are the what what a [00:59:46] figuring out what are the what what a trait of X is and you plug in each state [00:59:48] trait of X is and you plug in each state of X over there and you have your [00:59:51] of X over there and you have your learning role any more questions so it [00:59:58] learning role any more questions so it so we made some assumptions these [01:00:02] so we made some assumptions these assumptions now it's it's also helpful [01:00:08] assumptions now it's it's also helpful to kind of get a visualization of what [01:00:10] to kind of get a visualization of what these assumptions actually mean [01:00:35] so to expand upon your point you know if [01:00:41] so to expand upon your point you know if you think of the question are GLM's used [01:00:43] you think of the question are GLM's used for classification or are they used for [01:00:44] for classification or are they used for regression or are they used for you know [01:00:46] regression or are they used for you know something else [01:00:48] something else the answer really depends on what is the [01:00:50] the answer really depends on what is the choice of distribution that you're going [01:00:52] choice of distribution that you're going to choose you know GLM's are just a [01:00:54] to choose you know GLM's are just a general way to model data and that data [01:00:56] general way to model data and that data could be you know binary it could be [01:00:58] could be you know binary it could be real valued and as long as you have a [01:01:01] real valued and as long as you 
have a distribution that can model that kind of data and that falls in the exponential family, it can just be plugged into a GLM and everything works out nicely. [01:01:19] So, the assumptions that we made. Well, let's start with regression. For regression, we assume there is some x. To simplify, I'm drawing x as one-dimensional, but x could be multi-dimensional. And there exists a theta, and theta transpose x would be some linear hyperplane, and this we assume is eta. [01:02:01] In the case of regression, eta is also mu, so eta equals mu, and then we are assuming that the y for any given x is distributed as a Gaussian with mu as the mean. Which means that for every possible x you have the appropriate eta, and with this as the mean (let's think of this axis as y) there is a Gaussian distribution at every possible x. [01:02:43] We assume a variance of one, so this is like a Gaussian with standard deviation, or variance, equal to one. So for every possible x there is a y given x, which is parameterized with theta transpose x as the mean, and you assume that your data is generated from this process. [01:03:04] So what does that mean? It means: given x, and let's say this axis is y, you would have examples in your training set that may look like this. The assumption here is that for this particular value of x there was a Gaussian distribution with its mean over here, and from this Gaussian distribution this value was sampled. You're just sampling it from the distribution. That is how your data is generated; again, this is our assumption. [01:03:55] Now, based on these assumptions, what we're doing with the GLM is: we start with the data, we don't know anything else, and we make an assumption.
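This data-generating story can be simulated directly. Below is a minimal sketch; the "true" theta, the dimensionality, and the sample size are all invented for illustration. Maximum likelihood under the unit-variance Gaussian assumption reduces to ordinary least squares, which is how theta is recovered at the end:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" parameters: the theta we pretend generated the data.
theta_true = np.array([2.0, -1.0])          # x is 2-dimensional in this sketch
n = 200

X = rng.normal(size=(n, 2))                 # design matrix, one row per example
eta = X @ theta_true                        # natural parameter eta = theta^T x
mu = eta                                    # for the Gaussian case, mu = eta

# The lecture's assumption: y | x ~ N(theta^T x, 1), variance fixed at one.
y = rng.normal(loc=mu, scale=1.0)

# "Working backwards": maximum likelihood under this Gaussian assumption
# is exactly least squares, so lstsq recovers an estimate of theta.
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_hat)                            # close to theta_true
```

With a couple hundred samples the recovered theta lands close to the invented one, which is the "find the line from which these y's were most likely sampled" step described next.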
The assumption is that there is some linear model from which the data was generated in this form, and we want to work backwards [01:04:13] to find the theta that will give us this line. So for a different choice of theta we get a different line. We assume that the line represents the mus, the means of the y's for each particular x from which it was sampled, and we are trying to find the line, which will be your theta transpose x, from which these y's are most likely to have been sampled. That's essentially what's happening when you do maximum likelihood with the GLM. [01:05:06] Similarly, for classification, again let's assume there is an x, and there is some theta transpose x, and this theta transpose x is equal to eta. We run this eta through the sigmoid function, 1 over 1 plus e to the minus eta, to get phi. [01:05:43] So if these are the etas, for each eta we run it through the sigmoid and we get something like this: this end tends to 1, this end tends to 0, and at the point where eta is 0 the sigmoid is 0.5. [01:06:12] And now at each point, at any given choice of x, we have a probability distribution. In this case it's binary, so let's say the probability of y equals 1 is the height to the sigmoid line. So at every x we essentially have a different Bernoulli distribution, where the probability of y equals 1 is the height to the sigmoid, obtained through the natural parameter. And from this you have a data-generating distribution: you have a few x's in your training set, and for those x's you calculate what the y distribution is and sample from it.
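The classification version of the story can be simulated the same way; a minimal sketch, where the theta and the sample size are again invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(eta):
    # 1 / (1 + e^{-eta}): maps the natural parameter to phi in (0, 1)
    return 1.0 / (1.0 + np.exp(-eta))

# Hypothetical theta assumed to have generated the labels.
theta_true = np.array([3.0, -2.0])
n = 1000

X = rng.normal(size=(n, 2))
eta = X @ theta_true              # eta = theta^T x, one real number per example
phi = sigmoid(eta)                # phi = P(y = 1 | x), the height of the sigmoid

# The generative assumption: y | x ~ Bernoulli(phi).
y = rng.binomial(1, phi)

# Sanity check on the shape of the sigmoid described in the lecture:
assert sigmoid(0.0) == 0.5        # at eta = 0 the sigmoid is exactly 0.5
print(y.mean())                   # fraction of positive labels in the sample
```

Logistic regression is then the "work backwards" step: given only `X` and `y`, find the theta under which these labels were most likely.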
[01:07:15] Right, and now again our goal is to start from this data. So over here this axis is x and this is y; these are points for which y is 0, and these are points for which y is 1. Given this data, we want to work backwards to find out what theta was: what is the theta that would have resulted in the sigmoid-like curve from which these y's were most likely to have been sampled? Figuring that out is essentially doing logistic regression. Any questions? [01:07:57] All right, so in the last ten minutes or so we will go over softmax regression. [01:08:30] So, softmax regression. In the lecture notes, softmax regression is explained as yet another member of the GLM family. However, in today's lecture we'll be taking a non-GLM approach and seeing how softmax is essentially doing what's also called cross-entropy minimization. We'll end up with the same formulas and equations; you can go through the GLM interpretation in the notes. It's a little messy to do on the whiteboard, whereas this has a nicer interpretation, and it's good to have this cross-entropy interpretation as well. [01:09:17] So here we are talking about multi-class classification. Let's assume we have three classes of data; let's call them circles, squares, and triangles. [01:09:40] Now, here this is x1 and this is x2; you're just visualizing the input space, and the output space y is implicit in the shape of each point. So in multi-class classification, our goal is to start from this data and learn a model that can, given a new data point, make a prediction of whether that point is a circle, a square, or a triangle. We're just looking at three classes because it's easy to visualize, but this can work over thousands of
classes. And so what we have is: you have x i's in R n, and the labels y are in {0, 1} to the K, where K is the number of classes. [01:11:00] The label y is what you would call a one-hot vector: a vector which indicates which class the x corresponds to. Each element in the vector corresponds to one of the classes, so this may correspond to the triangle class, this to the circle class, this to the square class, and so on. So the labels are in this one-hot representation, where we have a vector that's filled with zeros except for a one in one of the places. [01:11:41] And the way we're going to think of softmax regression is that each class has its own set of parameters. So we have a theta for each class, and there are K such vectors. [01:12:16] In logistic regression we had just one theta, which would do a binary yes-versus-no; in softmax we have one such vector of theta per class. You could also optionally represent them as a matrix, an n-by-K matrix, where each column is the theta of one class. So softmax regression is a generalization of logistic regression where you have a set of parameters per class, and we're going to do something similar to what we did before. [01:13:28] So, corresponding to each class of parameters, there exists a line. There is this line which represents, say, theta triangle transpose x equals zero, and anything to the left will have theta triangle transpose x greater than zero, and over here it will be less than zero. [01:14:13] Similarly, there is also this line, which corresponds to theta square transpose x equals zero;
anything below it will give a value greater than zero, and anything above will be less than zero. Similarly, you have another one: this corresponds to theta circle transpose x equals zero, and in this half-plane it will be greater than zero, and to the left it will be less than zero. [01:14:47] So we have a different set of parameters per class, which hopefully satisfies this property, and now our goal is to take these parameters and see what happens when we feed in a new example. [01:15:11] So given an example x, over here we have the classes: the circle class, the triangle class, the square class. And over here we plot theta class transpose x for each class, so we may get something that looks like this. Let's say for a new point x over here, we would have theta square transpose x be positive, and maybe for the others we may have some negative values, something like this. [01:16:11] This space is also called the logit space. These are real numbers; this is not a value between 0 and 1, this is between minus infinity and plus infinity. And our goal is to get a probability distribution over the classes. In order to do that, we perform a few steps. [01:16:36] First we exponentiate the logits, which gives us e to the theta class transpose x, and this will make everything positive: for the squares, the triangles, and the circles. [01:17:01] Now we've got a set of positive numbers, and next we normalize them. By normalize I mean divide everything by the sum of all of them. So here we have e to the theta class transpose x, divided by the sum over i in {triangle, square, circle} of e to the theta i transpose x. So once we do this operation, we now get a probability distribution.
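These two steps, exponentiate and then normalize, are exactly the softmax function. A minimal sketch, where the parameter matrix and the example point are made up:

```python
import numpy as np

def softmax(logits):
    # Exponentiate (everything becomes positive), then normalize so the
    # values sum to one. Shifting by the max doesn't change the result
    # but avoids overflow for large logits.
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

# Hypothetical per-class parameters: an n-by-K matrix, one column per class
# (here n = 2 features, K = 3 classes: triangle, square, circle).
Theta = np.array([[1.0, -1.0,  0.0],
                  [0.5,  2.0, -1.5]])

x = np.array([0.8, -0.4])          # a new example
logits = Theta.T @ x               # theta_class^T x for each class: real numbers
probs = softmax(logits)            # a probability distribution over the classes

print(probs, probs.sum())          # entries are positive and sum to 1
```

Note that softmax is monotonic in the logits, so the class with the largest theta class transpose x also gets the largest probability.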
The sum of the heights will add up to one. [01:17:55] So if we are given a new point x and we run it through this pipeline, we get a probability output over the classes: for which class is that example most likely to belong? Let's call this whole thing p hat of y for the given x. This is like our hypothesis: the output of the hypothesis function will be this probability distribution. In the other cases, the output of the hypothesis function was generally a scalar, or a single probability; in this case it's outputting a probability distribution over all the classes. [01:18:38] And now the true y would look something like this. Let's say the point over there was a triangle, for whatever reason. If it was a triangle, then the p of y, which is also the label, can be thought of as a probability distribution which is one over the
correct class and zero elsewhere. [01:19:10] So this p of y is essentially representing the one-hot representation as a probability distribution. Now the goal, or the learning approach, is to in a way minimize the distance between these two distributions: this is one distribution, that is another distribution, and we want to change this distribution to look like that distribution. Technically, the term for that is: minimize the cross-entropy between the two distributions. [01:19:55] So the cross-entropy between p and p hat is equal to minus the sum, over y in {triangle, square, circle}, of p of y times log of p hat of y. I don't think we will have time to go over the interpretation of cross-entropy, but here we see that p of y will be 1 for just one of the classes and 0 for the others. So let's say in this example p of y says it's a triangle; then this will essentially boil down to minus log of p hat of triangle.
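That cross-entropy computation is short enough to write out directly; a sketch with made-up numbers, where the true class is taken to be the triangle:

```python
import numpy as np

def cross_entropy(p, p_hat):
    # CE(p, p_hat) = -sum over y of p(y) * log p_hat(y).
    # When p is one-hot, only the correct class survives the sum,
    # so this boils down to -log p_hat(correct class).
    return -np.sum(p * np.log(p_hat))

# Suppose the class order is [triangle, square, circle] and the true label
# is "triangle", encoded as the one-hot distribution p.
p = np.array([1.0, 0.0, 0.0])
p_hat = np.array([0.7, 0.2, 0.1])      # hypothetical model output (softmax probs)

loss = cross_entropy(p, p_hat)
print(loss)                            # equals -log(0.7), about 0.357
```

Pushing p hat of the correct class toward 1 drives this loss toward 0, which is what gradient descent on the parameters does.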
[01:20:57] And we saw that this hypothesis, p hat, is essentially the softmax output. [01:21:24] And on this, you treat this as the loss and do gradient descent with respect to the parameters. [01:21:42] Yeah, with that, any questions on softmax? [01:21:53] Okay, so we will break for today then. Thanks.

================================================================================
LECTURE 005
================================================================================
Lecture 5 - GDA & Naive Bayes | Stanford CS229: Machine Learning Andrew Ng (Autumn 2018)
Source: https://www.youtube.com/watch?v=nt63k3bfXS0
---
Transcript

[00:00:04] Hey, morning everyone. Welcome back. [00:00:07] So last week you heard about logistic regression and generalized linear models, and it turns out all of the learning algorithms we've been learning about so far are called discriminative learning algorithms, which is one big bucket of learning algorithms. Today what I'd like to do is share with you how generative learning algorithms work; in particular, you'll learn about Gaussian discriminant analysis. So by the end of
the day, you'll know how to implement this. And it turns out that, compared to say logistic regression for classification, GDA is actually a simpler and maybe more computationally efficient algorithm to implement in some cases. [00:00:50] It sometimes works better if you have very small datasets, with some caveats, and we'll talk about the comparison between generative learning algorithms, which is the new class of algorithms you'll hear about today, versus the discriminative learning algorithms. And then we'll talk about Naive Bayes and how you could use that to build a spam filter, for example. [00:01:12] Okay, so we'll use binary classification as the motivating example for today. If you have a dataset that looks like this, with two classes, then what a discriminative learning algorithm like logistic regression would do is use gradient descent to search for a line that separates the positive and negative examples. [00:01:34] So if you randomly initialize the parameters, maybe it starts with some decision boundary like that, and over the course of gradient descent the line migrates, or evolves, until you get maybe a line like that, which separates the positive and negative examples. Logistic regression is really searching for a line, searching for a decision boundary, that separates the positive and negative examples. [00:02:00] So if this was the malignant tumors and benign tumors example, that's what logistic regression would do. [00:02:10] Now, there's a different class of algorithms which isn't searching for this separation, which isn't trying to maximize the likelihood the way you saw last week. Here's an alternative, called a generative learning algorithm: rather than looking at the two classes and trying to find a separation, the algorithm is going to look at the classes one at a time. [00:02:30] First we'll look at all of the malignant tumors, in the cancer example, and try to build a model for what malignant tumors look like. You might say, oh, it looks like all the malignant tumors roughly live in that ellipse. And then you look at all the benign tumors in isolation and say, oh, it looks like all the benign tumors roughly live in that ellipse. [00:02:58] Then at classification time, if there's a new patient in your office with those features, it would look at this new patient, compare them to the malignant tumor model, compare them to the benign tumor model, and then say, in this case: oh, this one looks a lot more like the benign tumors I had previously seen, so I'm going to classify that as a benign tumor.
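This "model each class on its own, then see which model a new example matches better" idea can be sketched in a toy one-dimensional version. All numbers here are invented: each class-conditional distribution is modeled as a single Gaussian, and each class is weighted by how often it appears in the training set (its prior probability):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy 1-D data: "benign" features near 1.0, "malignant" features near 3.0.
benign = rng.normal(1.0, 0.5, size=300)
malignant = rng.normal(3.0, 0.5, size=100)

# Model each class in isolation: fit a Gaussian to that class alone...
models = {
    0: (benign.mean(), benign.std()),        # y = 0: benign
    1: (malignant.mean(), malignant.std()),  # y = 1: malignant
}
# ...and record how common each class is (its prior probability).
prior = {0: 300 / 400, 1: 100 / 400}

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def predict(x):
    # Compare the new example against each class model, weighted by the
    # prior, and pick whichever class scores higher.
    scores = {y: gaussian_pdf(x, *models[y]) * prior[y] for y in (0, 1)}
    return max(scores, key=scores.get)

print(predict(0.9), predict(3.2))   # prints: 0 1
```

A feature near 1.0 matches the benign model far better, so it is classified benign; a feature near 3.0 matches the malignant model better.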
[00:03:25] okay so um rather than uh looking at both [00:03:28] um rather than uh looking at both classes simultaneous and searching for a [00:03:30] classes simultaneous and searching for a way to separate them a generative [00:03:33] way to separate them a generative learning algorithm uh instead builds a [00:03:35] learning algorithm uh instead builds a model of what each of the classes looks [00:03:38] model of what each of the classes looks like kind of almost in isolation with [00:03:40] like kind of almost in isolation with some details we'll learn about later and [00:03:42] some details we'll learn about later and then at test time uh it evaluates a new [00:03:45] then at test time uh it evaluates a new example against the benign model [00:03:47] example against the benign model evaluates against the malignant model [00:03:49] evaluates against the malignant model and tries to see which of the two models [00:03:51] and tries to see which of the two models it matches more closely against so let's [00:03:55] it matches more closely against so let's formalize this um a discriminative [00:04:00] formalize this um a discriminative learning [00:04:01] learning algorithm [00:04:03] algorithm learns P of [00:04:07] Y given X right [00:04:11] Y given X right um [00:04:13] um or uh or or it learns [00:04:21] um right some [00:04:24] um right some mapping from X to Y directly you know [00:04:27] mapping from X to Y directly you know learn or you can learn I think on and [00:04:29] learn or you can learn I think on and brief talked about the perception Al we [00:04:31] brief talked about the perception Al we talk about support vect machines later [00:04:33] talk about support vect machines later um we learns the function mapping from X [00:04:35] um we learns the function mapping from X to the labels directly so that's a [00:04:37] to the labels directly so that's a discriminative learning algorithm you're [00:04:38] discriminative learning algorithm you're trying to 
discriminate between positive and negative classes. [00:04:41] In contrast, a generative learning algorithm learns P(x|y). So this says: what are the features like, given the class? Right? So instead of P(y|x), we're going to learn P(x|y). In other words, given that the tumor is malignant, what are the features likely going to be like? Or given that the tumor is benign, what are the features x going to be like? Okay. [00:05:26] And then a generative learning algorithm will also learn P(y). This is also called the class prior, the prior probability, I guess, right? It's called the class prior. It's just: when a patient walks into your office, before you've even examined them, before you've even seen them, what are the odds that their tumor is malignant versus benign, right, before you see any features? [00:05:50] Okay, and so using Bayes' rule, if
you can build a model for P(x|y) and for P(y), um, if you can calculate numbers for both of these quantities, then using Bayes' rule, when you have a new test example with features x, you can calculate the chance of y being equal to one as P(y=1|x) = P(x|y=1) P(y=1) / P(x), right, where P(x) in the denominator is given by P(x|y=1) P(y=1) + P(x|y=0) P(y=0). Okay. [00:06:46] Um, and so if you've learned this term, P(x|y), then you can plug that in here, and if you've also learned this term, P(y), you can plug that in here, right? Um, and P(x) goes in the denominator. Okay, so if you've learned both of those terms, in the red square and in the orange square, you could plug them into all of those terms and therefore use Bayes' rule to calculate P(y=1|x). So given a new patient with features x, you could use this formula to calculate what's the chance that the tumor is malignant, if you've estimated, you know, these two quantities in the red and in the orange circles. Okay.
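As a quick sketch of that Bayes' rule computation (my own illustration, not code from the lecture; the one-dimensional class-conditional densities below are made up):

```python
import numpy as np

def posterior_y1(x, p_x_given_y, p_y1):
    """P(y=1 | x) by Bayes' rule: p(x|y=1) P(y=1) / p(x),
    where p(x) = p(x|y=1) P(y=1) + p(x|y=0) P(y=0)."""
    num = p_x_given_y(x, 1) * p_y1
    p_x = num + p_x_given_y(x, 0) * (1.0 - p_y1)
    return num / p_x

# Toy 1-D class-conditional densities: unit-variance Gaussians at +2 and -2.
def density(x, y):
    mu = 2.0 if y == 1 else -2.0
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

p = posterior_y1(0.0, density, 0.5)  # x equidistant from both means -> 0.5
```

With a 50/50 prior and x exactly between the two class means, the posterior comes out to one half, as you'd expect; move x toward one mean and the posterior follows.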
So that's the framework we'll use to build generative learning algorithms, and in fact today you'll see two examples of generative learning algorithms: one for continuous-valued features, which you can use for things like the tumor classification, and one for discrete features, which you can use for building, like, an email spam filter, right? Or, I don't know, if you want to download Twitter things and see how positive or negative the sentiment on Twitter is, or something, right? So we'll have a natural language processing example later. [00:08:10] So, um, let's talk about Gaussian discriminant analysis, GDA. [00:08:34] Uh, let's develop this model assuming that the features x are continuous-valued, and when we develop generative learning algorithms I'm going to use x in R^n. So, you know, I'm going to drop
the x0 = 1 convention, right? So we're not going to need that extra x0 = 1, so x is now in R^n rather than R^(n+1). And the key assumption in Gaussian discriminant analysis is we're going to assume that P(x|y) is distributed Gaussian, right? In other words, conditioned on the tumor being malignant, the distribution of the features is Gaussian; you know, the features are like the size of the tumor, the cell adhesion, whatever features you use to measure a tumor. And conditioned on it being benign, the distribution is also Gaussian. [00:09:33] So, um, how many of you are familiar with a multivariate Gaussian? Raise your hand if you are. Like half of you? One third? No, two fifths. Okay, cool, all right. Oh, how many of you are familiar with a univariate, like a single-dimensional Gaussian? Okay, cool, almost everyone. All right, cool. So let me go through what is a multivariate Gaussian
distribution. So the Gaussian is this familiar bell-shaped curve, and a multivariate Gaussian is the generalization of this familiar bell-shaped curve over a one-dimensional random variable to multiple random variables at the same time, to vector-valued random variables rather than univariate random variables. So, um, if Z is distributed Gaussian with some mean vector mu and some covariance matrix Sigma, so if Z is in R^n, then mu would be in R^n as well, and Sigma, the covariance matrix, would be n by n. So if Z is two-dimensional, mu is two-dimensional and Sigma is two-by-two. And the expected value of Z is equal to, um, the mean, and the covariance of Z, well, if you're familiar with multivariate covariances, this is the formula, right: Cov(Z) = E[(Z - mu)(Z - mu)^T], which simplifies to E[Z Z^T] - mu mu^T, as shown in the lecture notes. [00:11:01] Sorry. And, uh, following a sometimes semi-standard convention, I'm sometimes going to omit the square brackets, so instead of
writing the expected value of Z, meaning the mean of Z, sometimes I'll just write it as EZ, right, and omit the square brackets to simplify the notation a little bit. Okay. Uh, and the derivation from this step to this step is given in the lecture notes. [00:11:26] And so the probability density function for the Gaussian looks like this: p(z) = 1 / ((2 pi)^(n/2) |Sigma|^(1/2)) * exp(-1/2 (z - mu)^T Sigma^(-1) (z - mu)). And this is one of those formulas that, I don't know, when you're implementing these algorithms you use over and over, but what I've seen for a lot of people is, well, very few people start their machine learning having memorized this formula; you just look it up every time you need it. I've used it so many times I seem to have it stored in my brain by now, but most people don't; when you use it enough, you end up memorizing it. [00:12:01] Uh, but let me show you some pictures of what this looks like, since I think that might be more useful. So the multivariate Gaussian density has two parameters, mu and Sigma.
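That density formula can be transcribed almost directly into NumPy. A minimal sketch (my own illustration, not the lecture's code):

```python
import numpy as np

def mvn_pdf(z, mu, Sigma):
    """Multivariate Gaussian density:
    p(z) = exp(-1/2 (z-mu)^T Sigma^{-1} (z-mu)) / ((2 pi)^{n/2} |Sigma|^{1/2}).
    """
    n = mu.shape[0]
    diff = z - mu
    quad = diff @ np.linalg.solve(Sigma, diff)  # (z-mu)^T Sigma^{-1} (z-mu)
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

# Standard 2-D Gaussian (mean zero, identity covariance) evaluated at its peak:
peak = mvn_pdf(np.zeros(2), np.zeros(2), np.eye(2))  # 1 / (2 pi)
```

Using `np.linalg.solve` instead of explicitly inverting Sigma is the usual numerically safer choice; the peak value at the mean is 1/(2 pi) for the 2-D standard Gaussian.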
These control the mean and the variance of this density, okay? So this is a picture of the Gaussian density; this is a two-dimensional Gaussian bump, and for now I've set the mean parameter to zero. So mu is a two-dimensional parameter, it's (0, 0), which is why this Gaussian bump is centered at zero. [00:12:41] And the covariance matrix Sigma is the identity matrix. So, uh, you have this standard, well, this is also called the standard Gaussian distribution, which means mean zero and covariance equal to the identity. [00:12:59] Now I'm going to take the covariance matrix and shrink it, right? So take the covariance matrix and multiply it by a number less than one; that should shrink the variance, reduce the variability of the distribution. If I do that, the density, um, the probability density function, becomes taller. This is a probability density function that always integrates to
one, right? The area under the curve, you know, is one. And so by reducing the covariance from the identity to 0.6 times the identity, it reduces the spread of the Gaussian density, but it also makes it taller as a result, because, you know, the area under the curve must integrate to one. [00:13:36] Now let's make it fatter: let's make the covariance two times the identity. Then you end up with a wider distribution, where the values of, um, I guess the axes here, this would be the Z1 and the Z2 axes, the two dimensions of the Gaussian density, right, it increases the variance of the density. So let's go back to the standard Gaussian, covariance equal to the identity. Now let's try fiddling around with the off-diagonal entries. Um, so right now the off-diagonal entries are zero, right? So in this Gaussian density the off-diagonal elements are 0. Let's increase that to 0.5 and see what happens. So if you do that, then the
Gaussian density, I hope you can see the change, right, goes from this round shape to this slightly narrower thing. Let's increase it further to 0.8; then the density ends up looking like that, um, where now it's more likely that Z1 and Z2 are positively correlated. Okay. [00:14:36] So let's go through all of these plots, um, but now looking at contours of these Gaussian densities instead of these 3D bumps. So, uh, this is the contours of the Gaussian density when the covariance matrix is the identity matrix, and apologies for the aspect ratio: these are supposed to be perfectly round circles, but the aspect ratio makes this look a little bit fatter; this is supposed to be perfectly round circles. [00:14:58] Um, and so, uh, when the covariance matrix is the identity matrix, you know, Z1 and Z2 are uncorrelated, and the contours of the Gaussian bump, of the Gaussian density, look like round circles. And if you
increase the off-diagonal, excuse me, then it looks like that; you increase it further to 0.8, it looks like that, okay, where now most of the probability mass, most of the probability density function, places value on Z1 and Z2 being positively correlated. Okay. [00:15:33] Um, next let's look at, uh, what happens if we set the off-diagonal elements to negative values, right? So, um, actually, what do you think will happen? Let's set the off-diagonals to negative 0.5, right? Oh wow, I'm seeing people making that head gesture, okay, cool, right, great. Right, so you endow the two random variables with negative correlation, so you end up with, um, this type of, uh, probability density function, right? Uh, and in contours it looks like this, okay, where it's now slanted the other way. So now Z1 and Z2 have a negative correlation, and that's the point, okay. All right. [00:16:14] So, so far we've been keeping the mean vector at zero and just varying the covariance matrix.
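A quick empirical check of what those plots show (my own illustration, assuming NumPy; not from the lecture): sample from a 2-D Gaussian and measure the correlation of Z1 and Z2 for positive and negative off-diagonal entries.

```python
import numpy as np

rng = np.random.default_rng(0)

def sampled_corr(off_diag, n=200_000):
    """Draw n samples from N(0, [[1, off_diag], [off_diag, 1]]) and
    return the empirical correlation between Z1 and Z2."""
    Sigma = np.array([[1.0, off_diag], [off_diag, 1.0]])
    z = rng.multivariate_normal(mean=np.zeros(2), cov=Sigma, size=n)
    return np.corrcoef(z[:, 0], z[:, 1])[0, 1]

pos = sampled_corr(0.8)   # contours slanted one way: strong positive correlation
neg = sampled_corr(-0.5)  # slanted the other way: negative correlation
```

The sign of the off-diagonal entry shows up directly as the sign of the empirical correlation, matching the slant of the contour plots.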
[00:16:20] [Student question.] Yeah, uh, yes, every covariance matrix is symmetric, yeah. [00:16:29] [Student question, partially inaudible.] Uh, should we think of the covariance matrix as interesting column vectors that point in interesting directions? Not really, um, let me think. Yeah, no, I think the covariance matrix is always symmetric, and so I would usually not look at single columns of the covariance matrix in isolation. Uh, when we talk about principal components analysis, we'll talk about the eigenvectors of the covariance matrix, which are the principal directions in which it points, but yeah, we'll get to that later. [00:17:11] Oh, yes, so the eigenvectors of the matrix point in the principal axes of the ellipse, uh, that's defined by the contours? Yeah. Cool. Okay. [00:17:19] Um, so this is a standard Gaussian with mean zero, so the Gaussian bump is centered at (0, 0) because mu is (0, 0). Uh, let's move mu around. So I'm going to move, you know, mu to (0, 1.5), so that moves the
position of the Gaussian density. Right, now let's move it to a different location; move it to (-1.5, -1). And so by varying the value of mu, you can also shift the center of the Gaussian density around. Okay, so I hope this gives you a sense of, um, as you vary the parameters, the mean and the covariance matrix of the 2D Gaussian density, the sorts of probability density functions you can get as a result of changing mu and Sigma. Okay. Um, any other questions about this? All right, cool. So [00:18:50] here is the GDA model, right. Um, and, uh, let's see. So, um, remember, for GDA we need to model P(x|y), right, instead of P(y|x). So I'm going to write this separately in two separate equations: P(x|y=0), so what's the, the density of the features if it's a benign tumor? Um, I'm going to assume it's Gaussian, so I'll just write out the formula for the Gaussian. Okay. [00:19:54] Um, and then similarly I'm going to
I'm going to assume that if it's a malignant tumor so [00:20:00] assume that if it's a malignant tumor so if Y is equal to one that the density of [00:20:03] if Y is equal to one that the density of the features is [00:20:05] the features is also [00:20:09] Gan okay and um I want to point out a [00:20:12] Gan okay and um I want to point out a couple things so the parameters of the [00:20:14] couple things so the parameters of the GDA [00:20:17] model are [00:20:19] model are mu0 [00:20:22] mu1 and sigma um and the reasons we're [00:20:26] mu1 and sigma um and the reasons we're going into a little bit we use the same [00:20:28] going into a little bit we use the same Sigma [00:20:29] Sigma for both [00:20:32] for both classes um but we'll use different means [00:20:35] classes um but we'll use different means zero and one okay uh and we can come [00:20:38] zero and one okay uh and we can come back to this later if you want you could [00:20:41] back to this later if you want you could use separate parameters you know Sigma [00:20:43] use separate parameters you know Sigma 0o and sigma one but that's not usually [00:20:45] 0o and sigma one but that's not usually done so we're going to assume that the [00:20:47] done so we're going to assume that the two gaussians for the positive and [00:20:48] two gaussians for the positive and negative classes have the same [00:20:50] negative classes have the same covariance Matrix but they they have [00:20:51] covariance Matrix but they they have different means uh you don't have to [00:20:53] different means uh you don't have to make this assumption but this is the way [00:20:55] make this assumption but this is the way it's most commonly done and we can talk [00:20:57] it's most commonly done and we can talk about the reason why why we tend to do [00:20:59] about the reason why why we tend to do that in a [00:21:00] that in a second um so this is a model for p of Y [00:21:04] second um so this is a model for p of Y given X the 
other thing we need to do is [00:21:07] given X the other thing we need to do is model P of Y uh so Y is just a newly [00:21:11] model P of Y uh so Y is just a newly random variable right it takes on you [00:21:13] random variable right it takes on you know the value zero or one and so I'm [00:21:16] know the value zero or one and so I'm going to write it like this 5 to the y * [00:21:20] going to write it like this 5 to the y * 1 - 5 to the [00:21:24] 1 - 5 to the 1- y okay um and you saw this kind of [00:21:29] 1- y okay um and you saw this kind of notation when we talked about logistic [00:21:32] notation when we talked about logistic regression but all this means is that um [00:21:35] regression but all this means is that um you know probity of Y be equal to one is [00:21:38] you know probity of Y be equal to one is equal to [00:21:39] equal to five right because Y is either zero or [00:21:41] five right because Y is either zero or one and so um this is the way of writing [00:21:45] one and so um this is the way of writing uh PRI yals 1 is equal to five okay and [00:21:49] uh PRI yals 1 is equal to five okay and you saw a similar exponentiation [00:21:51] you saw a similar exponentiation notation when we're talking about um [00:21:53] notation when we're talking about um logistic rection right one week ago last [00:21:57] logistic rection right one week ago last Monday and so the last parameter is five [00:22:01] Monday and so the last parameter is five so this is RN this is also RN this is r [00:22:07] so this is RN this is also RN this is r n byn and that's just a real number [00:22:10] n byn and that's just a real number between zero and [00:22:12] between zero and one [00:22:25] okay so um for for any let's see so if [00:22:30] okay so um for for any let's see so if you can fit mu0 mu1 Sigma and F to your [00:22:33] you can fit mu0 mu1 Sigma and F to your data then these parameters will [00:22:37] data then these parameters will Define p of x given y and p 
And so if at test time you have a new patient walk into your office and you need to compute this, then you can compute, right, these things in the red and the orange boxes; each of these is a number, and by plugging all these numbers into the formula you get a number out for P(y=1|x), and you can then predict, you know, malignant or benign tumor, right? [00:23:03] So let's talk about how to fit the parameters. So you have a training set. Um, as usual, let me write the training set like this: (x^(i), y^(i)) for i = 1 through m, right? This is the usual training set. Um, and what we're going to do in order to fit these parameters is maximize the joint likelihood. And in particular, um, let me define the likelihood of the parameters to be equal to the product from i = 1 through m of P(x^(i), y^(i)), you know, parameterized by, um, the parameters. Okay. [00:24:15] Um, and I'm just going to
drop the parameters here, right, to simplify the notation a little bit. Okay. And the big difference between, um, a generative learning algorithm like this compared to a discriminative learning algorithm is that the cost function you maximize is this joint likelihood, which is P(x, y), whereas for a discriminative learning algorithm we were maximizing, um, this other thing, the product over i of P(y^(i) | x^(i)), right, uh, which is sometimes also called the conditional likelihood. Okay. So the big difference between these two cost functions is that for logistic regression or linear regression or generalized linear models, um, you were trying to choose parameters theta that maximize P(y|x), but for generative learning algorithms we're going to try to choose parameters that maximize P(x and y), or P(x, y), right? Okay. [00:25:42] So, all right. [00:26:04] So if you use, um, maximum likelihood estimation, right, um, so you choose
the parameters phi, mu0, mu1, and Sigma that maximize the log likelihood, right, where this you define as, you know, the log of the likelihood that we defined out there. Um, and so, uh, we'll actually ask you to do this as a problem in the next homework. But the way you maximize this is: um, look at that formula for the likelihood, take logs, take derivatives of this thing, set the derivatives equal to zero, and then solve for the values of the parameters that maximize this whole thing. And I'll just tell you the answer you're supposed to get, uh, but you still have to do the derivation. [00:27:05] All right. Um, the value of phi that maximizes this is, you know, not that surprising. So, so phi is the estimate of the probability of y being equal to one, right? So what's the chance, when the next patient walks into your doctor's office, that they have a malignant tumor? And so the maximum likelihood estimate for phi is, um, just: of all of your training examples, what's the fraction with label y = 1?
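That estimate, the fraction of positive labels, is a one-liner. A minimal sketch (toy labels made up for illustration, assuming NumPy):

```python
import numpy as np

def fit_phi(y):
    """Maximum-likelihood estimate of phi:
    phi = (1/m) * sum_i 1{y^(i) = 1}, the fraction of examples labeled 1."""
    y = np.asarray(y)
    return np.sum(y == 1) / y.shape[0]

y_train = np.array([1, 0, 0, 1, 0, 0, 0, 1])  # 3 positives out of 8 examples
phi_hat = fit_phi(y_train)                    # 0.375
```

The boolean comparison `y == 1` is exactly the indicator function discussed next; summing it and dividing by m gives the coin-toss-bias estimate.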
estimate for phi [00:27:28] is, [00:27:30] um, just: of all of your training examples, what's the fraction with label [00:27:32] y equals 1? Right, so the maximum likelihood [00:27:35] estimate of the, uh, bias of a coin toss is just, [00:27:38] well, the count of the fraction of heads you [00:27:40] got, okay. So this is it. Um, and one other [00:27:43] way to write this is phi = (1/m) * sum from i = 1 [00:27:47] through m of the [00:27:54] indicator 1{y(i) = 1}, okay. [00:28:02] Right, um, let's see — so, did we talk about indicator [00:28:05] notation on Wednesday? [00:28:11] No? Okay. [00:28:13] Oh, so, um, uh, this notation is an [00:28:16] indicator function, uh, where, um, the indicator [00:28:20] 1{y(i) = 1} is, uh, returning zero or one [00:28:23] depending on whether the thing inside is [00:28:25] true, right. So this is indicator notation, [00:28:27] in which the indicator of a true [00:28:30] statement is equal to one and the indicator [00:28:33] of a false statement is equal to zero. So [00:28:35] that's another way of writing [00:28:37] this formula.
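As a quick aside, the indicator notation and the resulting estimate for phi translate directly into code. A minimal sketch — the labels below are made up for illustration, not from the lecture:

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0])   # hypothetical training labels

m = len(y)
# phi = (1/m) * sum over i of 1{y(i) = 1}: the indicator contributes
# a one for each true statement and a zero for each false one
phi = sum(1 if yi == 1 else 0 for yi in y) / m

# equivalently: the fraction of training examples with label y = 1
assert phi == np.mean(y == 1)
```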
[00:28:41] And then the maximum likelihood estimate for mu0 is this — um, I'll just write it out. [00:29:04] Okay. Um, and so — well, actually, if you, uh, [00:29:08] put aside the math for now, what do you [00:29:10] think is a likely estimate of the mean [00:29:12] of all of the, uh, features for the benign [00:29:15] tumors? Right, well, what you do is you [00:29:16] take all the benign tumors in your [00:29:18] training set and just take their average. [00:29:20] That seems like a very reasonable way: [00:29:22] just look at your training set, [00:29:23] look at all of the, um, [00:29:26] benign tumors — all the O's, I guess — and [00:29:29] then just take the mean of these, and [00:29:31] that, you know, seems like a pretty [00:29:32] reasonable way to estimate mu0, right: [00:29:35] look at all the negative examples [00:29:36] and average their features. So this is a [00:29:38] way of writing out that intuition. Um, so [00:29:41] the denominator is the sum from i equals 1 [00:29:43] through m of the indicator 1{y(i) = 0}, and [00:29:47] so the denominator will count up the [00:29:49] number of examples that have benign [00:29:52] tumors, right, because every time y(i) equals [00:29:54] zero
you get an extra one in this sum, um, [00:29:59] uh, and so the denominator ends up being [00:30:02] the total number of benign tumors in [00:30:05] your training set, okay. And the [00:30:09] numerator is, uh, the sum from i = 1 through m of the indicator that it's a [00:30:12] benign tumor, times [00:30:15] x(i). So the effect of that is, um, whenever [00:30:19] a tumor is benign it's one times the [00:30:23] features, and whenever an example is [00:30:26] malignant it's zero times the features, [00:30:29] and so the numerator is summing up all [00:30:31] the feature vectors for [00:30:34] all of the examples that are benign. [00:30:37] Does that make sense? Let me just write this [00:30:39] up: so the numerator is the sum of the feature [00:30:45] vectors [00:30:48] for, um, all the [00:30:52] examples with y equals zero, and the [00:30:55] denominator is the number of examples [00:31:02] with y equals zero, okay. And then if you [00:31:06] take this ratio — if you take this [00:31:08] fraction — then you're summing up all of [00:31:10] the feature vectors for the benign [00:31:11] tumors, divided by the total number of [00:31:13] benign tumors in the training set, and so [00:31:16] that's just the mean of the feature [00:31:17] vectors
of all of the benign [00:31:21] examples, [00:31:24] okay. Um, [00:31:38] and then, right, the maximum likelihood estimate for mu1 — no [00:31:41] surprises — is kind of what you'd [00:31:43] expect: sum up all of the positive [00:31:45] examples and divide by the total number [00:31:47] of positive examples, and get their mean. [00:31:49] So that's the estimate for mu1. Um, and then [00:31:54] I'll just write this out: [00:31:58] if you're familiar with, um, covariance [00:32:01] matrices, this formula might not surprise [00:32:04] you, but if you're less [00:32:07] familiar, [00:32:08] then I guess you can see the details in [00:32:11] the [00:32:20] homework, okay. Don't worry too much about [00:32:22] that — uh, you can unpack the details in [00:32:24] the lecture notes for the homeworks, okay. Um, [00:32:28] but the covariance matrix basically tries to, [00:32:31] you know, fit contours to the ellipse, [00:32:35] right, like we saw — so try to fit the Gaussians [00:32:38] to both of these, with these [00:32:39] corresponding means, where you want one [00:32:40] covariance matrix for both of these, okay. Um, [00:32:45] so the way [00:32:48] I motivated this was, you know, I said,
well, if you want to estimate the mean of [00:32:52] a coin toss, just count the fraction of [00:32:54] coin tosses that came up heads, uh, and [00:32:56] then it seems like for the means mu0 and [00:32:58] mu1 you should just look at these [00:32:59] examples and take the mean, right. So that [00:33:01] was the intuitive explanation of [00:33:02] how you get these formulas. But the [00:33:05] mathematically sound way to get these [00:33:07] formulas is not via this intuitive [00:33:09] argument that I just gave; it is instead to [00:33:11] look at the likelihood, uh, take logs to get [00:33:14] the log likelihood, take derivatives, set [00:33:16] the derivatives equal to zero, solve for all these [00:33:19] values, and prove more formally that [00:33:21] these are the actual values that [00:33:23] maximize this thing, right — by setting [00:33:25] the derivatives to zero and solving. So you can [00:33:27] see that for yourself, um, in the problem [00:33:30] sets. [00:33:32] Okay.
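Putting the four maximum likelihood estimates together, here is a minimal sketch of the fit. The function name `fit_gda` and the toy data are mine, not from the lecture; the phi, mu0, and mu1 formulas are the ones stated above, and the shared-covariance formula (each example centered on its own class mean) is the standard closed form whose derivation the lecture defers to the homework:

```python
import numpy as np

def fit_gda(X, y):
    """Maximum likelihood estimates for GDA with a shared covariance.

    X : (m, n) array of feature vectors; y : (m,) array of 0/1 labels.
    """
    m = X.shape[0]
    phi = np.mean(y == 1)              # fraction of examples with y = 1
    mu0 = X[y == 0].mean(axis=0)       # average of the y = 0 feature vectors
    mu1 = X[y == 1].mean(axis=0)       # average of the y = 1 feature vectors
    # shared covariance: center each example on its own class mean
    mus = np.where((y == 1)[:, None], mu1, mu0)
    Sigma = (X - mus).T @ (X - mus) / m
    return phi, mu0, mu1, Sigma
```

With a balanced toy set of two benign and two malignant examples, `phi` comes out to 0.5 and `mu0`, `mu1` are just the per-class averages, matching the intuition in the lecture.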
[00:33:43] So, all right. Um, [00:33:46] finally, having fit these parameters: um, if you want to make a [00:33:53] prediction — right, so given a new [00:33:56] patient, uh, how do you make a prediction for whether their tumor is [00:34:00] malignant or benign? Um, [00:34:05] so if you want to predict the most [00:34:07] likely class label, uh, you choose the max [00:34:11] over y of P(y given [00:34:16] x), right. Um, and by Bayes' rule this is the max [00:34:20] over y of P(x given y) P(y) divided [00:34:28] by P(x). Okay, now, um, I want to introduce [00:34:32] one more piece of notation, [00:34:35] which is — [00:34:37] uh, actually, how many [00:34:40] of you are familiar with the argmax [00:34:43] notation? Most of you — like, okay, two-thirds, [00:34:47] okay, cool. I'll go over this quickly. So [00:34:50] um, let's do an [00:34:52] example. So, [00:34:55] um, let's see — [00:35:00] all right. So, you know, the min over [00:35:05] z of (z - 5)^2 is equal to zero, because [00:35:11] the smallest possible value of (z - 5)^2 [00:35:13] is zero, right. And the [00:35:17] argmin over z of (z - 5)^2 [00:35:21] is equal to five, okay. So the min is [00:35:25] the smallest possible value attained by [00:35:27] the thing inside, and the argmin is the [00:35:30] value you need to plug in to achieve [00:35:32] that smallest
possible value, right. So, uh, [00:35:35] the prediction you actually want to make — [00:35:36] if you want to output a value for y, you [00:35:38] don't want to output a probability, right; [00:35:40] you want to say, well, what do I think is the [00:35:41] value of y. So you want to choose the [00:35:43] value of y that maximizes this — so [00:35:45] that's the argmax of this, and this would [00:35:47] be either zero or one, right. Um, so that's [00:35:50] equal to the argmax of that, and you notice [00:35:53] that, uh, this denominator is just a [00:35:55] constant, right — P [00:35:58] of x doesn't even have y appear in it; it's [00:36:00] just some positive number — and so this is [00:36:03] equal [00:36:04] to just the argmax over y of P(x given [00:36:09] y) times P(y). So when implementing, um — uh, [00:36:17] when making predictions with [00:36:18] a, uh, generative learning [00:36:21] algorithm, sometimes, to save on [00:36:22] computation, you don't bother to [00:36:24] calculate the denominator if all you [00:36:26] care about is making a prediction. But [00:36:28] if you actually need a probability, then [00:36:30] you have to normalize the [00:36:37] probability.
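The argmax prediction above can be sketched directly: compare the two unnormalized log posteriors and drop the common denominator P(x), exactly as described. The function names and parameter values are illustrative, not from the lecture:

```python
import numpy as np

def gaussian_log_density(x, mu, Sigma):
    """log N(x; mu, Sigma), up to a constant shared by both classes."""
    d = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (d @ np.linalg.solve(Sigma, d) + logdet)

def predict(x, phi, mu0, mu1, Sigma):
    """argmax over y of P(x|y) P(y): the denominator P(x) is never computed."""
    score0 = gaussian_log_density(x, mu0, Sigma) + np.log(1 - phi)
    score1 = gaussian_log_density(x, mu1, Sigma) + np.log(phi)
    return int(score1 > score0)
```

If an actual probability is needed rather than a label, the two class scores would have to be exponentiated and normalized so they sum to one, as the lecture notes.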
[00:36:42] Okay, [00:36:44] so [00:36:47] let's examine what the algorithm is [00:36:56] doing. All right, so let's look at the [00:36:58] same data set and, uh, compare and [00:37:01] contrast what a discriminative learning [00:37:03] algorithm versus a generative learning [00:37:04] algorithm will do on this data [00:37:07] set, right. [00:37:10] Um, here's an example with two features, x1 and x2, [00:37:14] and positive and negative examples. So [00:37:15] let's start with a discriminative [00:37:17] learning algorithm. Uh, let's say you [00:37:19] initialize the parameters randomly — [00:37:22] typically when you run logistic [00:37:23] regression I almost always initialize [00:37:25] the parameters to zero, but it [00:37:27] was more interesting, for [00:37:29] purposes of visualization, to start off with a random [00:37:31] line, I guess. And then if you run one [00:37:33] iteration of gradient descent on the [00:37:35] conditional likelihood, um, one iteration [00:37:38] of logistic regression moves the line there, [00:37:41] then two iterations, three [00:37:42] iterations, um, four iterations, and so on, and
after about 20 iterations it'll [00:37:49] converge to that pretty decent [00:37:51] decision boundary, okay. So that's [00:37:54] logistic regression really searching for a line that [00:37:56] separates the positive and negative [00:37:58] examples. How about the generative [00:38:00] learning algorithm? What it does is the [00:38:03] following: with, uh, Gaussian [00:38:06] discriminant analysis, what it would do is fit [00:38:10] Gaussians to the positive and negative [00:38:12] examples, right. And, and just one [00:38:15] technical detail: um, I described this as [00:38:17] if we look at the two classes separately; [00:38:20] because we use the same covariance matrix [00:38:22] Sigma for the positive and negative [00:38:23] classes, we actually don't quite look at [00:38:25] them totally separately, but we do fit [00:38:27] two Gaussian densities to the positive [00:38:29] and negative examples. Um, and then what [00:38:32] we do is, for each point, try to decide, uh, [00:38:36] what is its class label, using Bayes' rule, [00:38:38] using that formula. And it turns out that [00:38:41] this implies the following decision
boundary, right. So points to the upper [00:38:45] right of this decision boundary — that [00:38:48] straight line I just drew — are [00:38:50] closer to the negative class, so you end up [00:38:52] classifying them as negative examples, [00:38:54] and points to the lower left of that [00:38:56] line you end up classifying as [00:38:58] positive examples. And, um, uh, I've also [00:39:02] drawn in green here the decision [00:39:04] boundary for logistic regression. So, so, [00:39:06] so these two algorithms actually come up [00:39:08] with slightly different decision [00:39:10] boundaries, okay, but the way you arrive [00:39:13] at these two decision boundaries is a [00:39:14] little bit [00:39:17] different. So, [00:39:21] um, all right, let's go back to the — any [00:39:26] questions about this? Yeah? [00:39:41] Oh sure, yes, good question. So, um, why, [00:39:43] why do we use two separate means, mu0 and [00:39:46] mu1, and a single covariance matrix Sigma? Um, it [00:39:50] turns out that, um — uh, well, it turns out [00:39:53] that if you choose to build the model [00:39:55] this way, the decision boundary ends up [00:39:57] being linear, and so for a lot of
problems, if you want a linear decision [00:40:01] boundary — uh, uh, yeah. And it turns out [00:40:04] you could choose to use two separate, um, [00:40:07] covariance matrices, Sigma 0 and Sigma 1, and [00:40:10] that'll actually work okay, right — that [00:40:12] is actually a very reasonable thing to do as [00:40:13] well — but, uh, you roughly double the number of [00:40:15] parameters, and you end up with a [00:40:19] decision boundary that isn't linear [00:40:20] anymore. But it's actually not unreasonable to [00:40:23] do that as [00:40:25] well. Um, [00:40:28] now there's [00:40:52] one very interesting [00:40:58] property, um, about Gaussian discriminant analysis, and it turns out that — [00:41:01] uh, well, let's, let's [00:41:09] compare GDA to logistic [00:41:14] regression. And [00:41:16] um, for a fixed set of [00:41:24] parameters — right, so let's say you've [00:41:27] learned some set of [00:41:29] parameters — um, I'm going to do an [00:41:32] exercise where we're going to [00:41:38] plot P(y = 1 given [00:41:42] x), you know, parameterized by all these [00:41:47] things, right, as a function of [00:41:53] x, okay. Um, so I'm going to do this little [00:41:56] exercise in a second, but what this means is [00:41:59] the following.
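The answer above — a shared covariance matrix makes the decision boundary linear — can be checked numerically: with a single Sigma, the quadratic term in the log-odds cancels between the two classes, leaving something linear in x. A sketch with made-up parameters (the specific numbers are mine, not from the lecture):

```python
import numpy as np

# illustrative 2-D parameters
mu0, mu1 = np.array([0., 0.]), np.array([2., 1.])
Sigma = np.array([[1.0, 0.2], [0.2, 1.0]])   # one shared covariance
phi = 0.5

def log_odds(x, S0, S1):
    """log P(y=1|x) - log P(y=0|x), computed from the two Gaussian densities."""
    def logN(x, mu, S):
        d = x - mu
        _, logdet = np.linalg.slogdet(S)
        return -0.5 * (d @ np.linalg.solve(S, d) + logdet)
    return logN(x, mu1, S1) - logN(x, mu0, S0) + np.log(phi / (1 - phi))

# With S0 = S1 = Sigma, the x^T Sigma^{-1} x terms cancel, so the
# log-odds collapses to the linear form theta^T x + theta0:
theta = np.linalg.solve(Sigma, mu1 - mu0)
theta0 = (-0.5 * (mu1 @ np.linalg.solve(Sigma, mu1)
                  - mu0 @ np.linalg.solve(Sigma, mu0))
          + np.log(phi / (1 - phi)))
x = np.array([0.7, -0.3])
assert np.isclose(log_odds(x, Sigma, Sigma), theta @ x + theta0)
```

With two different covariance matrices the quadratic terms no longer cancel, which is why that variant gives a non-linear (quadratic) boundary at roughly double the parameter count.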
well this formula this is equal to P of x given y equals [00:42:07] of x given y equals one you know which is parameterized by [00:42:11] one you know which is parameterized by right well the various parameters time P [00:42:14] right well the various parameters time P of Y = 1 parameterized by 5 divided by P [00:42:19] of Y = 1 parameterized by 5 divided by P of X which depends on all the paramas I [00:42:21] of X which depends on all the paramas I guess [00:42:28] right so uh by base rule you know this [00:42:32] right so uh by base rule you know this formula is equal to this little thing [00:42:36] formula is equal to this little thing and uh just as we saw earlier I guess [00:42:39] and uh just as we saw earlier I guess right once you have fixed all the [00:42:41] right once you have fixed all the parameters that's just a number you [00:42:43] parameters that's just a number you compute by evaluating a gan [00:42:45] compute by evaluating a gan density [00:42:48] density um this is a b newly probability so [00:42:50] um this is a b newly probability so actually P of yal 1 parameterized by [00:42:52] actually P of yal 1 parameterized by five this is just equal to five is that [00:42:54] five this is just equal to five is that second term and you similarly calculate [00:42:56] second term and you similarly calculate the denominator but so for every value [00:42:58] the denominator but so for every value of x you can compute this ratio and thus [00:43:02] of x you can compute this ratio and thus get a number for the chance of yal to [00:43:04] get a number for the chance of yal to one given [00:43:07] one given X so I'm going go [00:43:11] through one example of uh what function [00:43:15] through one example of uh what function you get for p of yals 1 given X for what [00:43:19] you get for p of yals 1 given X for what function you get for this if you [00:43:20] function you get for this if you actually plot this for um different [00:43:24] actually plot this for 
values of x, okay. [00:43:28] So, [00:43:30] um, let's see — let's say you have just [00:43:32] one feature x, so x is, you know, a real number, [00:43:36] uh, and let's say that you have a few [00:43:40] negative examples there and a few [00:43:43] positive examples [00:43:45] there, right. So a simple data [00:43:49] set, okay. And let's see what Gaussian discriminant [00:43:53] analysis will do on this data set, um, [00:43:56] with just one feature — so that's why all [00:43:58] the data is positioned on a [00:43:59] 1D axis. [00:44:05] So let me map all this data onto an [00:44:12] x-axis — I just took this data and mapped [00:44:14] it down — and, um, if you fit a Gaussian to each [00:44:18] of these two data sets, then you end up [00:44:22] with, you [00:44:23] know, Gaussians as follows, where this bump on the [00:44:26] left is P(x given y = 0) and this bump [00:44:30] on the right is P(x [00:44:33] given y = 1), right. And, and again, [00:44:38] there's a technical detail that we set [00:44:40] the same variance for the two Gaussians, but [00:44:42] you know, you kind of model the Gaussian density — [00:44:44] what does class zero look like, what [00:44:46] does class one look like — with two Gaussian [00:44:48] bumps like this. Oh, and then, because the
data set is split 50/50, you know, P(y = [00:44:54] 1) is 0.5, right — so a one-half prior, [00:44:58] okay. Now let's go through that exercise [00:45:01] I described on the left, of trying to [00:45:03] plot P(y = 1 given x) for different [00:45:08] values of x. So the vertical axis here is [00:45:10] P(y = 1 given x) for different values of x. [00:45:15] So, um, let's pick a point far to the left [00:45:19] here, right. With this model, if, if you [00:45:23] actually calculate this ratio, you find [00:45:25] that, um, if you have a point here, it [00:45:28] almost certainly came from this Gaussian on [00:45:31] the left, right. If, if, if you have an [00:45:33] unlabeled example here, it almost certainly [00:45:35] came from the class-zero Gaussian, because the [00:45:39] chance of the other Gaussian generating an example [00:45:41] all the way to the left is almost zero, [00:45:43] right, and so the chance P(y = [00:45:45] 1 given x) is very small. So for a [00:45:48] point like that, you end up with a value, [00:45:50] you know, very close to [00:45:51] zero, right. Um, let's pick another point — [00:45:55] all right, how about this point, the midpoint?
[00:45:57] Well, if you get an example right [00:45:58] in the midpoint, you, you really have no [00:46:00] idea — you really can't tell: did this come [00:46:02] from the negative or the positive [00:46:03] Gaussian? Can't tell, right. So this is [00:46:05] really 50/50. So, I guess, if this is 0.5, [00:46:09] for that midpoint you would have P(y = [00:46:13] 1 given x) is [00:46:15] 0.5. Um, and then if you go to a point way to [00:46:18] the right — if you get an example way over here — [00:46:20] then you'll be pretty sure this came [00:46:21] from the positive examples, and so, you [00:46:24] know, you get a point like that, [00:46:27] right. Now, it turns out that if you [00:46:30] repeat this exercise, uh, sweeping from [00:46:33] left to right over many, many points on [00:46:35] the x-axis, you find that for points far [00:46:38] to the left, the chance of this coming [00:46:42] from, uh, the y = 1 class is very small, and [00:46:45] as you approach this [00:46:47] midpoint it increases to 0.5, and it [00:46:50] surpasses 0.5, and then beyond a certain [00:46:54] point it becomes very, very close to one, [00:46:58] right. And if you do
this exercise — and [00:46:59] actually, just for every point, you know, [00:47:01] for a dense grid on the x-axis, evaluate [00:47:04] this formula, which will give you a [00:47:06] number between zero and one — it's a [00:47:08] probability — and go ahead and plot, you [00:47:10] know, the values, you get a curve like [00:47:12] this. And it turns out that if you [00:47:14] connect up the dots, um, then this is [00:47:18] exactly a sigmoid function. The shape of [00:47:21] that turns out to be exactly the shape of a [00:47:23] sigmoid function, and you prove this in [00:47:25] the problem sets as well, [00:47:28] right. [00:47:31] Um, [00:47:37] so, [00:47:40] um, both logistic regression and Gaussian [00:47:44] discriminant analysis actually end up using a [00:47:48] sigmoid function to calculate, you know, P [00:47:51] (y = 1 given x) — or, or, or the [00:47:53] outcome ends up being a sigmoid function; [00:47:54] I guess the mechanics is that you actually [00:47:56] use this calculation rather than [00:47:59] computing a sigmoid function, right. But, um, the [00:48:02] specific choice of the parameters they [00:48:04] end up choosing is quite different.
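The claim that the curve traced out is exactly a sigmoid (proved in the problem sets) can be verified numerically in the 1-D setting above: with a shared variance, the Bayes'-rule posterior matches a logistic function of x term for term. The parameter values below are made up for illustration:

```python
import numpy as np

# 1-D sketch: two Gaussians with a shared variance and a 50/50 prior
mu0, mu1, sigma2, phi = -1.0, 1.0, 1.0, 0.5

def posterior(x):
    """p(y = 1 | x) computed the generative way, via Bayes' rule."""
    p_x_y0 = np.exp(-(x - mu0) ** 2 / (2 * sigma2))   # shared constants cancel
    p_x_y1 = np.exp(-(x - mu1) ** 2 / (2 * sigma2))
    return p_x_y1 * phi / (p_x_y1 * phi + p_x_y0 * (1 - phi))

# With a shared variance, the same posterior is exactly logistic in x:
# p(y=1|x) = 1 / (1 + exp(-(theta1 * x + theta0)))
theta1 = (mu1 - mu0) / sigma2
theta0 = (mu0 ** 2 - mu1 ** 2) / (2 * sigma2) + np.log(phi / (1 - phi))

def sigmoid_form(x):
    return 1.0 / (1.0 + np.exp(-(theta1 * x + theta0)))
```

Sweeping a dense grid of x values, `posterior` and `sigmoid_form` agree everywhere, which is the picture the lecture draws on the board: the two approaches produce the same functional form but pick their parameters differently.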
[00:48:06] And you saw, when I was projecting the results [00:48:07] on the display just now in [00:48:10] PowerPoint, uh, that the two algorithms [00:48:12] actually come up with two different [00:48:14] decision boundaries. So, um, let's discuss when a [00:48:18] generative algorithm like GDA is [00:48:20] superior and when a discriminative [00:48:22] algorithm like logistic regression is [00:48:24] superior. [00:48:28] Um, let's [00:48:47] see. All [00:48:49] right, so GDA, Gaussian discriminant [00:48:53] analysis — so the generative approach — [00:48:58] this assumes that x given y = 0 [00:49:02] is Gaussian with mean mu0 and covariance [00:49:06] Sigma; it assumes x given y = 1 is [00:49:09] Gaussian with mean mu1 and covariance Sigma; and y [00:49:13] is [00:49:14] Bernoulli with, [00:49:18] um, parameter phi, right. And what logistic [00:49:22] regression does — [00:49:28] this is the discriminative [00:49:34] algorithm. Oh, some strange wind at the [00:49:38] back — is it? I see, okay, cool. All right, [00:49:43] yeah. [00:49:44] Boy — no, there was just a scary UN report on [00:49:48] global warming over the weekend; I hope [00:49:49] we don't already have storms [00:49:52] here. Okay, it's okay. Did you guys see the [00:49:55] UN report? It's slightly scary,
[00:49:57] actually, with the year it gives for global warming, but hopefully... all right, good, the hurricane has stopped. [00:50:08] Okay, let's see. So what logistic regression assumes is that p(y = 1 | x) is governed by the logistic function, right, so 1 / (1 + e^(-theta^T x)), with some details about x_0 = 1 and so on. [00:50:34] So, in other words, let's assume that p(y = 1 | x) is logistic. [00:50:45] And the argument I just described, plotting p(y = 1 | x) point by point to get the sigmoid curve I drew on the other board: what that illustrates, and it doesn't prove it, you prove it yourself in the homework problem, but what that illustrates is that this set of assumptions implies that p(y = 1 | x) is governed by a logistic function. [00:51:13] But
it turns out that the implication in the opposite direction is not true. So if you assume that p(y = 1 | x) is governed by a logistic function, by this shape, that does not in any way, shape, or form imply that x given y is Gaussian, that x given y = 0 is Gaussian and x given y = 1 is Gaussian. [00:51:38] So what this means is that GDA, the generative learning algorithm in this case, makes a stronger set of assumptions, and logistic regression makes a weaker set of assumptions, because you can prove the logistic regression assumption from the GDA assumptions. [00:52:08] And by the way, what you see in a lot of learning algorithms is that if you make stronger modeling assumptions, and your modeling assumptions are roughly correct, then your model will do better, because you're telling the algorithm more information. So if indeed x given y is Gaussian, then GDA will do better, because you're telling the algorithm that x given y
is Gaussian, and so it can be more efficient. So even if you have a very small data set, if these assumptions are roughly correct, then GDA will do better. And the problem with GDA is that if these assumptions turn out to be wrong, so if x given y is not at all Gaussian, then this might be a very bad set of assumptions to make: you might be trying to fit a Gaussian density to data that is not at all Gaussian, and then GDA would do more poorly. [00:53:03] Okay, so here's one fun fact, here's another example; I'll get to your question in a second. Let's say the following are true: x given y = 1 is Poisson with parameter lambda_1, x given y = 0 is Poisson with parameter lambda_0, and y, as before, is Bernoulli with parameter phi. [00:53:37] It turns out that this set of assumptions also implies that p(y = 1 | x) is logistic, okay, and you can prove this.
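The Poisson fun fact is easy to check numerically; the rates and prior below are made up for illustration:

```python
import numpy as np
from math import lgamma

# Sketch of the fun fact above, with made-up parameters: if
# x|y=1 ~ Poisson(lam1), x|y=0 ~ Poisson(lam0), and y ~ Bernoulli(phi),
# then p(y=1 | x) is a logistic function of the count x.
lam0, lam1, phi = 2.0, 5.0, 0.5

def poisson_pmf(ks, lam):
    """Poisson probabilities lam^k e^{-lam} / k! for an array of counts k."""
    log_fact = np.array([lgamma(k + 1) for k in ks])
    return np.exp(ks * np.log(lam) - lam - log_fact)

ks = np.arange(0, 40)
joint1 = poisson_pmf(ks, lam1) * phi
joint0 = poisson_pmf(ks, lam0) * (1 - phi)
posterior = joint1 / (joint0 + joint1)     # p(y=1 | x=k) by Bayes' rule

# The log-odds are linear in k: k*log(lam1/lam0) + (lam0 - lam1) + log(phi/(1-phi)),
# so the posterior is exactly logistic in the count k.
z = ks * np.log(lam1 / lam0) + (lam0 - lam1) + np.log(phi / (1 - phi))
assert np.allclose(posterior, 1 / (1 + np.exp(-z)))
```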
[00:53:54] And this is actually true for any generalized linear model, actually, where the difference between the two distributions varies only in the natural parameter, in the generalized-linear-model sense, of the exponential family distribution. [00:54:08] And so what this means is that if you don't know whether your data is Gaussian or Poisson, if you're using logistic regression you don't need to worry about it; it will work fine either way. So maybe you're fitting some model to some data, and you don't know: is the data Gaussian, is it Poisson, is it some other exponential family model? Maybe you just don't know. But if you're fitting logistic regression, it'll do fine under all of those scenarios. But if your data was actually Poisson and you assumed it was Gaussian, then your model might do quite poorly. [00:54:45] Okay, so the key high-level
principles to take away from this are: if you make weaker assumptions, as in logistic regression, then your algorithm will be more robust to modeling assumptions, such as accidentally assuming the data is Gaussian when it is not. But on the flip side, if you have a very small data set, then using a model that makes more assumptions will actually allow you to do better, because by making more assumptions you're telling the algorithm more truth about the world, which is, you know: hey, algorithm, the world is Gaussian. And if it is Gaussian, then it will actually do better. [00:55:26] Okay, question at the back, or a few questions. Go ahead. [00:55:38] Oh, uh, yeah: in practice, what fraction of data has the Gaussian property? You know, it's a matter of degree, right; most data in this universe is Gaussian, except for some data, I guess. [00:55:56] So I think it's actually a matter of degree, right: if you plot,
actually, if you take continuous-valued data, now, there are exceptions, you could plot it, and most data that you plot will not really be Gaussian, but a lot of it you can convince yourself is vaguely Gaussian. So I think a lot of it is a matter of degree. I'll actually tell you the way I choose to use these two algorithms. I think the whole world has moved toward using bigger data sets, right, the digital society, which means a lot of data, and so for a lot of problems we have a lot of data, and I would probably use logistic regression, because with more data you can overcome telling the algorithm less about the world. So the algorithm has two sources of knowledge: one source of knowledge is what you told it, the assumptions you told it to make, and the second source of knowledge is what it learns from the data. And in this era of big data, we have a lot of data, you
know, [00:56:49] there is a strong trend toward using logistic regression, which makes fewer assumptions, and just letting the algorithm figure out whatever it wants to figure out from the data. [00:56:56] Now, one practical reason why I still use algorithms like GDA, Gaussian discriminant analysis, is that it's actually quite computationally efficient. There's actually one use case that Landing AI has been working on where we just need to fit a ton of models and don't have the patience to run logistic regression over and over, and it turns out that computing means and covariance matrices is very efficient. So this is a benefit apart from the assumptions type of benefit, which is a general philosophical point about strong versus weak assumptions, a general principle in machine learning that we'll see again later in this course.
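That computational point, that fitting GDA amounts to a class frequency, two means, and one covariance computation, can be sketched as follows (the data here is synthetic, invented purely for illustration):

```python
import numpy as np

# Minimal sketch of why GDA is cheap to fit: the maximum-likelihood estimates
# are closed-form, with no iterative optimization at all.
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[-1.0, 0.0], scale=1.0, size=(200, 2))   # class y=0 samples
X1 = rng.normal(loc=[2.0, 1.0], scale=1.0, size=(300, 2))    # class y=1 samples
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 300)

phi = y.mean()                           # Bernoulli parameter, p(y=1)
mu0 = X[y == 0].mean(axis=0)             # mean of class 0
mu1 = X[y == 1].mean(axis=0)             # mean of class 1
centered = X - np.where((y == 1)[:, None], mu1, mu0)
Sigma = centered.T @ centered / len(y)   # shared covariance matrix

# That's it: one pass over the data, no gradient descent, no Newton's method.
```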
[00:57:34] We'll see it again in other places, but it's very concrete here: the other reason I tend to use GDA these days is less that I think it performs better from an accuracy point of view, but that there's actually a very efficient algorithm; you just compute the means and covariance and you're done, and there's no iterative process needed. So these days, when I use these models, it's motivated more by computation and less by performance. But this general principle is one that we'll come back to again later as we develop more sophisticated learning algorithms. [00:58:03] Yeah? [00:58:13] Oh, right: so what happens if the covariance matrices are different? It turns out that it still ends up being a logistic function, but with a bunch of quadratic terms in the logistic function, so it's not a linear decision boundary anymore. You can end up with a decision boundary, you know, that looks like this, right: positive
and negative examples separated by some other shape of line. You could actually, if you're curious, I encourage you to, you know, fire up Python and NumPy and play around with the parameters and plot this for yourself. [00:58:54] Question? Yeah: is there a recommended statistical test to see if the data is Gaussian? Um, I can tell you what's done in practice. I think in practice, if you have enough data to do a statistical test and gain conviction, you probably have enough data to just use logistic regression. Well, no, that's not really fair; I don't know about very high-dimensional data. I think what happens more often is that people just plot the data, and if it looks clearly non-Gaussian, then that would be a reason to not use GDA. But what happens often is that sometimes you just have a very small training set, and then it's just a matter of judgment, right?
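Going back to the different-covariances question for a moment, the NumPy experiment Ng suggests might look like this sketch (all parameters invented for illustration):

```python
import numpy as np

# Sketch: when the two class covariances differ, the log-odds of p(y=1 | x)
# pick up quadratic terms, so the decision boundary is curved, not a line.
mu0, mu1 = np.array([-1.0, 0.0]), np.array([1.0, 0.0])
S0 = np.eye(2)                                   # covariance for class 0
S1 = np.array([[3.0, 0.0], [0.0, 0.5]])          # a different covariance for class 1

def log_gauss(x, mu, S):
    """Log-density of N(mu, S) at x, dropping the shared 2*pi constant."""
    d = x - mu
    return -0.5 * d @ np.linalg.solve(S, d) - 0.5 * np.log(np.linalg.det(S))

def log_odds(x, S1_=S1):
    """log p(y=1|x) / p(y=0|x) with equal class priors (phi = 1/2)."""
    return log_gauss(x, mu1, S1_) - log_gauss(x, mu0, S0)

# A linear function has zero second differences along any straight line through
# input space; with different covariances the log-odds do not, so the boundary
# (the zero set of the log-odds) cannot be a straight line.
ts = np.linspace(-3.0, 3.0, 7)
vals = np.array([log_odds(np.array([t, t])) for t in ts])
assert not np.allclose(np.diff(vals, 2), 0)      # quadratic terms present

# With a shared covariance (S1_ = S0) the second differences vanish, i.e. the
# familiar linear boundary from the lecture.
shared = np.array([log_odds(np.array([t, t]), S1_=S0) for t in ts])
assert np.allclose(np.diff(shared, 2), 0)
```

Evaluating `log_odds` on a 2-D grid and contouring its zero level set shows the curved boundary directly.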
Like, if you have, I don't know, 50 examples of healthcare records, then you just have to ask some doctors: well, do you think the distribution is relatively Gaussian? And use domain knowledge like that. [00:59:47] I think, by the way, another philosophical point: I think that the machine learning world has, you know, a little bit overhyped big data. And yes, it's true that when you have more data it's great; I love data, and having more data pretty much never hurts, and usually the more data the better. So all that is true, and I think we did a good job telling people that high-level message, that more data almost always helps. But I think a lot of the skill in machine learning these days is getting your algorithms to work even when you don't have a million examples, even when you don't have 100 million examples. So there are lots of
machine learning applications where you just don't have a million examples; you have 100 examples, and then your skill in designing the learning algorithm matters much more. So if you take something like ImageNet, a million images, there are now dozens of teams, maybe hundreds of teams, I don't know, that can get great results when you have a million examples, right. And so the performance difference between teams, you know, there are now dozens of teams that get great performance with a million examples for image classification, like ImageNet. But if you have only 100 examples, then the highly skilled teams will actually do much, much better than the less skilled teams, whereas the performance gap is smaller when you have giant data sets, I think. And I think it is these types of intuitions, you know, what assumptions you use, generative or discriminative, that actually distinguish the highly skilled
teams from the less experienced teams, and drive a lot of the performance differences when you have small data. [01:01:17] Oh, and if someone goes to you and says, oh, you only have 100 examples, you can never do anything, then, I don't know, if it's a competitor saying that, I'll say, great, you know, don't do it, because I can make it work. Well, I don't know. But I think there are a lot of applications where your skill at designing a machine learning system really makes a bigger difference. It makes a difference for big data and for small data, but it's just very clear, when you don't have much data, that the assumptions you code into the algorithm, like is it Gaussian, is it Poisson, that skill allows you to drive much bigger performance than a lower-skill team would be able to. [01:01:53] All right, let me just take a question. Go
[01:02:09] ahead. Oh, sure, so what's the general statement of this? Yes: if x given y = 1 comes from an exponential family distribution, and x given y = 0 comes from an exponential family distribution, the same exponential family distribution, and if they vary only by the natural parameter of the exponential family distribution, then p(y = 1 | x) will be logistic. Yeah, I think this was once a midterm or homework problem to prove, actually. All right, let me actually just take one last question, and then we'll move on. Go ahead. [01:02:44] Oh, does the performance improvement hold even as you increase the number of classes? [01:02:52] Uh, I think so, yes. The generalization of this would be softmax regression, which I didn't talk about, but yes, I think a similar thing holds true for GDA with multiple classes. So far we've only talked about binary classification; what if we have more than two
classes? But, yes, similar things hold true for, like, a GDA with three classes and softmax. Yeah, oh yes, right, you saw softmax the other day. [01:03:20] Cool. And this theme, that when you have less data the algorithm needs to rely more on the assumptions you code in, this is a recurring theme that we'll come back to as well. This is one of the important principles of machine learning: when you have less data, your skill at coding in your knowledge matters much more. This is a theme we'll come back to when we talk about much more complicated learning algorithms as well. [01:03:49] All right, I want a fresh board for this. [01:04:00] So you've seen GDA in the context of continuous-valued features x. The last thing I want to do today is talk about one more generative learning algorithm called naive Bayes, and I'm going to use email spam classification as a motivating example. But this, I guess, is our
first foray into natural language processing, right. Given a piece of text, like a piece of email, can you classify it as spam or not spam? Or, for other examples, actually, several years ago eBay had this problem: if someone's trying to sell something, they write a text description, right, hey, I have a secondhand item, you know, that I'm trying to sell on eBay. How do you take that text description someone wrote and categorize it: is it an electronics thing, are they trying to sell a TV, are they trying to sell clothing? These examples are text classification problems: you have a piece of text, and you want to classify it into one of two categories, spam or non-spam, or into one of maybe thousands of categories if you're trying to take a product description and classify it into one of the classes. [01:05:07] And so the first question we will have is, given an email
classification problem, how do you represent an email as a feature vector? [01:05:27] And so in naive Bayes, what we're going to do is take your email, a piece of email, and first map it to a feature vector x, and we'll do so as follows. First, let's start with the English dictionary and make a list of all the words in the English dictionary, right. So the first word in the English dictionary is "a", the second word is "aardvark", the third word is "aardwolf"; look it up. And then, you know, in email spam a lot of people ask you to buy stuff, so there would be "buy". And then the last word in my dictionary is "zymurgy", which is the technical chemistry term that refers to the fermentation process in brewing. [01:06:23] So again, this is a useful way to think about it, but in practice what you do is not actually look at
[01:06:29] the dictionary, but look at the top 10,000 words in your training set, right. So maybe you have 10,000; it's easier to think about it as if it were a dictionary, but in practice the dictionary has too many words, so the other way to do it is to look through your own email corpus and just find the top 10,000 occurring words, and use that as the feature set. And so, you know, in your emails I guess you're getting a bunch of email from us, or maybe others, about cs229, so cs229 might appear in your dictionary if you're building an email spam filter for yourself, even if it doesn't appear in the official, what is it, like the Oxford dictionary, just yet. One way or another we'll get CS229 in there someday. [01:07:07] All right, and so given an email, what we would like to do is take this piece of text and represent it as this feature vector,
[01:07:21] So one way to do this is to create a binary feature vector that puts a one if a word appears in the email and puts a zero if it doesn't. So if you get an email that, you know, asks you to buy some stuff, and the word "a" appears in the email, you put a one there; they're not trying to sell you an aardvark or an aardwolf, so zeros there; a one for "buy"; and so on. So you take an email and turn it into a binary feature vector. [01:07:56] And so here the feature vector lives in {0, 1}^n, because it's an n-dimensional binary feature vector, where for the purpose of illustration let's say n is 10,000, because you take the top 10,000 words that appear in your email training set as the dictionary that you will use. [01:08:29] So in other words, x_i is the indicator 1{word i appears in the email}; it's either zero or one depending on whether or not word i from this list appears in your email.
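As a concrete sketch of this featurization (the helper names and the tiny corpus here are made up for illustration; the lecture only describes the idea):

```python
import re
from collections import Counter

def build_vocabulary(emails, n=10000):
    """Take the top-n most frequently occurring words across the
    training corpus as the dictionary, as described in the lecture."""
    counts = Counter()
    for text in emails:
        counts.update(re.findall(r"[a-z0-9]+", text.lower()))
    return [word for word, _ in counts.most_common(n)]

def featurize(email, vocabulary):
    """Binary feature vector: x_i = 1 if word i appears in the email."""
    words = set(re.findall(r"[a-z0-9]+", email.lower()))
    return [1 if word in words else 0 for word in vocabulary]

# Tiny illustration; a real corpus would have thousands of emails.
corpus = ["buy cheap meds now", "project meeting at noon", "buy now and save"]
vocab = build_vocabulary(corpus, n=5)
x = featurize("please buy now", vocab)   # ones for "buy" and "now"
```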
[01:08:49] Now, in the Naive Bayes algorithm, we're going to build a generative learning algorithm, and so we want to model P(x | y) as well as P(y). [01:09:13] But there are 2^10,000 possible values of x, because x is a binary vector that is 10,000-dimensional. So if we try to model P(x | y) in the straightforward way, as a multinomial distribution over the 2^10,000 possible outcomes, then you need 2^10,000 parameters, which is a lot; or actually, technically, you need 2^10,000 minus one parameters, because they have to add up to one, so you save one parameter. But modeling this without additional assumptions won't work, because of the excessive number of parameters. [01:10:00] So in the Naive Bayes algorithm, we're going to assume that the x_i's are conditionally independent given y.
[01:10:22] Okay, let me just write out what this means. By the chain rule of probability, P(x_1, ..., x_10,000 | y) is equal to P(x_1 | y) times P(x_2 | x_1, y) times P(x_3 | x_1, x_2, y), and so on, up to P(x_10,000 | x_1, ..., x_9,999, y). [01:10:59] I haven't made any assumptions yet; this is just a statement of fact, always true by the chain rule of probability. [01:11:06] And what we're going to assume, which is what this assumption is, is that this is equal to: the first term, no change, but then P(x_2 | y) times P(x_3 | y), and so on, up to P(x_10,000 | y). [01:11:33] This assumption is called a conditional independence assumption; it's also sometimes called the Naive Bayes assumption. You're assuming that, so long as you know y, the chance of seeing the word "aardvark" in your email does not depend on whether the word "a" appears in your email. And this is one of those assumptions that is definitely not a true assumption.
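Written out with n words (n = 10,000 here), the chain-rule expansion and the Naive Bayes assumption just described are:

```latex
% Chain rule (always true, no assumptions):
P(x_1, \dots, x_n \mid y)
  = P(x_1 \mid y)\, P(x_2 \mid x_1, y) \cdots P(x_n \mid x_1, \dots, x_{n-1}, y)

% Naive Bayes (conditional independence) assumption:
P(x_1, \dots, x_n \mid y) = \prod_{i=1}^{n} P(x_i \mid y)
```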
[01:11:56] This is just not a mathematically true assumption; it's like how sometimes your data isn't perfectly Gaussian, but you assume it's Gaussian and you can kind of get away with it. So this assumption is not true in a mathematical sense, but it may be not so horrible that you can't get away with it. [01:12:15] As an aside, if any of you are familiar with probabilistic graphical models, if you've taken CS228, this assumption is summarized in this picture; and if you haven't taken CS228, this picture won't make sense, but don't worry about it. [01:12:31] The point is that once you know the class label, spam or not spam, whether or not each word appears is independent. So this is called conditional independence. The mechanics of this assumption are really just captured by this equation, and you just use this equation; that's all you need to derive Naive Bayes.
[01:12:52] But the intuition is that if I tell you that this piece of email is spam, then whether the word "buy" appears in it doesn't affect your beliefs about whether the word "mortgage" or "discount" or whatever other spammy words appear. [01:13:07] So just to summarize, this is the product from i = 1 through n of P(x_i | y). [01:13:48] All right, so the parameters of this model: I'm going to write phi subscript "j given y = 1" for the probability that x_j = 1 given y = 1, and phi subscript "j given y = 0" for the same probability given y = 0. [01:14:25] And just to distinguish all these phis from each other, I'm going to call this one phi subscript y. So these parameters say: if y = 1 is spam and y = 0 is non-spam, then for a spam email, what's the chance of word j appearing in the email; for a non-spam email, what's the chance of word j appearing in the email; and then also the class prior: what's the prior probability that the next email you receive in your inbox is spam.
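In symbols, the parameters just introduced are:

```latex
\phi_{j \mid y=1} = P(x_j = 1 \mid y = 1), \qquad
\phi_{j \mid y=0} = P(x_j = 1 \mid y = 0), \qquad
\phi_y = P(y = 1)
```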
[01:15:00] And so, to fit the parameters of this model, you would, similar to Gaussian discriminant analysis, write out the joint likelihood: the joint likelihood of these parameters is the product over your training examples of the probability of each (x, y) pair given these parameters, similar to what we had for Gaussian discriminant analysis. [01:15:38] And the maximum likelihood estimates, if you take this, take logs, take derivatives, set them to zero, and solve for the values that maximize this, you find that the maximum likelihood estimates of the parameters are pretty much what you'd expect: phi_y is just the fraction of spam emails, and phi of "j given y = 1" is, well, I'll write this out in indicator function notation. Oh shoot, sorry. [01:16:45] Okay, so that's the indicator function notation for writing out: look through your training set, find all the spam emails, the examples with y = 1, and count what fraction of them had word j in them.
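In indicator function notation, with m training examples, the maximum likelihood estimates described here come out to:

```latex
\phi_y = \frac{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}}{m}, \qquad
\phi_{j \mid y=1} = \frac{\sum_{i=1}^{m} 1\{x_j^{(i)} = 1,\ y^{(i)} = 1\}}
                         {\sum_{i=1}^{m} 1\{y^{(i)} = 1\}}
```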
[01:16:58] So your estimate of the chance of word j appearing, say your estimated chance of the word "buy" appearing in a spam email, is just: of all the spam emails in your training set, what fraction of them contained the word "buy"; what fraction of them had x_j = 1 for, say, the word "buy"? [01:17:18] Um, and so it turns out that if you implement this algorithm, it will nearly work, I guess. But this is Naive Bayes for email spam classification. And it turns out that with one fix to this algorithm, which we'll talk about on Wednesday, this is actually a not-too-horrible spam classifier. It turns out that if you use logistic regression for spam classification, you do better than this almost all the time, but this is a very efficient algorithm.
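The counting just described can be sketched in a few lines (the toy data and function names are made up for illustration; no Laplace smoothing yet, so zero counts stay zero):

```python
def fit_naive_bayes(X, y):
    """Fit Bernoulli Naive Bayes by counting: the maximum likelihood
    estimates described in the lecture. X is a list of binary feature
    vectors, y a list of 0/1 labels (1 = spam)."""
    n = len(X[0])
    spam = [x for x, label in zip(X, y) if label == 1]
    ham = [x for x, label in zip(X, y) if label == 0]
    phi_y = len(spam) / len(X)                               # P(y = 1)
    phi1 = [sum(x[j] for x in spam) / len(spam) for j in range(n)]
    phi0 = [sum(x[j] for x in ham) / len(ham) for j in range(n)]
    return phi_y, phi1, phi0

# Toy data: four "emails" over a 3-word vocabulary.
X = [[1, 0, 1], [1, 1, 1], [0, 1, 0], [0, 0, 1]]
y = [1, 1, 0, 0]
phi_y, phi1, phi0 = fit_naive_bayes(X, y)
# phi_y = 0.5, phi1 = [1.0, 0.5, 1.0], phi0 = [0.0, 0.5, 0.5]
```

Note there is nothing iterative here: fitting is a single pass of counting.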
[01:17:55] Because estimating these parameters is just counting, and then computing probabilities is just multiplying a bunch of numbers, there's nothing iterative about this: you can fit this model very efficiently, and you can also keep updating the model as you get new data, even as, you know, users hit "mark as spam" or whatever; as you get new data, you can update this model very efficiently. [01:18:16] But it turns out that actually the biggest problem with this algorithm is what happens if you get zeros in some of these equations; we'll come back to that when we talk about Laplace smoothing on Wednesday. [01:18:30] Okay, all right, any quick questions before we wrap up? Okay, good. So now you've learned about generative learning algorithms. We'll come back on Wednesday and learn some more of the fine details of how to make this work.
So let's break; we'll see you on Wednesday.

================================================================================
LECTURE 006
================================================================================
Lecture 6 - Support Vector Machines | Stanford CS229: Machine Learning
Andrew Ng (Autumn 2018)
Source: https://www.youtube.com/watch?v=lDwow4aOrtg
---
Transcript

[00:00:03] All right, hey everyone, good morning, and welcome back. So what I'd like to do today is continue our discussion of Naive Bayes, and in particular describe how to use the Naive Bayes generative learning algorithm to build a spam classifier that will almost work. And so today you'll see how Laplace smoothing is one other idea you need to add to the Naive Bayes algorithm we described on Monday to really make it work for, say, email spam classification or for text classification. And then we'll talk about a different version of Naive Bayes that's even better than the one we've been discussing so far, and a little bit about advice for applying machine learning algorithms.
[00:00:49] This will be useful to you as you get started on your class projects as well; this is strategy for how to choose an algorithm, what to do first, what to do second. And then we'll start with an intro to support vector machines. [00:01:01] So to recap, the Naive Bayes algorithm is a generative learning algorithm in which, given a piece of email or a Twitter message or some piece of text, you go through a dictionary and put in zeros and ones depending on whether different words appear in a particular email, and so this becomes your feature representation for, say, an email that you're trying to classify as spam or non-spam. [00:01:32] So, using the indicator function notation, x_j: I've been trying to use a subscript j, not entirely consistently, to denote indexes into features, and i to index into training examples, but we'll see.
[00:01:50] x_j is an indicator for whether the word j appears in an email. And so to build a generative model for this, we need to model these two terms, P(x | y) and P(y). Gaussian discriminant analysis models these two terms with a Gaussian and a Bernoulli respectively, and Naive Bayes uses a different model: with Naive Bayes in particular, P(x | y) is modeled as a product of the conditional probabilities of the individual features given the class label y. [00:02:22] And so the parameters of the Naive Bayes model are: phi subscript y, the class prior, the chance that y equals one before you've seen any features; phi subscript "j given y = 0", which is the chance of that word appearing in a non-spam email; and phi subscript "j given y = 1", which is the chance of that word appearing in a spam email. [00:02:49] And so you can derive the maximum likelihood estimates.
[00:02:57] You will find that the maximum likelihood estimator for phi_y is just the fraction of training examples that were spam. [00:03:36] And this is just the indicator function notation way of writing: of all of your emails with label y = 0, count what fraction of them had this feature x_j, this word j, appear. [00:03:52] And then finally, at prediction time, you calculate P(y | x); this is computed according to Bayes' rule. [00:04:41] All right, so it turns out this algorithm will almost work, and here's where it breaks down. You know, actually, every year there are some CS229 machine learning students who do a class project, and some people end up submitting it to an academic conference; some CS229 class projects get submitted as conference papers pretty much every year. One of the top machine learning conferences is the conference NIPS; the name stands for Neural Information Processing Systems.
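The prediction step just mentioned, P(y | x) via Bayes' rule under the Naive Bayes factorization, can be sketched as follows (the parameter values here are made up for illustration):

```python
def predict_spam_probability(x, phi_y, phi1, phi0):
    """P(y = 1 | x) by Bayes' rule, with P(x | y) factored as a product
    of per-word Bernoulli terms (the Naive Bayes assumption)."""
    p_x_spam, p_x_ham = 1.0, 1.0
    for xj, p1, p0 in zip(x, phi1, phi0):
        p_x_spam *= p1 if xj else (1 - p1)
        p_x_ham *= p0 if xj else (1 - p0)
    numerator = p_x_spam * phi_y
    return numerator / (numerator + p_x_ham * (1 - phi_y))

# Made-up parameters over a 3-word vocabulary:
phi_y = 0.5
phi1 = [0.8, 0.5, 0.9]   # P(word j appears | spam)
phi0 = [0.1, 0.5, 0.2]   # P(word j appears | non-spam)
p = predict_spam_probability([1, 0, 1], phi_y, phi1, phi0)
```

Note that if any estimated phi is exactly 0, one factor zeroes out the whole product, which is precisely the failure mode discussed next.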
[00:05:18] And let's say that in your dictionary, you know, you have 10,000 words, and let's say that the word "nips" corresponds to word number 6017 in your 10,000-word dictionary. But up until now, presumably, you've not had a lot of emails from your friends asking, "hey, do you want to submit a paper to the NIPS conference or not?" And so if you use your current set of emails to find the maximum likelihood estimates of the parameters, you will probably estimate the probability of seeing this word, given that it's a spam email, as 0: it's 0 over the number of examples that you've labeled as spam email. [00:06:08] So if you train up this model using your personal email, probably none of the emails you've received in the last few months had the word "nips" in it, maybe.
[00:06:20] So if you plug in this formula for the maximum likelihood estimate, the numerator is 0, and so you estimate that this probability is 0; and then similarly, this one is also 0 over, you know, the number of non-spam emails. [00:06:38] That's what this formula says, and statistically it's just a bad idea to say that the chance of something is zero just because you haven't seen it yet. Where this will cause the Naive Bayes algorithm to break down is if you use these as estimates of the parameters: this is your estimated parameter phi subscript "6017 given y = 1", and this is phi subscript "6017 given y = 0". [00:07:11] And then you calculate this probability, which is equal to a product from i = 1 through n, where n is 10,000 if you have 10,000 words, of P(x_i | y). And so suppose you train your spam classifier on the email you've gotten up until today.
[00:07:39] Then after CS229, your project teammates start sending you email saying, hey, you know, we liked the class project; shall we submit this class project to the NIPS conference? The NIPS conference deadlines are, you know, sort of May or June of those years, so you finish the class project this December, work on it some more, pretty diligently, March, April of next year, and then maybe submit to the conference in May or June of 2019, and you start getting emails from your teammates: let's submit our paper to the NIPS conference. Then, when you start to see the word "nips" in your email, maybe in March of next year, this product of probabilities will have a 0 in it, and so this thing that I circled will evaluate to zero, because you're multiplying a lot of numbers, one of which is 0. And in the same way, this is also 0, and this is also 0, because there'll be that one term bringing down the whole product.
[00:08:38] And so what that means is, if you train the spam classifier today using all the data you have in your email inbox so far, and if tomorrow, or two months from now, whatever, the first time you get an email from your teammates that has the word "nips" in it, your spam classifier will estimate this probability as zero over zero plus zero, okay? [00:09:02] Now, apart from the divide-by-zero error, it turns out that this is just a bad idea, right, to estimate the probability of something as zero just because you have not seen it once yet. [00:09:20] So what I want to do is describe to you Laplace smoothing, which is a technique that helps address this problem. [00:09:28] And in order to motivate Laplace smoothing, let me use a different example for now.
[00:09:58] Several years ago, and this is older data now, I was tracking the progress of the Stanford football team. That year, on 9/12, the football team played Wake Forest, and I think these are all the away games we played that year, and we did not win that game. Then we played Oregon State, and we did not win that game either. [00:10:45] And the question is, these are all away games, almost all the away games we played that year, and so if you were, you know, Stanford football's biggest fan, and you followed them to every single out-of-state game and watched all these games, the question is: after this unfortunate streak, when you follow them to their next away game, what's your estimate of the chance of their winning or losing?
so let's say this is the variable x. You would estimate the probability of their winning by counting up the number of wins and dividing that by the number of wins plus the number of losses. And so in this case you'd estimate this as 0 divided by the number of wins, 0, plus the number of losses, 4, which is equal to 0. Okay, that's kind of mean, right? They lost four games, but to say that the chance of their winning is 0, that you're absolutely certain they'll lose, intuitively that's not a good idea. And so what Laplace smoothing does is imagine that we saw one more of each of the possible outcomes than we actually did: add one to the number of wins we actually saw, and also add one to the number of losses. So if you actually saw 0 wins, pretend you saw one; if we saw four losses, pretend you saw one more than you actually saw. And so Laplace smoothing
ends up adding 1 to the numerator and adding 2 to the denominator, and so this estimate ends up being 1/6. [00:12:43] And that's, well, maybe a more reasonable estimate of the chance of their winning or losing the next game. And there's actually a certain set of circumstances under which this is the optimal estimate, so I didn't just make this up. Laplace, you know, that ancient, well-known, very influential mathematician, was actually estimating the chance of the sun rising the next day. And his reasoning was: well, we've seen the sun rise a lot of times, but that doesn't mean we should be absolutely certain the sun will still rise tomorrow. We've seen the sun rise, say, 10,000 times, so we can be really confident the sun will rise again tomorrow, but maybe not absolutely certain.
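The 0/4 versus 1/6 calculation from the football example can be sketched in a few lines (the function names here are mine, not from the lecture):

```python
def max_likelihood(wins, losses):
    """Maximum likelihood estimate of P(win): wins / (wins + losses)."""
    return wins / (wins + losses)

def laplace(wins, losses):
    """Laplace-smoothed estimate: pretend we saw one extra win and one extra loss."""
    return (wins + 1) / (wins + losses + 2)

# Stanford lost all 4 out-of-state games that year:
print(max_likelihood(0, 4))  # 0.0 -- "absolutely certain they lose", too harsh
print(laplace(0, 4))         # 0.16666..., i.e. 1/6, a more reasonable estimate
```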
Maybe something will go wrong; who knows what will happen in this galaxy. And so he derived the optimal way of estimating the chance the sun will rise tomorrow, and this is actually an optimal estimate under, I'll state the assumption but we don't need to worry about it, the assumption that you are Bayesian with a uniform prior on the chance of the sun rising tomorrow. So if the chance of the sun rising tomorrow is uniformly distributed over the unit interval, anywhere from 0 to 1, then after this set of observations of this coin toss, of whether the sun rises, this is actually the Bayesian optimal estimate of the chance of the sun rising tomorrow. Okay, if you didn't understand what I said in the last 30 seconds, don't worry about it; this is taught in Bayesian statistics classes. But mechanically, what
[00:14:15] you should do is take this formula and add 1 to the number of counts you actually saw for each of the possible outcomes. And more generally, if you're estimating probabilities for a k-way random variable, then you estimate the chance of x being i to be

P(x = i) = ( sum_{j=1}^{m} 1{x^(j) = i} ) / m.

[00:15:02] So that's the maximum likelihood estimate, and for Laplace smoothing you'd add 1 to the numerator and add k to the denominator. So for naive Bayes, the way the smoothing modifies your parameter estimates, I'm just going to copy this over: that's the maximum likelihood estimate, and with Laplace smoothing you add 1 to the numerator and add 2 to the denominator, since each feature is binary and so has k = 2 possible values.
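The general k-way rule just described might be sketched like this (a toy helper; the name and outcome encoding are mine):

```python
from collections import Counter

def laplace_estimate(observations, k):
    """Laplace-smoothed estimate of a k-way random variable over outcomes 1..k:
    P(x = i) = (#{j : x_j = i} + 1) / (m + k)."""
    m = len(observations)
    counts = Counter(observations)
    return {i: (counts.get(i, 0) + 1) / (m + k) for i in range(1, k + 1)}

probs = laplace_estimate([1, 1, 2], k=3)
# Every outcome gets nonzero probability, even outcome 3, which was never seen:
print(probs)  # {1: 0.5, 2: 0.333..., 3: 0.1666...}
```

Note that the smoothed probabilities still sum to 1, since we added k to the denominator and 1 to each of the k numerators.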
And this means the estimated probabilities are never exactly zero or exactly one, which takes away that problem of getting zero over zero. And so the naive Bayes algorithm, yeah, it's not a great spam classifier, but it's not terrible either. And one nice thing about this algorithm is it's so simple: estimating the parameters is just counting, which can be done very efficiently, and then classification is just multiplying a bunch of probabilities together, so this is a very computationally efficient algorithm. All right, any questions about this? [00:16:43] [Student points out a typo] Oh sorry, this y? Oh yes, thank you, you're right. Oh, and by the way, I actually was following the football team that year. I love the football team; they're doing much better now than a few years ago. All right. [00:17:19] So in the example so far, the features were binary-valued.
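The "estimate by counting, classify by multiplying probabilities" recipe for the binary-feature model can be sketched as follows. This is a toy Bernoulli naive Bayes with Laplace smoothing; the function names and the two-feature data are illustrative, not from the lecture:

```python
import math

def train(X, y):
    """Fit Bernoulli naive Bayes by counting.
    X: list of binary feature vectors; y: list of 0/1 labels."""
    n = len(X[0])
    params = {}
    for c in (0, 1):
        docs = [x for x, label in zip(X, y) if label == c]
        # Laplace-smoothed P(x_j = 1 | y = c): (count + 1) / (#docs + 2)
        params[c] = [(sum(d[j] for d in docs) + 1) / (len(docs) + 2)
                     for j in range(n)]
    prior1 = sum(y) / len(y)
    return params, prior1

def predict(params, prior1, x):
    """Classify by comparing log P(x|y) P(y) for y = 0 and y = 1."""
    scores = {}
    for c, prior in ((0, 1 - prior1), (1, prior1)):
        log_p = math.log(prior)
        for j, phi in enumerate(params[c]):
            log_p += math.log(phi if x[j] == 1 else 1 - phi)
        scores[c] = log_p
    return max(scores, key=scores.get)

# Toy data: feature 0 ~ word "drugs" present, feature 1 ~ word "meeting" present.
X = [[1, 0], [1, 0], [0, 1], [0, 1]]
y = [1, 1, 0, 0]
params, prior1 = train(X, y)
print(predict(params, prior1, [1, 0]))  # 1 (spam-like)
```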
A quick generalization is when the features are multinomial-valued. Here's one example: we talked about predicting housing prices, right, that was our very first learning example. Let's say you have a classification problem instead, where you're listing a house and you want to estimate the chance that this house will be sold within the next 30 days, so it's a classification problem. If one of the features is the size of the house x, then one way to turn that feature into a discrete one is to choose a few buckets: say, the size is less than 400 square feet, versus 400 to 800, or 800 to 1200, or greater than 1200 square feet. Then you can set the feature x_i to one of four values. So that's how you discretize this continuous-valued feature into a discrete-valued feature. And if you want to apply naive Bayes to this problem, then the probability of x given y, this is just the same as before:
[00:18:52] the product from j = 1 through n of P(x_j given y), where now this can be a multinomial probability. Right, if x_j now takes on one of four values, say, then this can be estimated as a multinomial probability: instead of a Bernoulli distribution over two possible outcomes, this can be a probability mass function over four possible outcomes, if you discretize the size of a house into four values. And if you ever discretize variables, a typical rule of thumb in machine learning is that we often discretize variables into ten values, into 10 buckets; that just often seems to work well. I drew four here so I didn't have to write out 10 buckets, but if you're ever discretizing variables, you know, most people will start off with discretizing things into 10 values. [00:20:02] Right, and so this is how you can apply naive Bayes to other problems as well.
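The bucketing just described might look like this in code (the four thresholds are the lecture's; the general helper and its name are mine):

```python
import bisect

def discretize_size(sq_ft):
    """Map continuous house size to one of the four buckets from the lecture."""
    if sq_ft < 400:
        return 1
    elif sq_ft < 800:
        return 2
    elif sq_ft < 1200:
        return 3
    else:
        return 4

def discretize(value, thresholds):
    """General bucketing (e.g. for the 10-bucket rule of thumb):
    returns a bucket index in 1..len(thresholds)+1."""
    return bisect.bisect_right(thresholds, value) + 1

print(discretize_size(350))                # 1
print(discretize_size(1000))               # 3
print(discretize(1000, [400, 800, 1200]))  # 3, same bucketing via the helper
```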
For example, you can classify whether a house is likely to be sold in the next 30 days. Now, there's a different variation on naive Bayes that I want to describe to you, one that is actually much better for the specific problem of text classification. So the feature representation for x so far was the following: with a dictionary. Let's say you get an email, you know, a very spammy email that says "drugs buy drugs now". (Now, this is meant as an illustrative example; I'm not selling any drugs.) So if you have a dictionary of 10,000 words, then let's say "a" is word 1, I'm making up the positions just to make this example concrete; let's say that the word "buy" is word 800, "drugs" is the word 1600, and let's say "now" is the 6,200th word in your 10,000-word sorted dictionary. Then the representation for x will be, you know, 0, 0, and so on, with a 1 in each of those three positions.
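A sketch of building that 10,000-dimensional binary vector (the word positions are the made-up ones from the example):

```python
# Hypothetical positions in the lecture's 10,000-word sorted dictionary.
VOCAB = {"a": 1, "buy": 800, "drugs": 1600, "now": 6200}
VOCAB_SIZE = 10_000

def bernoulli_features(email, vocab, vocab_size):
    """Multivariate Bernoulli representation: x[k-1] = 1 iff dictionary word k appears."""
    x = [0] * vocab_size
    for word in email.lower().split():
        if word in vocab:
            x[vocab[word] - 1] = 1
    return x

x = bernoulli_features("drugs buy drugs now", VOCAB, VOCAB_SIZE)
print(sum(x))                    # 3 -- "drugs" is marked once despite appearing twice
print(x[799], x[1599], x[6199])  # 1 1 1
```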
Okay, now, one interesting thing about naive Bayes is that it throws away the fact that the word "drugs" appears twice, right? So that's losing a little bit of information. In this feature representation, each feature is either 0 or 1, and that's part of why it throws away the information that the word "drugs" appeared twice and maybe should be given more weight by your classifier. Um, there's a different representation, which is specific to text. And I think text data has the property that it can be very long or very short: you can have a five-word email or a 1,000-word email, and somehow you're taking very short or very long emails and mapping them all to a feature vector in the same way, to a feature vector that's always the same size. So here's a different representation for this email. For that email, "drugs buy drugs now", we're going to represent it as a
four-dimensional feature vector. More generally, this is going to be n-dimensional for an email of length n. So rather than a 10,000-dimensional feature vector, we now have a four-dimensional feature vector, but now each x_j is an index from 1 to 10,000, instead of just being 0 or 1. Okay, and I guess n varies by training example, so n_i is the length of email i: if it's a longer email, the feature vector x will be longer, and if it's a shorter email, this feature vector will be shorter. Okay, so let's give names to the algorithms we're going to develop. These are really very confusing, frankly horrible names, but this is what the community calls them. The model we've talked about so far is sometimes called the multivariate Bernoulli event model: Bernoulli means coin tosses, and multivariate means there are ten thousand Bernoulli random variables in this model.
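The variable-length representation just described can be sketched as follows (same made-up word positions as before):

```python
# Hypothetical positions in a sorted 10,000-word dictionary, as in the example.
VOCAB = {"a": 1, "buy": 800, "drugs": 1600, "now": 6200}

def multinomial_features(email, vocab):
    """Multinomial event model representation: one dictionary index per word,
    so the vector length n equals the email length."""
    return [vocab[w] for w in email.lower().split() if w in vocab]

x = multinomial_features("drugs buy drugs now", VOCAB)
print(x)       # [1600, 800, 1600, 6200] -- a four-dimensional vector
print(len(x))  # n = 4; a longer email gives a longer vector
```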
The word "event" in "event model" comes from statistics. And the new representation we're going to talk about is called the multinomial event model. These two names are frankly quite confusing, but these are the names; I think one of my friends, Andrew McCallum, as far as I know, wrote the paper that named these two algorithms, and these are the names people seem to use. [00:25:13] And so with this new model, we're going to build a generative model, because with a generative model we model P(x, y), which can be factored as P(x given y) times P(y). And using the naive Bayes assumption, we're going to assume that P(x given y) is the product from j = 1 through n of P(x_j given y), and then times P(y), that second term. Now, one of the reasons these two models frankly are actually very confusing to the
machine learning community is that this is exactly the equation that, you know, you saw on Monday, when we described naive Bayes for the first time; this factorization of P(x given y) into a product of probabilities is exactly the same, so this equation looks cosmetically identical. But with this new model, the second model, the confusingly named multinomial event model, the definition of x_j and the definition of n are very different, right? So instead of a product from 1 through 10,000, there's a product from 1 through the number of words in your email, and this is now a multinomial probability rather than a binary or Bernoulli probability. And it turns out that with this model, the parameters are: same as before, phi_y = P(y = 1), and also the other parameters of this model, phi_{k|y=0}, which is the chance of x_j = k given y = 0. Right, and
then, just to make sure you understand the notation, see if this makes sense: this probability is the chance of word ____ being ____, given y = 0. So what goes in those two blanks? Actually, what goes in the second blank? Let's see. [00:28:00] [Student answers] Yes, it says the chance of, say, the third word in the email being the word "drugs", or of the second word in the email being "buy", or whatever. And one part of what this model implicitly assumes, and why this is tricky, is that we assume this probability doesn't depend on j: that for every position in the email, the chance of the first word being "drugs" is the same as the chance of the second word, or the mth
word, being "drugs", which is why j doesn't actually appear on the left-hand side, right? Any questions about this? And so, given a new email, a test email, the way you would calculate this probability is by, you know, plugging these estimated parameters into this formula. [00:29:27] Oh, and then, I wrote down phi_y, and similarly you define the parameters both with y = 1 and with y = 0. And then for the maximum likelihood estimates of the parameters, I'll just write out one of them: your estimate of the chance of a given word, really any word in any position, being word k. What's the chance of some word in a non-spam email being the word "drugs", say? The chance of that is equal to

phi_{k|y=0} = ( sum_{i=1}^{m} 1{y^(i) = 0} * sum_{j=1}^{n_i} 1{x_j^(i) = k} ) / ( sum_{i=1}^{m} 1{y^(i) = 0} * n_i )

where this uses indicator function notation. Those sums look complex, so let me say in a second what this actually means. So the
denominator: [00:30:39] if you figure out what the English meaning of this complicated formula is, it basically says, look at all the words in all of your non-spam emails, all the emails with y = 0. And of all of those words, what fraction of those words is the word "drugs"? And that's your estimate of the chance of the word "drugs" appearing in any given position in a non-spam email. And so in the end, the denominator is a sum over your training set of the indicator that the email is not spam, times the number of words in that email; so the denominator ends up being the total number of words in all of your non-spam emails in your training set. And the numerator is a sum over your training set, from i equals 1 through m, of the indicator that y^(i) = 0, so you count only the terms for non-spam emails, and for
each non-spam email, j goes from 1 through n_i, over the words in that email, counting how many of those words are word k. Right, and so if in your training set you have, you know, a hundred thousand words in your non-spam emails, and 200 of them are the word "drugs", it occurs, you know, 200 times, then this ratio would be 200 over 100,000. [00:31:59] Oh, and then lastly, to implement Laplace smoothing with this, you would add 1 to the numerator as usual, and then, let's see, actually, what would you add to the denominator? [00:32:30] [Student: "k?"] Wait, but what is k? Not k, right; k is a variable here, it indexes into the words. [Student: "10,000?"] Oh, I see why you said k: I think I overloaded the notation. When defining Laplace smoothing, I used k as the number of possible outcomes, but here k is an index. Yeah, right, so you add 1 to the numerator and add the number of possible outcomes to the
denominator, which in this case is 10,000. So this is the probability of x being equal to the value k, where k ranges from 1 to 10,000, if you have a dictionary, a vocabulary list, of 10,000 words that you're modeling; and so the number of possible values for x is 10,000, and you'd add 10,000 to the denominator. [00:33:46] [Student: what do you do with words that aren't in the dictionary?] Oh, what do you do with words that aren't in the dictionary? So there are two approaches to that. One is to just throw it away: just ignore it, disregard it. That's one. The second approach is to take the rare words and map them to a special token, which traditionally is denoted UNK, for "unknown word". So if from your training set you decide to take just the top 10,000 words as your dictionary, then everything that's outside the top 10,000 words you can map to, you know, an unknown-word token, a special symbol.
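A sketch of the smoothed estimate of phi_{k|y=0} with an UNK token, using a tiny stand-in vocabulary (all names and the two toy emails here are illustrative, not from the lecture):

```python
from collections import Counter

UNK = 0                                    # special index for out-of-vocabulary words
VOCAB = {"buy": 1, "drugs": 2, "now": 3}   # tiny stand-in for the 10,000-word dictionary
K = len(VOCAB) + 1                         # number of possible outcomes, including UNK

def to_indices(email):
    """Multinomial event model representation, mapping rare words to UNK."""
    return [VOCAB.get(w, UNK) for w in email.lower().split()]

def estimate_phi(emails, labels, target_y=0):
    """Laplace-smoothed phi_{k|y=target}:
    (count of word k across class emails + 1) / (total words in class + K)."""
    counts, total = Counter(), 0
    for email, y in zip(emails, labels):
        if y == target_y:
            idx = to_indices(email)
            counts.update(idx)
            total += len(idx)
    return {k: (counts.get(k, 0) + 1) / (total + K) for k in range(K)}

emails = ["meeting now", "buy drugs now"]
labels = [0, 1]
phi0 = estimate_phi(emails, labels, target_y=0)
# The one non-spam email has 2 words: "meeting" -> UNK, "now" -> 3.
# phi0[2] ("drugs") = (0 + 1) / (2 + 4) = 1/6: nonzero despite never appearing.
```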
[00:34:19] Oh, why did I write that there? Oh, this is indicator function notation. So an indicator function, this notation, takes a statement that's either true or false and tells me whether it holds: the indicator of "2 equals 1 plus 1" is 1, because that statement is true, and the indicator of a false statement is 0. [00:34:58] Here it tells me whether y is zero, and since y is 0 or 1, the indicator of "y equals 0" is the same as "not y", which is 1 minus y.

[00:35:16] All right, so both event models, including the details of the maximum likelihood estimates, are written out in more detail in the lecture notes. Um, so, you know, when would you use the naive Bayes algorithm? It turns out naive Bayes is actually not very competitive with other learning algorithms: for most problems you'll find that logistic regression will do better in terms of delivering higher accuracy. But the advantages of naive Bayes are, first, it's computationally very efficient, and second, it's relatively quick to implement.
[00:35:57] It also doesn't require an iterative gradient descent procedure, and the number of lines of code needed to implement naive Bayes is relatively small. So if you're facing a problem where your goal is to implement something quick and dirty, then naive Bayes may be a reasonable choice.

[00:36:14] And I think, you know, as you work on your class projects, I think some of you, probably a minority, will try to invent a new machine learning algorithm and write a research paper, and I think, you know, inventing new machine learning algorithms is a great thing to do; it helps lots of people on a lot of different applications. But I think the majority of the class projects in 229 will try to apply a learning algorithm to a project that you care about: maybe apply it to a research project you're working on somewhere at Stanford, or apply it to a fun application you want to build, or apply it to a business application,
[00:36:53] for some of you taking this on SCPD, taking this remotely. And if your goal is not to invent a brand new learning algorithm but to take existing algorithms and apply them, then the rule of thumb I'd suggest to you is this: when you get started on a machine learning project, start by implementing something quick and dirty. [00:37:11] Instead of implementing the most complicated possible learning algorithm, start by implementing something quickly, train the algorithm, look at how it performs, and then use that to debug the algorithm and keep iterating. All right? So I think, you know, here at Stanford we're very good at coming up with very, very complicated algorithms. But if your goal is to make something work for an application, if your priority isn't inventing a new algorithm and publishing a paper on a new technical contribution, if your main goal is, say, you're working on an application on
[00:37:45] understanding news better, or improving the environment, or estimating prices, or whatever, and your primary objective is just to make an algorithm work, then rather than building a very complicated algorithm from the outset, I would recommend implementing something quickly, so that you can then better understand how it's performing, and then do error analysis, which we'll talk about later, and use that to drive your development.

[00:38:10] You know, one analogy I sometimes make is this: if you're writing a new computer program with 10,000 lines of code, one approach is to write all 10,000 lines of code first and then try compiling it for the first time, and that's clearly a bad idea. Instead, you know, you should write small modules, unit test them, and build up the program incrementally, rather than write
[00:38:43] 10,000 lines of code and only then see what syntax errors the compiler gives you for the first time. And I think it's similar for machine learning: instead of building a very complicated algorithm from the get-go, build a simpler algorithm, test it, and then use that, see what it's doing well and what it's doing wrong, to improve from there. You often end up getting to a better-performing algorithm faster.

[00:39:08] So here's one example. This is actually something I used to work on, on, you know, anti-spam; students worked on spam classification many years ago. And it turns out that when you start out on a new application problem, it's hard to know what the hardest part of the problem is. So if you want to build an anti-spam classifier, there are lots of things you could work on. For example, spammers would deliberately
[00:39:38] misspell words. You know, take "mortgage", right? "Refinance your mortgage" or whatever; instead of writing the word "mortgage", the spammers would write "m0rtgage" with a zero, or instead of the "a" maybe a slash. But all of us as people have no trouble reading this as the word "mortgage", whereas this would trip up a spam filter: it might map the word to an unknown-word token just because the filter hasn't seen it before, and that lets this word slip by the spam filter. [00:40:12] So that's one idea for improving spam detection, and students have actually written papers on mapping these back to words, so that the spam filter can see the words the way that humans see them. All right, so that's one idea.

[00:40:27] Another idea: a lot of spam emails spoof email headers. You know, spammers often try to hide where the email truly came from by spoofing the email header, the
[00:40:44] email address, the "From" information. And another thing you might do is try to fetch the URLs that are referred to in the email and then analyze the webpages that you get back. So there are a lot of things that you could do to improve a spam filter, and any one of these topics could easily be three months or six months of research. [00:41:05] But when you're building, say, a new spam filter for the first time, how do you actually know which of these is the best investment of your time? So my advice, to those of you who work on a project where your primary goal is just to get the system to work, is to not somewhat arbitrarily dive in and spend six months on improving this, or spend, you know, six months on trying to analyze email headers, but to just implement a more basic algorithm, implement something quick and dirty, and then look at the examples that your learning algorithm is still
[00:41:38] misclassifying. If, after you've implemented a quick and dirty algorithm, you find that your anti-spam algorithm is misclassifying a lot of examples with these deliberately misspelled words, it's only then that you have evidence that it's worth spending a bunch of time solving the deliberately-misspelled-words problem. But if you implement a spam filter and you see that it's not misclassifying a lot of examples with these misspelled words, then I would say don't bother: go work on something else instead, or at least treat that as a lower priority.

[00:42:07] So one of the uses of GDA, Gaussian discriminant analysis, as well as naive Bayes, is this: they're not going to be the most accurate algorithms. If you want the highest accuracy, there are other algorithms, like logistic regression, or support vector machines, or neural networks, which we'll
talk about later, and which will almost always give you higher classification accuracy than these algorithms. [00:42:30] But the advantage of Gaussian discriminant analysis and naive Bayes is that they're very quick to train. There's no iteration: naive Bayes is just counting, and GDA is just computing means and covariances. Right, so they're very computationally efficient, and they're also simple to implement, so they can help you implement that quick and dirty thing that helps you get going more quickly.

[00:42:54] And so I think for your projects as well, I would advise most of you, as you start working on your project, don't spend weeks designing exactly what you're going to do. If you have an idea for an applied project, instead get the dataset and apply something simple first: start with logistic regression, not a neural network,
[00:43:18] not something more complicated; or start with naive Bayes; and then see how that performs, and then go from there. Okay?

[00:43:27] All right, so that's it for naive Bayes and generative learning algorithms. The next thing I want to do is move on to a different type of classifier, which is the support vector machine. Let me check if there are any questions about this first.

[00:44:08] Oh wait, oh sorry. Oh, can you use logistic regression with discrete variables? [00:44:20] Oh, I see. Yeah, right, yes. So one of the weaknesses of the naive Bayes algorithm is that it treats all the words as, you know, completely separate from each other. So the words "one" and "two" are quite similar, and likewise the words "mother" and "father" are quite similar, but with this feature representation it doesn't know the relationships between these words. So in machine learning there are other ways of representing words: there's this technique called word
[00:44:57] embeddings, in which we choose a feature representation that encodes the fact that the words "one" and "two" are quite similar to each other, or the words "mother" and "father" are quite similar to each other, or, you know, the words, whatever, "London" and "Tokyo" are quite similar to each other because they're both city names. And so this is a technique that I was not planning to teach here, but it is taught in CS230, so you can read up on word embeddings or look at some of the videos or lessons from CS230 if you want to. [00:45:29] So the word embeddings technique, this is a technique from neural networks, will reduce the number of training examples you need to learn a good text classifier, because it comes in with more knowledge built in.

[00:45:51] By the way, do we cover this in the other classes? No? Okay, no, we don't cover that here. Actually, CS224N, I think, also covers
[00:46:13] this. Yeah, the NLP class, sure.

Okay, so, support vector machines. SVMs. [00:46:39] Um, let's see: consider a classification problem where the dataset looks like this, and so you want an algorithm to find, you know, a nonlinear decision boundary, right? So the support vector machine will be an algorithm to help us find potentially very, very nonlinear decision boundaries like this. [00:47:08] Now, one way to build a classifier like this would be to use logistic regression. But if this is x1 and this is x2, then logistic regression will fit a straight-line decision boundary to the data. So one way to apply logistic regression to a dataset like this would be to take your feature vector x1, x2 and map it to a higher-dimensional feature vector with, you know, x1, x2, x1 squared, x2 squared, x1 times x2, maybe x1 cubed, x2 cubed, and so on, and have a new feature vector,
[00:47:43] which we'll call phi of x, that has these higher-dimensional features. Now, it turns out that if you do this and then apply logistic regression to this augmented feature vector, then logistic regression can learn nonlinear decision boundaries. With these added features, logistic regression can actually learn a decision boundary that has, say, the shape of an ellipse. [00:48:07] But manually choosing these features is a little bit of a pain, right? You know, I actually don't know what type of set of features could get you a decision boundary you like, something more complex than just an ellipse. What we will see with support vector machines is that we will be able to derive an algorithm that can take, say, input features x1, x2, map them to a much higher-dimensional set of features, and then apply a linear classifier.
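As a sketch of that first approach, explicit features plus plain logistic regression, here is a toy version. The dataset, learning rate, and iteration count are all made up for illustration; the true boundary is the circle x1 squared plus x2 squared equals 1, which is linear in the phi-space below.

```python
import math
import random

def phi(x1, x2):
    # Augmented feature vector: the squared and cross terms are what let a
    # linear classifier in phi-space trace an ellipse in (x1, x2)-space.
    return [1.0, x1, x2, x1 * x1, x2 * x2, x1 * x2]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy dataset (made up): label 1 inside the unit circle, 0 outside.
random.seed(0)
points = [(random.uniform(-2, 2), random.uniform(-2, 2)) for _ in range(300)]
X = [phi(x1, x2) for x1, x2 in points]
y = [1.0 if x1 * x1 + x2 * x2 < 1.0 else 0.0 for x1, x2 in points]

# Plain batch gradient ascent on the logistic log-likelihood.
theta = [0.0] * 6
for _ in range(1500):
    grad = [0.0] * 6
    for features, label in zip(X, y):
        error = label - sigmoid(sum(t * f for t, f in zip(theta, features)))
        for j in range(6):
            grad[j] += error * features[j]
    theta = [t + 0.1 * g / len(X) for t, g in zip(theta, grad)]

def predict(features):
    return sigmoid(sum(t * f for t, f in zip(theta, features))) > 0.5

accuracy = sum(predict(f) == (label == 1.0) for f, label in zip(X, y)) / len(y)
print(accuracy)  # high, even though the boundary in (x1, x2) is a circle
```

The classifier itself is still linear in theta; all of the nonlinearity lives in the hand-built phi, which is exactly the pain point that kernels will remove.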
[00:48:42] It's similar to logistic regression but different in the details, and it allows you to learn very nonlinear decision boundaries. And I think, you know, one of the reasons support vector machines are used today is that the SVM is a relatively turnkey algorithm, and what I mean by that is it doesn't have too many parameters to fiddle with. [00:49:05] Even for logistic regression or for linear regression, you know, you might have to tune the gradient descent parameter, tune the learning rate, sorry, change the learning rate alpha, and that's just another thing to fiddle with: you try a few values and hope you didn't mess up how you set that value. [00:49:22] Whereas a support vector machine today has very robust, very mature software packages that you can just download to train a support vector machine on, you know, on a problem, and you just run it, and
[00:49:34] the algorithm will kind of converge without you having to worry too much about the details. So I think, in the grand scheme of things today, I would say support vector machines are not as effective as neural networks for many problems, but one redeeming property of support vector machines is that they're turnkey: you kind of just turn the key and it works, and there aren't as many parameters, like the learning rate and other things, that you have to fiddle with.

[00:50:09] So the roadmap is that we're going to develop the following set of ideas. We'll talk about the optimal margin classifier today, and we'll start with the separable case. What that means is that we're going to start off with datasets that we assume look like this and that are linearly separable. And so the optimal margin classifier is the basic building block of a support vector machine, and we'll first derive an algorithm
[00:50:47] that'll have some similarities to logistic regression, but that allows us to scale in an important way, to find a linear classifier for training sets like this that we assume, for now, can be linearly separated. So we'll do that today. And then what you'll see on Wednesday, excuse me, next Monday, what you'll see next Monday is an idea called kernels. [00:51:13] And the kernel idea is one of the most powerful ideas in machine learning. It's this: how do you take a feature vector x, maybe in R2, and map it to a much higher-dimensional set of features? In our example there, that was R5, right? And then train an algorithm on this higher-dimensional set of features. And the cool thing about kernels is that this higher-dimensional set of features may not be R5: it might be R^100,000, or it might even be infinite-dimensional.
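A tiny numeric check of why that can even be possible, with made-up vectors: for the quadratic kernel K(x, z) = (x . z)^2, the value computed in the original R^3 equals the inner product of explicit feature vectors in R^9 holding all pairwise products x_i x_j, so the higher-dimensional inner product never has to be formed.

```python
def quadratic_kernel(x, z):
    # K(x, z) = (x . z)^2, computed entirely in the original space.
    return sum(a * b for a, b in zip(x, z)) ** 2

def phi(x):
    # The explicit feature map this kernel corresponds to: all pairwise
    # products x_i * x_j. For x in R^n, phi(x) lives in R^(n*n).
    return [a * b for a in x for b in x]

x = [1.0, 2.0, 3.0]
z = [4.0, 0.5, -1.0]

print(quadratic_kernel(x, z))                      # 4.0
print(sum(a * b for a, b in zip(phi(x), phi(z))))  # also 4.0
```

The kernel costs n multiplications, while the explicit route needs n squared features per vector; for a Gaussian kernel the corresponding phi is infinite-dimensional, which is exactly the point being made here.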
[00:51:52] And so with the kernel formulation, we can really take, you know, the original set of features that you were given, for the houses you're trying to sell, you know, or the medical conditions you're trying to predict, and map this two-dimensional feature vector space into maybe an infinite-dimensional set of features. And what this does is it relieves us from a lot of the burden of manually picking features, right? Like, do you want to have the square root of x1, or maybe x1 times x2 to the power of two-thirds? You just don't have to fiddle with these features too much, because the kernels will allow you to choose from an infinitely large set of features. [00:52:26] Okay, and then finally we'll talk about the inseparable case. So I'm going to do the first part today and the rest next Monday.

[00:52:55] And by the way, you know, the machine learning world has become a little... it's funny, I think: if you read the news, the media talks a lot about machine learning, and the media just talks about, you know, neural networks all the time, right? And you hear about neural
networks and deep learning, which we'll get to later in this class. But if you look at what actually happens in practice in machine learning, the set of algorithms actually used in practice is much wider than neural networks and deep learning. So we do not live in a neural-networks-only world; we actually use many, many tools in machine learning. It's just that deep learning attracts the attention of the media in a way that's quite disproportionate to what I find useful. You know, I love them, but they're not the only thing in the world. [00:53:44] And so, yeah, late last night I was talking to an engineer about factor analysis, which you'll learn about later in CS229, right, an unsupervised learning algorithm, and there's an application that one of my teams is working on in manufacturing where we're going to use factor analysis or something
very similar to it, which is totally not a neural network technique, right? So there are all these other techniques, including support vector machines, that I think you can use and that are important. [00:54:11] All right, so let's start developing the optimal margin classifier. [00:54:31] So first let me define the functional margin, which, informally, is this: the functional margin of a classifier is how confidently and accurately you classify an example. So here's what I mean. We're going to look at binary classification, and we're going to start by motivating this with logistic regression. So there's a classifier h_theta(x) equal to the logistic function applied to theta transpose x, and if you turn this into a binary classifier, if you have this algorithm predict not a probability but predict 0 or 1, then what
the classifier will do is predict 1 if theta transpose x is greater than or equal to 0, and predict 0 otherwise, because theta transpose x greater than 0 means that g(theta transpose x) is greater than 0.5. (You can make it greater-than or greater-than-or-equal; if it's exactly 0.5 it doesn't really matter what you do.) And so you predict 1 if theta transpose x is greater than or equal to 0, meaning that the estimated probability of the class being 1 is greater than 50/50, and if theta transpose x is less than 0 then you predict that the class is 0. Okay, so this is what happens if you have logistic regression output 1 or 0 rather than output a probability. [00:56:09] So in other words, this means that if y(i) is equal to 1, then what we hope is that theta transpose x(i) is much greater than 0, and this double greater-than sign
means much greater, right? Because if the true label is 1, then if the algorithm is doing well, hopefully theta transpose x will be very positive, so that the output probability is very, very close to 1. And if indeed theta transpose x is much greater than zero, then g(theta transpose x) will be very close to one, which means it's giving a very accurate and confident prediction that the class is one. And if y is equal to zero, then what we want, or what we hope, is that theta transpose x(i) is much less than zero, because if this is true then the algorithm is doing very well on this example. [00:57:39] So the functional margin, which we'll define in a second, captures this idea: if the classifier has a large functional margin, it means that these two statements are true.
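As a quick numerical sketch of those two statements (the parameter values below are made up for illustration, not from the lecture):

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(theta, x):
    """Thresholded logistic regression: predict 1 iff theta^T x >= 0."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1 if z >= 0 else 0

# theta^T x much greater than 0  ->  g(theta^T x) very close to 1
# theta^T x much less than 0     ->  g(theta^T x) very close to 0
print(sigmoid(6.0))   # confident positive: probability near 1
print(sigmoid(-6.0))  # confident negative: probability near 0
```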
Separately, there's a different thing we'll define in a second, the so-called geometric margin, and that's the following. For now, let's assume the data is linearly separable. [00:58:16] So let's say that's the data set. Now, that seems like a pretty good decision boundary for separating the positive and negative examples. [00:58:36] And that's another decision boundary, in red, that also separates the positive and negative examples, but somehow the green line looks much better than the red line. So why is that? Well, the red line comes really close to a few of the training examples, whereas the green line, you know, has a much bigger separation, a much bigger distance from the positive and negative examples. So even though the red line and the green line both, you know, perfectly separate the positive and negative examples, the green line has a much bigger separation, which is called the geometric margin: a much bigger geometric margin, meaning a physical separation
from the training examples even as it separates them, okay? [00:59:27] And so what I'd like to do in, I guess, the next 20 minutes is formalize the definition of the functional margin, formalize the definition of the geometric margin, and then pose the optimal margin classifier, which is basically an algorithm that tries to maximize the geometric margin. So what the rudimentary SVM does, also called the optimal margin classifier, is pose an optimization problem to try to find the green line to classify these examples. [01:00:08] So now, in order to develop SVMs, I'm going to change the notation a little bit, because these algorithms have different properties, and using slightly different notation makes the math easier. So when developing SVMs, we're going to use minus 1 and plus 1 to denote the class labels. [01:00:33] And we're going to have the output, so
rather than having the hypothesis output a probability like you saw in logistic regression, the support vector machine will output either minus 1 or plus 1. And so g(z) becomes minus 1 or 1: output 1 if z is greater than or equal to 0, and minus 1 otherwise. So instead of a smooth transition from 0 to 1, we have a hard transition, an abrupt transition, from negative 1 to plus 1. [01:01:35] And finally, where previously we had, for logistic regression, h_theta(x) = g(theta transpose x), where x was in R^(n+1) with x0 = 1, for the SVM we will have, let me just write this out: the parameters of the SVM will be w and b, and the hypothesis applied to x will be h_{w,b}(x) = g(w transpose x + b). And we're dropping the x0 = 1 convention, so we separate out w and b as follows. This is the standard notation used to develop support vector machines.
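A minimal sketch of this notation change in code (the particular numbers are invented, chosen only to exercise both sides of the threshold):

```python
def g(z):
    """Hard threshold used by the SVM: +1 if z >= 0, else -1."""
    return 1 if z >= 0 else -1

def h(w, b, x):
    """SVM hypothesis h_{w,b}(x) = g(w^T x + b); b plays the role of
    theta_0 and w plays the role of (theta_1, ..., theta_n)."""
    return g(sum(wi * xi for wi, xi in zip(w, x)) + b)

# hypothetical parameters and inputs
w, b = [2.0, -1.0], 0.5
print(h(w, b, [1.0, 1.0]))   # w^T x + b = 1.5  -> +1
print(h(w, b, [0.0, 3.0]))   # w^T x + b = -2.5 -> -1
```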
One way to think about this is: if the parameters were, you know, theta_0, theta_1, theta_2, theta_3, then theta_0 is the new b and (theta_1, theta_2, theta_3) is the new w. So you just separate out the theta_0, which previously multiplied x0 = 1, and so on. And so this term here becomes the sum from i equals 1 through n of w_i x_i, plus b, which takes the place of theta_0 times x0. [01:03:35] All right, so let me formalize the definition of the functional margin. [01:03:49] So the parameters w and b define a linear classifier, right, you know what the form is, I just wrote it down. The parameters w and b define a hyperplane; really it defines a line, or in higher dimensions it'd be a plane or a hyperplane, a straight boundary separating the positive and negative examples. And so we're going to say the functional margin of a hyperplane defined by (w, b) with respect to one training example is written as gamma-hat(i), and hyperplane just means a straight line, right, but in high dimensions, so this is a linear classifier. So it's just,
you know, the functional margin of this classifier with respect to one training example, which we're going to define as gamma-hat(i) = y(i) (w transpose x(i) + b). And so if you compare this with the equations we had up there: if y equals one we hope for that, and if y equals zero we hope for that, so really what we hope for is for the classifier to achieve a large functional margin, right? And so if y(i) equals 1, then what we want, or what we hope for, is that w transpose x(i) + b is much greater than 0, and if y(i) is equal to minus 1, then what we hope is that this is much smaller than zero. And if you kind of combine these two statements, if you take y(i) and multiply it with w transpose x(i) + b, then these two statements together are basically saying that you hope that gamma-hat(i) is much greater than 0, because y(i), you know, is plus 1 or minus 1. And so if y is equal to 1, you want this to be very, very large, and if y is negative 1, you want this to be a
very, very large negative number. And so either way, it's just saying that you hope gamma-hat(i) will be very large. [01:06:38] And as an aside, one property of this as well is that so long as gamma-hat(i) is greater than 0, that means that either w transpose x(i) + b is bigger than 0 or it is less than 0, depending on the sign of the label, and it means that the algorithm gets this one example correct, right? And if gamma-hat(i) is much greater than 0, then, you know, the analogy in the logistic regression case is: being a little bit above 0 here corresponds to the predicted probability being a little bit above or below 0.5, so it at least gets it right, and being much greater than 0 or much less than 0 corresponds to the probability output in the logistic regression case being very close to one or very close to zero.
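That aside is easy to check directly; here's a sketch of the per-example functional margin with a made-up classifier and made-up points:

```python
def functional_margin(w, b, x, y):
    """gamma_hat(i) = y(i) * (w^T x(i) + b), with labels y in {-1, +1}."""
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b)

w, b = [1.0, 1.0], -1.0  # hypothetical linear classifier

# Correctly classified examples have a positive functional margin,
# whether the label is +1 or -1; a misclassified one has a negative margin.
print(functional_margin(w, b, [2.0, 2.0], 1))    # 3.0: correct and confident
print(functional_margin(w, b, [0.0, 0.0], -1))   # 1.0: correct
print(functional_margin(w, b, [0.0, 0.0], 1))    # -1.0: misclassified
```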
[01:08:00] So the next definition: I'm going to define the functional margin with respect to the training set to be gamma-hat equals the min over i of gamma-hat(i), where i ranges over your training examples, okay? So this is a worst-case notion. With this definition of the functional margin, on the left we defined the functional margin with respect to a single training example, which is how you are doing on that one training example, and we define the functional margin with respect to the entire training set as how you are doing on the worst example in your training set. This is a little bit of a brittle notion, but for now, for today, we're assuming that the training set is linearly separable. So I'm going to assume that the training set, you know, looks like this and is separable by a straight line; we'll relax this later. But because we're assuming just for today that the training set is linearly separable.
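That worst-case definition is a one-liner; here's a sketch on a small invented training set:

```python
def functional_margin(w, b, x, y):
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b)

def training_set_functional_margin(w, b, X, Y):
    """gamma_hat = min_i gamma_hat(i): the margin of the worst example."""
    return min(functional_margin(w, b, x, y) for x, y in zip(X, Y))

# tiny, linearly separable toy data (made up)
X = [[2.0, 2.0], [3.0, 1.0], [0.0, 0.0], [-1.0, 0.0]]
Y = [1, 1, -1, -1]
w, b = [1.0, 1.0], -1.0

# positive result means every example is classified correctly
print(training_set_functional_margin(w, b, X, Y))  # 1.0, from [0, 0]
```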
We'll use this kind of worst-case notion and define the functional margin to be the functional margin of the worst training example. [01:09:08] Now, one thing about the definition of the functional margin is that it's actually really easy to cheat and increase the functional margin, right? One thing you can do, if you look at this formula, is take w and multiply it by two, and take b and multiply it by two; then everything here just multiplies by two, and you've doubled the functional margin, right? But you haven't actually changed anything meaningful, okay? So one way to cheat on the functional margin is just by scaling the parameters by two, or instead of two, maybe you multiply all your parameters by ten, and then you've actually increased the functional margin on your training examples 10x, but this doesn't actually change the decision boundary, right? It doesn't actually change any
classification just to multiply all of your parameters by a factor of ten. [01:10:05] So one thing you could do would be to normalize the length of your parameters. So for example, hypothetically, you could impose the constraint that the norm of w is equal to one. Another way to do that: we could take w and b and replace them with w divided by the norm of w and b divided by the norm of w; that is, divide your parameters by the magnitude, by the Euclidean length, of the parameter vector w. And this doesn't change any classification; it's just rescaling the parameters, but it prevents, you know, this way of cheating on the functional margin, okay? And in fact, more generally, you could actually scale w and b by any other value you want and it doesn't matter; you could choose to replace them with w over 17 and b over 17, or any other value, right, and the classification stays the same, okay?
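Both points, the cheat and the fix, can be verified numerically (toy numbers, not from the lecture):

```python
def functional_margin(w, b, x, y):
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b)

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

def norm(w):
    return sum(wi * wi for wi in w) ** 0.5

w, b = [1.0, 1.0], -1.0
x, y = [2.0, 2.0], 1

# The cheat: scaling (w, b) by 10 inflates the functional margin tenfold...
w10, b10 = [10 * wi for wi in w], 10 * b
print(functional_margin(w, b, x, y), functional_margin(w10, b10, x, y))

# ...but the prediction, i.e. the decision boundary, is unchanged.
print(predict(w, b, x) == predict(w10, b10, x))

# The fix: after dividing (w, b) by ||w||, the margin no longer depends
# on the scale you started from.
n, n10 = norm(w), norm(w10)
print(functional_margin([wi / n for wi in w], b / n, x, y))
print(functional_margin([wi / n10 for wi in w10], b10 / n10, x, y))
```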
So we'll come back and use this property. [01:11:26] All right, so having defined the functional margin, let's define the geometric margin, and you'll see in a second how the geometric margin and the functional margin relate to each other. Um, so let's define the geometric margin with respect to a single example. So let's see, let's say you have a classifier, right? So given parameters w and b, that defines a linear classifier, and the equation w transpose x + b = 0 defines the equation of a straight line. So the axes here are x1 and x2, and in this half of the plane you have w transpose x + b greater than 0, and in this half of the plane you have w transpose x + b less than 0, and in between is the straight line given by this equation w transpose x + b = 0, right? And so, given parameters w and b, the upper right is where your classifier will predict y
equals 1, and the lower left is where it'll predict y equal to negative 1, okay? Now let's say you have one training example here, so that's a training example (x(i), y(i)), and let's say it's a positive example, okay? And so your classifier is classifying this example correctly, right, because it's in the upper-right half-plane, the half-plane where w transpose x + b is greater than 0, and so in this upper-right region your classifier is predicting +1, whereas in this lower-left region it'd be predicting h(x) equals negative 1. And that's why this straight line, where it switches from predicting negative to positive, is the decision boundary. [01:13:31] So what we're going to do is define this distance to be the geometric margin of this training example; that Euclidean distance is what we're defining to be the geometric margin. So let me just write down what that is. [01:14:03] So the geometric margin, you know, of the
classifier, of the hyperplane defined by (w, b), with respect to one example (x(i), y(i)): this is going to be gamma(i) equals (w transpose x(i) + b) divided by the norm of w. And let's see, I'm not proving why this is the case; the proof is given in the lecture notes, and the lecture notes show why this is the right formula for measuring the Euclidean distance that I just drew in the picture up there, okay? So I'm not proving this here, but this turns out to be the way you compute the Euclidean distance between an example and the decision boundary, okay? Um, and this is for the positive example. I guess more generally we're going to define the geometric margin to be equal to gamma(i) = y(i) (w transpose x(i) + b) divided by the norm of w, and this definition applies to positive examples and to negative examples. And so the relationship between the geometric margin and the functional margin is that the geometric margin equals the functional margin divided by the norm of w.
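Here's that relationship as code; the weights and the example point are invented, chosen so the distance comes out round:

```python
def norm(w):
    return sum(wi * wi for wi in w) ** 0.5

def geometric_margin(w, b, x, y):
    """gamma(i) = y(i) * (w^T x(i) + b) / ||w||: the functional margin
    divided by ||w||, i.e. the signed Euclidean distance from x(i)
    to the hyperplane w^T x + b = 0."""
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm(w)

# ||w|| = 5, and this point sits 25/5 = 5 units from the boundary
print(geometric_margin([3.0, 4.0], 0.0, [3.0, 4.0], 1))  # 5.0

# Unlike the functional margin, it is invariant to rescaling (w, b):
print(geometric_margin([6.0, 8.0], 0.0, [3.0, 4.0], 1))  # still 5.0
```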
[01:15:52] Finally, the geometric margin with respect to the training set: we're again going to use this worst-case notion. Look through all your training examples and pick the worst possible training example, and that is your geometric margin on the training set. Oh, and so I hope the notation is clear, right: gamma-hat was the functional margin and gamma is the geometric margin. [01:17:06] And so what the optimal margin classifier does is choose the parameters w and b to maximize the geometric margin, okay? So in other words, this is the optimal margin classifier; it's the baby SVM, the SVM for linearly separable data, at least for today. So the optimal margin classifier would choose that straight line, because that straight line maximizes the distance, or maximizes the geometric margin, to all of these examples. Now, how do you pose this mathematically?
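In symbols, the problem being posed here (and, skipping the rewriting steps that are deferred to the lecture notes, the standard convex form it is eventually turned into) can be written as:

```latex
% maximize the worst-case geometric margin
\max_{\gamma,\, w,\, b} \ \gamma
\quad \text{s.t.} \quad
\frac{y^{(i)}\left(w^{\top} x^{(i)} + b\right)}{\lVert w \rVert} \ \ge\ \gamma,
\quad i = 1, \dots, m

% equivalent convex reformulation (after exploiting the rescaling freedom)
\min_{w,\, b} \ \tfrac{1}{2}\,\lVert w \rVert^{2}
\quad \text{s.t.} \quad
y^{(i)}\left(w^{\top} x^{(i)} + b\right) \ \ge\ 1,
\quad i = 1, \dots, m
```

The second form is the one solved in practice, since it's a quadratic program with linear constraints.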
[01:18:07] But I'll just describe the beginning step and the last step, and leave the in-between steps to the lecture notes. [01:18:14] It turns out that one way to pose this problem is: maximize, over gamma, w, and b, the value of gamma — so you want to maximize the geometric margin — subject to [01:18:43] the constraint that every training example must have geometric margin greater than or equal to gamma, right? So you want gamma to be as big as possible, subject to every single training example having at least that margin. [01:18:59] This causes you to maximize the worst-case geometric margin. And it turns out that, in this form, this isn't a convex optimization problem, so it's difficult to solve — you can't just, you know, run gradient descent on it, there's no off-the-shelf software for it, and so on. [01:19:13] But it turns out that by a few steps of rewriting you can reformulate this problem into an equivalent problem.
[01:19:21] The equivalent problem is to minimize the norm of w subject to a constraint on the functional margin. [01:19:34] And so — I hope this problem makes sense, right? This problem is just, you know: solve for w and b to make sure that every example has geometric margin greater than or equal to gamma, and you want gamma to be as big as possible. [01:19:46] So that's a way to write down the optimization problem that says: maximize the geometric margin. And what we show in the lecture notes is that through a few steps you can rewrite this optimization problem into the following equivalent form, which is to minimize the norm of w subject to this. [01:20:05] And maybe one piece of intuition to take away is that, you know, the smaller w is, the bigger the margin — the less of a normalizing division effect you have, right? But the details are given in the lecture notes. [01:20:20] Okay, this turns out to be a convex optimization problem.
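To see numerically that the two formulations agree, here's a small Python sketch on toy made-up data (a crude angle search stands in for the convex solver — not how you'd solve a real SVM): form 1 maximizes the geometric margin directly, while form 2 rescales (w, b) so the worst functional margin is 1 and reads the margin off as 1/||w||.

```python
import math

# Toy linearly separable 2-D data (hypothetical, not from the lecture)
X = [(2.0, 2.0), (3.0, 1.0), (0.0, 0.0), (-1.0, 1.0)]
Y = [1, 1, -1, -1]

def geom_margin(w, b):
    # worst-case geometric margin of the line w . x + b = 0 on (X, Y)
    return min(y * (w[0] * x[0] + w[1] * x[1] + b)
               for x, y in zip(X, Y)) / math.hypot(*w)

def best_b(w):
    # for a fixed direction w, the margin-maximizing intercept centers the
    # boundary between the closest positive and negative projections
    pos = min(w[0] * x[0] + w[1] * x[1] for x, y in zip(X, Y) if y == 1)
    neg = max(w[0] * x[0] + w[1] * x[1] for x, y in zip(X, Y) if y == -1)
    return -(pos + neg) / 2.0

# Form 1: maximize the geometric margin directly (search over unit directions)
gamma_star, w, b = max(
    (geom_margin((math.cos(t), math.sin(t)), best_b((math.cos(t), math.sin(t)))),
     (math.cos(t), math.sin(t)),
     best_b((math.cos(t), math.sin(t))))
    for t in (i * 2 * math.pi / 3600 for i in range(3600))
)

# Form 2: rescale (w, b) so the worst functional margin is 1; then the
# geometric margin is exactly 1 / ||w||, which is what the convex problem
# "minimize ||w||^2 subject to y_i (w . x_i + b) >= 1" maximizes.
fm = min(y * (w[0] * x[0] + w[1] * x[1] + b) for x, y in zip(X, Y))
w_scaled = (w[0] / fm, w[1] / fm)
print(gamma_star, 1.0 / math.hypot(*w_scaled))  # both ≈ 1.41421 (= sqrt(2))
```

The point of the reformulation is exactly this last line: once the functional margin is pinned to 1, maximizing the margin and minimizing ||w|| are the same problem, and the latter is convex.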
[01:20:24] And if you optimize this, then you will have the optimal margin classifier. There are very good numerical optimization packages to solve this optimization problem, and if you give this a data set then, you know — assuming your data is separable; we'll fix that assumption when we reconvene next week — you have the optimal margin classifier, which will be the baby SVM. And when we add kernels to it, then you have the full support vector machine. [01:20:46] All right, let's break for the day. See you guys.

================================================================================ LECTURE 007 ================================================================================ Lecture 7 - Kernels | Stanford CS229: Machine Learning Andrew Ng (Autumn 2018) Source: https://www.youtube.com/watch?v=8NYoQiRANpg --- Transcript

[00:00:03] All right, good morning, let's get started. So today you'll see the support vector machine algorithm, and this is one of my favorite algorithms because it's a very turnkey solution to classification problems. [00:00:24] So in particular, I'll talk a bit more about the optimization problem you have to solve in the support vector machine, then talk
[00:00:32] about something called the representer theorem, and this will be a key idea for how we'll work in potentially very high-dimensional — like 100,000-dimensional, or a billion-dimensional, or 100-billion-dimensional, or even infinite-dimensional — feature spaces. [00:00:46] It will teach you how to represent feature vectors, and how to represent parameters that may be, you know, a hundred billion-dimensional or a hundred trillion-dimensional or infinite-dimensional. [00:01:00] And based on this we'll derive kernels, which are the mechanism for working in these incredibly high-dimensional feature spaces, and then hopefully, time permitting, wrap up with a few examples of concrete implementations of these ideas. [00:01:16] So to recap: last Wednesday we started to talk about the optimal margin classifier, which said that given a data set that looks like this, you want to find the decision boundary with the
greatest possible [00:01:30] geometric margin, right? So the geometric margin can be calculated by this formula — and this is just the derivation in the lecture notes, you know, measuring the distance to the nearest point — and for now let's assume the data can be separated by a straight line. [00:01:49] And so gamma^(i) — this is, sort of, the geometry derivation in the lecture notes — this is the formula for computing the distance from the example (x^(i), y^(i)) to the decision boundary governed by the parameters w and b. And gamma is the worst-case geometric margin, right: [00:02:13] of all of your m training examples, which one has the worst possible geometric margin? And so the optimal margin classifier will try to make this as big as possible. [00:02:26] And by the way, what you'll see later on is that the support vector machine is basically this algorithm — the optimal margin classifier —
[00:02:32] plus kernels, meaning we'll take this idea and apply it in a hundred-billion-dimensional feature space; that's the support vector machine, okay? [00:02:43] Um, so one thing I didn't have time to talk about on Wednesday was the derivation of this optimization problem — where does this optimization objective come from? So let me just go over that very briefly. [00:03:00] The way we motivated these definitions was to say that, given a training set, you want to find the decision boundary, parametrized by w and b, that maximizes the geometric margin, right? [00:03:12] And so, again, your classifier outputs g(w^T x + b), and so you want to find the parameters w and b — they define the decision boundary where your classification switches from positive to negative — that maximize the geometric margin. [00:03:30] And so one way to pose this as an optimization problem is, let's see, to try to find the biggest
[00:03:39] possible value of gamma, subject to [00:03:53] the constraint that the geometric margin must be greater than or equal to gamma, right? So in this optimization problem, the parameters you get to fiddle with are gamma, w, and b, and if you solve this optimization problem then you are finding the values of w and b that define the straight line that is the decision boundary. [00:04:17] So this constraint says that every example, right — [00:04:25] every example has geometric margin greater than or equal to gamma; that's what it's saying — and you want to set gamma as big as possible, which means that you're maximizing the worst-case geometric margin. [00:04:42] Does this make sense? So the only way to make gamma, say, 17 or 20 or whatever, is if every training example has geometric margin bigger than 17, right? [00:04:55] And so this optimization problem is trying to find w and b to drive gamma up as big as
possible, and have every [00:05:02] example have geometric margin even bigger than gamma. So this optimization problem causes you to find w and b with as big a geometric margin as possible — as big a worst-case geometric margin as possible — okay? [00:05:21] So does this make sense? Actually, raise your hand if this makes sense. Oh, okay, well, many of you. All right, let me say this in a slightly different way. [00:05:33] Um, let's see: if a few of your training examples' geometric margins are, you know, 17, 2, and 5, [00:05:43] right, then the geometric margin in this case is the worst-case value, 2, right? And so if you are solving an optimization problem where I want — where I want the min over i of gamma^(i) to be as big as possible — [00:06:04] one way to enforce this is to say that gamma^(i) must be greater than or equal to gamma for every possible value of i.
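That 17, 2, 5 example can be checked in a few lines: "maximize gamma subject to gamma^(i) >= gamma for all i" is feasible exactly up to the smallest per-example margin, so the largest feasible gamma is the worst-case value, 2.

```python
margins = [17.0, 2.0, 5.0]  # per-example geometric margins from the lecture

def feasible(gamma):
    # the constraint set: every example's margin must be at least gamma
    return all(g >= gamma for g in margins)

# the largest feasible gamma (searched on a coarse grid) is min(margins)
candidates = [g / 100.0 for g in range(0, 2001)]
best = max(g for g in candidates if feasible(g))
print(best)  # → 2.0
```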
[00:06:09] And then I'm going to lift gamma up as much as possible, right? Because the only way to lift gamma up, subject to this, is if every value of gamma^(i) is bigger than it. And so lifting gamma up — maximizing gamma — effectively maximizes the worst-case example's geometric margin, which is how we've defined this optimization, okay? [00:06:37] And then the last step, to turn this problem into the one on the left, is this interesting observation that you might remember from when we talked about the functional margin — which is the numerator here — that, you know, with the functional margin you can scale w and b by any number and the decision boundary stays the same. [00:06:59] And so, you know, if your classifier is — so this is g(w^T x + b), right — so if w [00:07:12] was the vector (2, 1), let's say that's the classifier, right, then you can take w and b and
[00:07:28] multiply them by any number you want — I can multiply this by 10 — and the point is, it's the same straight line, right? [00:07:38] So if I take — let's see, with this w = (2, 1) and b = -2, this actually defines a decision boundary that looks like that: if this is x1 and this is x2, then this is the equation of the straight line where w^T x + b = 0 — right, the intercepts are 1 and 2. You can verify for yourself: if you plug in this point, then w^T x + b equals 0, and if you plug in that point, again w^T x + b equals 0. [00:08:16] And so that's the decision boundary, [00:08:17] where the SVM will predict positive everywhere up here and predict negative everywhere to the lower left. And this straight line, you know, stays the same no matter how you multiply these parameters by any constant, okay? [00:08:38] And so, to simplify this, notice that you can choose anything you want for the norm of w, right — just by scaling this by a factor of ten
[00:08:47] you can increase it, or scale by a factor of one over ten and you can decrease it. So you have the flexibility to scale the parameters w and b, you know, up or down by any fixed constant without changing the decision boundary. [00:08:58] And so the trick to simplify this equation into that one is: choose to scale w so that the norm of w is equal to 1 over gamma. Because if you do that, then this optimization objective becomes [00:09:30] maximize 1 over the norm of w, subject to the constraint, [00:09:44] right — if you substitute norm of w equals 1 over gamma, then that cancels out — and so you end up with this optimization problem: instead of maximizing 1 over the norm of w, you minimize one half the norm of w squared, subject to this, okay? [00:10:13] And so that's — I know I did this relatively quickly; again, as usual, the full derivation is written in the lecture notes, but hopefully this gives you a flavor for why, if you solve this optimization problem, minimizing over w and b, you are solving for the parameters w and b that give you the optimal margin classifier, okay?
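A quick Python sketch of this scaling invariance (a made-up test point, with w = (2, 1), b = -2 as on the board): rescaling (w, b) by any positive constant c leaves the prediction and the geometric margin unchanged, while the functional margin scales by c — which is exactly the freedom the 1/gamma normalization exploits.

```python
import math

def predict(w, b, x):
    # classifier: the sign of w . x + b
    s = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if s >= 0 else -1

def functional_margin(w, b, x, y):
    return y * (sum(wj * xj for wj, xj in zip(w, x)) + b)

def geometric_margin(w, b, x, y):
    return functional_margin(w, b, x, y) / math.hypot(*w)

w, b = [2.0, 1.0], -2.0          # the (2, 1), b = -2 classifier from the board
x, y = [2.0, 1.0], 1             # a made-up positive example
for c in [1.0, 10.0, 0.1]:       # rescale (w, b) by any constant c > 0
    wc, bc = [c * wj for wj in w], c * b
    print(predict(wc, bc, x),                # prediction: unchanged
          functional_margin(wc, bc, x, y),   # scales by c
          geometric_margin(wc, bc, x, y))    # invariant
```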
[00:10:35] Now, the optimal margin classifier — we've been deriving this algorithm as if, you know, the features x^(i) — let's see — we've been deriving this algorithm as if the features x^(i) are some reasonable-dimensional feature vector, x in R^2, or R^100 or something. [00:10:59] Um, what we will talk about later is the case where the features x^(i) become, you know, 100-trillion-dimensional, right, or infinite-dimensional. [00:11:15] And what we will assume is that w can be represented as a sum — as a linear combination — of the training examples. [00:11:31] So, um, in order to derive the support vector machine, we're going to make an additional restriction that the parameters w can be expressed as a linear combination of the training examples, right? And it turns out that when x^(i) is, you know, 100-trillion-dimensional, doing this will
let us [00:11:48] derive algorithms that work even in these hundred-trillion- or infinite-dimensional feature spaces. [00:11:54] Now, I'm describing this just as an assumption; it turns out that there's a theorem called the representer theorem that shows that you can make this assumption without losing any performance. [00:12:06] The proof of the representer theorem is quite complicated and I don't want to do it in this class; the proof of why you can make this assumption is written in the lecture notes, and it's a pretty long and involved proof, involving a primal-dual optimization, so I won't present the whole proof here — but let me give you a flavor for why this is a reasonable assumption to make, okay? [00:12:25] Oh, and just to make things complicated, later on we'll actually do this, right — so, because y^(i) is always plus or minus one, by convention we're going
[00:12:34] to assume that w can be written as w = sum over i of alpha_i y^(i) x^(i), all right? So in this, the y^(i) is plus or minus 1, right, and this makes some of the math come out a little bit easier. But this is still saying that w can be represented as a linear combination of the training examples, okay? [00:12:54] So, um, let me just describe, less formally, why this is a reasonable assumption — and it's actually not just an assumption; the representer theorem proves that, you know, this is just true at the optimal value of w — but let me convey a couple of intuitions for why this is a reasonable thing to do. I see, yes. [00:13:19] So, um, maybe here's intuition number one, and I'm going to refer to logistic regression. [00:13:30] Right — suppose that you run logistic regression with gradient descent, say stochastic gradient descent. Then you would initialize the parameters to be equal to zero at first, and then for each iteration of stochastic gradient descent, right, you would
update [00:13:47] theta: theta gets updated as theta minus a learning rate times, you know, times x — that is, theta := theta - alpha (h_theta(x^(i)) - y^(i)) x^(i) — okay. [00:14:03] And sorry, here alpha is a learning rate; let me not overload the notation — this alpha has nothing to do with that alpha. [00:14:07] But so this is saying that on every iteration you're updating the parameters theta by adding or subtracting some constant times some training example. And so — kind of a proof by induction, right — if theta starts off at zero, and on every iteration of gradient descent you're adding a multiple of some training example, then no matter how many iterations you run gradient descent, theta is still a linear combination of your training examples, okay? [00:14:39] And again — I keep saying theta; that was really theta_0, theta_1, up to theta_n, right, whereas here we have a b and then w_1 down to w_n. [00:14:50] I know this pen's really bad — if you like — all right, yeah, let me throw these away so they
don't keep [00:15:00] haunting us in the future, okay. Right — so I did this with theta rather than with w, but it turns out, if you work through the algebra, this is a little proof by induction that, you know, as you run logistic regression, after every iteration the parameters theta — or the parameters w — are always a linear combination of the training examples. [00:15:22] And this is also true if you use batch gradient descent: if you use batch gradient descent, then the update rule is this. [00:15:38] And so it turns out you can derive gradient descent for the support vector machine learning algorithm as well — you can derive gradient descent to optimize w subject to this — and you can give a proof by induction, you know, that no matter how many iterations you run of gradient descent, w will always be a linear combination of the training examples. [00:15:54] So that's one intuition for how you might see that assuming w is a linear combination of the training examples is a reasonable assumption.
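This induction is easy to watch happening in code. The sketch below runs stochastic gradient ascent for logistic regression on made-up data (bias term omitted for brevity), while tracking the coefficient beta_i that each training example contributes to theta; at the end, theta equals sum_i beta_i x^(i) to within floating-point error:

```python
import math

# Toy data (hypothetical); labels are 0/1 as in logistic regression
X = [[1.0, 2.0], [2.0, 0.5], [-1.0, -1.0], [-2.0, 1.0]]
Y = [1, 1, 0, 0]
alpha = 0.1                      # learning rate

theta = [0.0, 0.0]               # parameters, initialized to zero
beta = [0.0] * len(X)            # beta_i: coefficient of example i in theta

def h(theta, x):                 # sigmoid hypothesis h_theta(x)
    z = sum(t * xj for t, xj in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

for epoch in range(50):
    for i, (x, y) in enumerate(zip(X, Y)):
        g = alpha * (y - h(theta, x))               # scalar step for this example
        theta = [t + g * xj for t, xj in zip(theta, x)]  # theta += g * x^(i)
        beta[i] += g                                # theta stays in span of the x^(i)

# Reconstruct theta from the tracked coefficients: they match
recon = [sum(beta[i] * X[i][j] for i in range(len(X))) for j in range(2)]
print(theta, recon)
```

Every update adds a multiple of some x^(i), so by induction theta never leaves the span of the training examples — the same argument the lecture makes for w in the SVM.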
[00:16:14] I want to present a second set of intuitions, and this one will be easier if you're good at visualizing high-dimensional spaces, I guess. But let me just give intuition number two, which is — um, let's see. [00:16:35] So, first of all, let's take our example from just now, right? Let's say that the classifier uses this w = (2, 1) and b = -2, right — so this is w and this is b. [00:16:48] Then it turns out that the decision boundary is this, where this intercept is 1 and this is 2. And it turns out that the vector w is always at 90 degrees to the decision boundary, right? This is a fact of, I guess, geometry — or, well, the algebra, right: the vector w, (2, 1) — so the vector w, you know, is sort of two to the right and then one up — [00:17:21] all right, the vector w is always 90 degrees to the decision boundary.
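A quick check of the perpendicularity claim for the board example w = (2, 1), b = -2: take the two intercept points on the line 2 x1 + x2 - 2 = 0, and the direction vector between them has zero dot product with w.

```python
# Two points on the decision boundary 2*x1 + x2 - 2 = 0 (its intercepts)
p, q = (1.0, 0.0), (0.0, 2.0)
w = (2.0, 1.0)
direction = (q[0] - p[0], q[1] - p[1])   # vector along the decision boundary
dot = w[0] * direction[0] + w[1] * direction[1]
print(dot)  # → 0.0, so w is perpendicular to the boundary
```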
where you predict positive from where you predict negative, okay? [00:17:31] And so it turns out that, to take a simple example, let's say you have two training examples, a positive example and a negative example. Then, and this is the linear algebra way of saying this, the vector w lies in the span of the training examples. And the way to picture this is that w sets the direction of the decision boundary, and as you vary b, the position changes: setting different values of b will move the decision boundary back and forth like this, while w pins down the direction. [00:18:24] And just one last example of why this might be true. So, we're going to be working in very, very high-dimensional feature spaces. For this example, let's say you have three features x1, x2, x3, and then later we'll get to where this is like a hundred trillion, right?
and let's [00:18:48] say, for the sake of illustration, that all of your examples lie in the plane of x1 and x2, so let's say x3 is equal to 0. [00:19:00] Okay, so let's say for all of your training examples x3 equals 0. Then the decision boundary, you know, will be some sort of vertical plane that looks like this, right? So this is going to be the plane specified by w transpose x plus b equals 0, where now w and x are three-dimensional. And so the vector w should have w3 equals 0, right? If one of the features is always 0, always fixed, then w3 should be equal to 0, and that's another way of saying that the vector w should be representable in the span of just the two features, and in this case in the span of the training examples, okay? [00:19:57] All right, I'm not sure if either intuition one or intuition two convinced you; I think hopefully that's good enough.
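The two facts in this intuition, that w is at 90 degrees to the decision boundary, and that w picks up no component along a direction the training data never uses, can be checked numerically. This is a small illustrative sketch in numpy with made-up points, not anything from the course materials:

```python
import numpy as np

# Sketch (not course code) using the lecture's example w = [2, 1], b = -2.
w = np.array([2.0, 1.0])
b = -2.0

# Two points on the decision boundary {x : w.x + b = 0}:
p1 = np.array([1.0, 0.0])   # 2*1 + 1*0 - 2 = 0
p2 = np.array([0.0, 2.0])   # 2*0 + 1*2 - 2 = 0

# A direction along the boundary; its dot product with w is 0,
# so w really is perpendicular to the boundary.
d = p2 - p1
print(w @ d)   # 0.0

# Span intuition: if every (hypothetical) training example has x3 = 0,
# then any w = sum_i alpha_i y_i x_i built from them also has w3 = 0.
X = np.array([[1.0, 2.0, 0.0],
              [3.0, 1.0, 0.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.7, 0.4])
w3d = (alpha * y) @ X
print(w3d)   # third component is 0.0
```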
[00:20:05] But the second intuition will be easier if you're used to thinking about very high-dimensional feature spaces. And again, the formal proof of this result, which is called the representer theorem, is given in the lecture notes, but it's, well, actually among the more complicated ones; the full formal derivation of this result is definitely at the high end in terms of complexity. [00:20:42] So let's assume that w can be written as follows. So the optimization problem was this: you want to solve for w and b so that the norm of w squared is as small as possible, subject to y^(i) times (w transpose x^(i) plus b) being greater than or equal to 1 [00:21:06] for every value of i. So let's see, the norm of w squared, this is just equal to w transpose w, and so if you plug in this definition of w, you know, into these equations, you have as the optimization objective min of 1/2, sum from i equals 1 through m, and so on; so this is w transpose w,
[00:22:00] which is equal to, I guess, sum over i, sum over j, of alpha i alpha j y^(i) y^(j), and then x^(i) transpose x^(j). [00:22:16] And so this is an inner product between x^(i) and x^(j), and I'm going to write it using this notation: so ⟨x, z⟩ equals x transpose z is the inner product between two vectors. This is maybe an alternative notation for writing inner products, and when we derive kernels you'll see that expressing your algorithm in terms of inner products between the features x is the key practical step needed to derive kernels. And we'll use this slightly different open-angle-bracket, close-angle-bracket notation to denote the inner product between two different feature vectors. [00:23:03] So that is the optimization objective. Oh, and then this constraint, it becomes something else; I guess this becomes, what is it, y^(i) times w
transpose x^(i) plus b, greater than or equal to one. [00:23:31] And again this simplifies if you just multiply this out. [00:23:48] So, just to make sure the mapping is clear (all these pens are, like, dying), all right: that becomes this, and this becomes that, okay? [00:24:17] And the key property we're going to use is that if you look at these two equations, in terms of how we've posed the optimization problem, the only place that the feature vectors appear is in this inner product. And it turns out, when we talk about the kernel trick, when we talk about the application of kernels, that if you can compute this very efficiently, that's when you can get away with manipulating even infinite-dimensional feature vectors. We'll get to this in a second, but the reason we want to write the whole algorithm in terms of inner products is that there will be important cases where the feature vectors are a hundred trillion dimensional, but you can
compute it, [00:25:01] or in fact even infinite dimensional, and you can compute the inner product very efficiently without needing to loop over, you know, the 100 trillion elements in an array, right? And we'll see exactly how to do that very shortly. [00:25:32] So, all right, it turns out that we've now expressed the whole optimization algorithm in terms of these parameters alpha, defined here, and b. So instead of the parameters theta, the parameters we now need to optimize over are the alphas. It turns out that, by convention, in the way that you'll see support vector machines referred to, you know, in research papers or in textbooks, there's a further simplification of that optimization problem: you can simplify it to this. The derivation to go from that to this is again relatively complicated, but it turns out you can further simplify the optimization problem I wrote there to this.
[00:26:23] And again, you can copy this down if you want, but it's also written in the lecture notes. By convention, this slightly simplified version of the optimization problem is called the dual optimization problem. [00:26:39] The way to simplify that optimization problem to this one is actually by using convex optimization theory, and again the derivation is written in the lecture notes, but I don't want to do that here. If you want, think of it as doing a bunch more algebra to simplify that problem to this one; candidly, along the way it's a little more complicated than that, but the full derivation is given in the lecture notes. [00:27:05] And so, finally, you know, the way you train, or rather the way you make a prediction: you solve for the alpha i's, and maybe for b, right? So you solve this optimization problem, or that optimization problem, for the alpha i's.
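To make the "only inner products" point concrete, here is a minimal numpy sketch (hypothetical variable names, not the official lecture-notes code): once w = sum_i alpha_i y_i x_i, the 1/2 norm-of-w-squared term can be evaluated purely from the Gram matrix of inner products between training examples.

```python
import numpy as np

# Minimal sketch: with w = sum_i alpha_i y_i x_i, the quantity
# (1/2)||w||^2 equals (1/2) sum_i sum_j alpha_i alpha_j y_i y_j <x_i, x_j>,
# so the data enters only through the Gram matrix of inner products.
def half_norm_sq(alpha, y, X):
    G = X @ X.T          # G[i, j] = <x_i, x_j>
    v = alpha * y
    return 0.5 * v @ G @ v

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))   # 5 made-up examples, 3 features
y = np.array([1.0, -1.0, 1.0, -1.0, 1.0])
alpha = rng.random(5)             # placeholder alphas, not dual solutions

w = (alpha * y) @ X               # the assumed representation of w
print(np.isclose(half_norm_sq(alpha, y, X), 0.5 * (w @ w)))   # True
```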
[00:27:36] And then, to make a prediction, you need to compute h_{w,b}(x) for a new test example, which is g of w transpose x plus b. But because of the definition of w, this is equal to g of (sum over i of alpha i y^(i) x^(i)) transpose x, plus b, because this is w, and so that's equal to g of sum over i of alpha i y^(i) times the inner product between x^(i) and x, plus b. [00:28:21] And so, once again, once you have stored the alphas in your computer's memory, you can make predictions using just inner products, right? So the entire algorithm, both the optimization objective you need during training as well as how you make predictions, is expressed only in terms of inner products. [00:28:46] So we're now ready to apply kernels. Sometimes in machine learning people call this the kernel trick, and let me just give you the recipe for what this means. [00:29:02] Step one is to write your algorithm in terms of ⟨x^(i), x^(j)⟩, in terms of inner products.
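The prediction rule just described, h(x) = g(sum_i alpha_i y_i ⟨x_i, x⟩ + b), can be sketched as follows; this is illustrative numpy with placeholder values for alpha and b rather than actual solutions of the dual problem:

```python
import numpy as np

# Sketch of prediction using only inner products with the training
# examples; g thresholds at zero. alpha and b are placeholders here.
def predict(x_new, alpha, y, X_train, b):
    score = (alpha * y) @ (X_train @ x_new) + b   # only inner products
    return 1 if score >= 0 else -1

X_train = np.array([[2.0, 0.0],
                    [0.0, 2.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])
b = 0.0

# Equivalent to sign(w.x + b) with w = sum_i alpha_i y_i x_i:
w = (alpha * y) @ X_train
x = np.array([3.0, 1.0])
print(predict(x, alpha, y, X_train, b), np.sign(w @ x + b))
```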
[00:29:18] And instead of carrying the superscripts, you know, x^(i), x^(j), I'm sometimes going to write the inner product between x and z, where x and z are supposed to be proxies for two different training examples, example i and example j; it just simplifies the notation a little bit. [00:29:31] Step two is to let there be some mapping from your original input features x to some higher-dimensional set of features phi of x. And so one example would be: let's say you're trying to predict whether a particular house will be sold in the next month, so maybe x in this case is the size of the house, right? Maybe x is the size of a house, and so you could take this 1D feature and expand it to a high-dimensional feature vector with x, x squared, x cubed, x to the fourth, right? So this would be one way of defining a high-dimensional feature mapping.
[00:30:31] Well, another one could be: if you have two features x1 and x2, corresponding to the size of the house and the number of bedrooms, you can map this to a different phi of x, which maybe is x1, x2, x1 times x2, x1 squared x2, x1 x2 squared, and so on; kind of a polynomial set of features, or maybe some other set of features as well, okay? [00:30:54] And what we'll be able to do is work with feature mappings phi of x where the original input x may be 1D or 2D or whatever, and phi of x could be, you know, a hundred thousand dimensional or infinite dimensional, but we'll be able to do this very efficiently; yes, even infinite dimensional. Okay, so we'll get to some concrete examples of this later, but I want to give you the overall recipe. [00:31:28] And then step three is to find a way to compute K of x comma z equals phi of x transpose phi of z. So this is called a kernel function.
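The feature mappings and the kernel function from steps two and three can be sketched like this (function names such as phi_1d are illustrative; the kernel here just forms phi explicitly, which is exactly the expensive computation the clever tricks discussed later avoid):

```python
import numpy as np

# Sketch of the example feature mappings: a 1-D house size expanded
# into polynomial features, and 2-D (size, bedrooms) into cross terms.
def phi_1d(x):
    return np.array([x, x**2, x**3, x**4])

def phi_2d(x1, x2):
    return np.array([x1, x2, x1 * x2, x1**2 * x2, x1 * x2**2])

# Step three's kernel K(x, z) = phi(x)^T phi(z), computed naively here
# by materializing phi; efficient kernels avoid building phi at all.
def kernel_1d(x, z):
    return phi_1d(x) @ phi_1d(z)

print(phi_1d(2.0))          # powers of 2: 2, 4, 8, 16
print(kernel_1d(1.0, 2.0))  # 2 + 4 + 8 + 16 = 30.0
```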
[00:31:49] And what we're going to do is, we'll see that there are clever tricks so that you can compute the inner product between x and z even when phi of x and phi of z are incredibly high dimensional, right? We'll see an example of this very, very soon. [00:32:01] And then step four is to replace ⟨x, z⟩ in the algorithm with K(x, z), okay? Because if you can do this, then what you're doing is running the whole learning algorithm on this high-dimensional set of features. And the problem with, you know, swapping out x for phi of x is that it can be very computationally expensive if you're working with hundred-thousand-dimensional feature vectors, right? I mean, even by today's standards, a hundred thousand, yeah, it's not the biggest I've seen, I've actually seen bigger, but by today's standards a hundred thousand features is actually quite a lot, and even if it's, say, just a thousand, this is a large number of
[00:32:59] features, I guess. And the problem with using these is that it's quite computationally expensive to carry around these hundred-thousand-dimensional, or imagine hundred-million-dimensional, feature vectors or whatever; but that's what you would do if you were to swap in phi of x for x in the naive, straightforward way. What we'll see is that if you can compute K of x comma z, then, because you've written your whole algorithm just in terms of inner products, you don't ever need to explicitly compute phi of x; you can always just compute these kernels. [00:33:48] We'll get to that later; I won't go over some of the kernels that will be talked about on Wednesday. [00:33:56] Yeah, so, I think the no free lunch theorem is a fascinating theorem, or concept, but I think it's been, I don't know, less useful actually, because I think we have inductive biases that turn out to be useful.
there's a famous theorem in [00:34:10] there's a there's a famous theorem in learning theory called no free lunch was [00:34:12] learning theory called no free lunch was like 20 years ago dad basically says [00:34:14] like 20 years ago dad basically says that in the worst case learning [00:34:16] that in the worst case learning algorithms do not work for any learning [00:34:19] algorithms do not work for any learning algorithm I can come up with some data [00:34:20] algorithm I can come up with some data distribution so that your learning [00:34:21] distribution so that your learning algorithm stops that that's roughly the [00:34:23] algorithm stops that that's roughly the no free lunch to ever improved about [00:34:24] no free lunch to ever improved about like 20 years ago but it turns out most [00:34:26] like 20 years ago but it turns out most of the world most the time the universe [00:34:27] of the world most the time the universe is not that hostile to all that so so [00:34:30] is not that hostile to all that so so yeah that's the learning I was turned [00:34:32] yeah that's the learning I was turned out okay all right let's go through one [00:34:41] out okay all right let's go through one example of kernels so for this example [00:34:44] example of kernels so for this example let's say that your original input [00:34:46] let's say that your original input features was three dimensional X 1 X 2 X [00:34:49] features was three dimensional X 1 X 2 X 3 and let's say I'm gonna choose the [00:34:52] 3 and let's say I'm gonna choose the feature mapping Phi of X to be all so [00:34:56] feature mapping Phi of X to be all so pairwise monomial terms so I'm gonna [00:34:59] pairwise monomial terms so I'm gonna choose X 1 times X 1 X 1 X 2 X 1 X 3 X 2 [00:35:05] choose X 1 times X 1 X 1 X 2 X 1 X 3 X 2 X 1 okay and there are a couple [00:35:16] X 1 okay and there are a couple duplicates that X 1 X 3 is equal to X 3 [00:35:18] duplicates that X 1 X 3 is equal to X 3 X 1 
[00:35:20] But I'm going to write it out this way anyway. And so notice that if x is in R^n, then phi of x is in R^{n squared}, right? So the three-dimensional features become nine-dimensional, and I'm using small numbers for illustration; in practice, think of x as a thousand-dimensional, and so this is now a million; well, think of this as maybe ten thousand, and this is now like a hundred million, okay? So n-squared features, this is much bigger. And then, similarly, phi of z is going to be z1 z1, z1 z2, and so on. [00:36:10] So we've gone from n features, like 10,000 features, to n-squared features, in this case a hundred million features. Um, so because there are n-squared elements, you would need order n-squared time to compute phi of x, or to compute phi of x transpose phi of z explicitly, right? Say we want to compute the inner product between phi of x and phi of z, and we do it explicitly in the obvious way; that'll take n-squared
time, [00:36:54] just to compute all of these products and then, you know, add them all up, right? And it's actually n-squared over 2, because a lot of these things are duplicated, but that's order n squared. [00:37:15] But let's see if we can find a better way to do that. So what we want is to write out the kernel K of x comma z, so this phi of x transpose phi of z, and what I'm going to prove is that this can be computed as x transpose z, squared. And the cool thing is that, remember, x is n-dimensional and z is n-dimensional, so x transpose z squared is an order-n-time computation, right? Because taking x transpose z, you know, that's just an inner product of two n-dimensional vectors, and then you take that number (x transpose z is a real number) and you just square it. So that's an order-n-time computation. [00:38:08] And so let me just prove
that x transpose z squared is equal to this. [00:38:12] Well, let me prove this step, right? So x transpose z, squared, that's equal to, right, so this is x transpose z, and then times, this is also x transpose z; so this formula says x transpose z squared is x transpose z times itself. And then, if I rearrange the sums, this is equal to sum from i equals 1 through n, sum from j equals 1 through n, of x_i z_i x_j z_j, and this in turn is, you know, sum over i, sum over j, of x_i x_j times z_i z_j. [00:39:15] And so what this is doing is marching through all possible pairs of i and j and multiplying x_i x_j with the corresponding z_i z_j and adding that all up. But of course, if you were to compute phi of x transpose phi of z, what you would do is take this and multiply it with that and then add it to the sum, then take this and multiply it with that and add it to the sum, and so on, until you end up taking this and multiplying it with that and adding
it to your sum, right? [00:39:55] So that's why this formula is just, you know, marching down these two lists and multiplying, multiplying, multiplying, and adding it all up, [00:40:12] which is exactly phi of x transpose phi of z, okay? So this proves that you've turned what was previously an order-n-squared-time calculation into an order-n-time calculation, which means that if n was 10,000, then instead of needing to manipulate hundred-million-dimensional vectors to come up with these results (is my phone buzzing? It's really loud, okay), instead of needing to manipulate hundred-million-dimensional vectors, you could do so while manipulating only 10,000-dimensional vectors. [00:40:58] Now, a few other examples of kernels. [00:41:11] It turns out that if you choose this kernel, so let's see, we had K of x comma z equals x transpose z, squared, and we now add a plus c there, where c is a constant; so c is just some fixed real number.
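The identity just proved, that the inner product of phi(x) and phi(z) equals (x transpose z) squared for the pairwise-product map, is easy to confirm numerically; this is a sketch with made-up vectors:

```python
import numpy as np

# Check that the O(n^2) explicit computation and the O(n) kernel
# K(x, z) = (x^T z)^2 agree, for phi(x) = all pairwise products.
def phi(x):
    return np.outer(x, x).ravel()   # n^2-dimensional

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])

slow = phi(x) @ phi(z)   # explicit inner product in R^{n^2}
fast = (x @ z) ** 2      # just an n-dimensional dot product, squared
print(slow, fast)        # both 20.25
```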
That corresponds to modifying your features as follows: instead of just these, you know, pairwise terms - pairs of these things - if you add plus C there, it corresponds to adding x1, x2, x3 to your set of features. [00:41:48] Technically there's actually a weighting on this: it's actually root 2C x1, root 2C x2, root 2C x3, and then a constant C here as well - you can prove this yourself. And it turns out that if this is your new definition for Phi of X, and you make the same change to Phi of Z - you know, it's root 2C Z1 and so on - then if you take the inner product of these, it can be computed as this, right? [00:42:14] And so the role of the constant C is that it trades off the relative weighting between the second-degree terms, the xi xj, compared to the first-degree terms like x1 or x2 or x3. [00:42:32] Other examples: if you choose this to the power of d, notice that this is still an order n time computation.
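The weighted feature map just described - pairwise terms x_i x_j, first-degree terms scaled by root 2C, plus a constant C - can be verified the same way (illustrative sketch):

```python
import numpy as np

def phi_c(x, c):
    # Feature map for K(x, z) = (x^T z + c)^2: all pairwise products
    # x_i x_j, the first-degree terms scaled by sqrt(2c), plus a
    # constant feature c (the constants' product contributes c^2).
    return np.concatenate([np.outer(x, x).ravel(),
                           np.sqrt(2 * c) * x,
                           [c]])

rng = np.random.default_rng(0)
x, z, c = rng.standard_normal(4), rng.standard_normal(4), 2.5

assert np.isclose(phi_c(x, c) @ phi_c(z, c), (x @ z + c) ** 2)
```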
Right - X transpose Z takes order n time; you add a number to it; then you take this to the power of d; so you can compute this in order n time. But this corresponds now to a Phi of X whose number of terms turns out to be n plus d choose d - but it doesn't matter - it turns out this contains all features of monomials up to order d. [00:43:17] By which I mean: if, let's say, d is equal to five, right, then Phi of X contains all the features of the form x1 x2 x5 x17 x29 - right, this is a fifth-degree thing - or x1 x2 squared x3 x18 - this is also a fifth-order monomial. [00:43:46] And so if you choose this as your kernel, this corresponds to constructing Phi of X to contain all of these features, and there are exponentially many of them, right - all of these features, in any order. Although these
are called monomials - basically all the polynomial terms, all the monomial terms, up to a fifth-order monomial term. And it turns out there are n plus d choose d of them, which is roughly n plus d to the power of d - very roughly - so this is a very, very large number of features, but your computation doesn't blow up exponentially even as d increases. [00:44:26] So what the support vector machine is, is taking the optimal margin classifier that we derived earlier and applying the kernel trick to it, which we had already derived. So: the optimal margin classifier [00:44:49] plus the kernel trick, right - that is the support vector machine. [00:44:58] And so if you choose some of these kernels, for example, then you could run an SVM in these very, very high-dimensional feature spaces - in these, you know, hundred-trillion-dimensional feature spaces - but your computational time scales only linearly in order n, the dimension of your input features X.
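The n plus d choose d count grows enormously even for modest d, while evaluating the kernel stays order n. A quick check with Python's exact binomial (illustrative):

```python
from math import comb

n = 10_000  # input dimension
for d in (2, 3, 5):
    # Number of monomials of degree <= d in n variables.
    print(f"d={d}: {comb(n + d, d):,} features")

# Evaluating K(x, z) = (x^T z + c)^d itself still costs only
# O(n) arithmetic, no matter how large this feature count gets.
```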
Rather than as a function of this hundred-trillion-dimensional feature space, in which you're actually building a linear classifier. [00:45:25] So, um, why is this a good idea? Let me just now show a quick video to give you an intuition for what this is doing. Let's see - okay, I think the projector takes a while to warm up. Any questions while we're waiting? [00:46:00] [Student question about whether each kernel function corresponds to one feature mapping] Yes - so, up to trivial differences, right: if you have a feature mapping where the features that come out are permuted or something, then the kernel function stays the same, so there are trivial transformations like that. But if you have a totally different feature mapping, you would expect to need a totally different kernel function. [00:46:38] So - let's see - oh, cool. So I want to give you a visual picture. [00:46:52] I've wiped this. [00:47:09] All
right, this is a YouTube video that Kian Katanforoosh, who teaches with us, found - so I don't know who originally made it - but there's a nice visualization of what a support vector machine is doing. [00:47:22] So let's say you have a learning algorithm where you're trying to separate the blue dots from the red dots, right? So the blue and the red dots can't be separated by a straight line, but you put them on the plane and you use a feature mapping Phi to throw these points into a much higher-dimensional space - so it's now throwing these points into a three-dimensional space. In this three-dimensional space you can then find W - so W is now three-dimensional - because you apply the optimal margin classifier in this three-dimensional space, and that separates the blue dots and the red dots. [00:47:58] And if you now, you know, examine what this is doing back in the original
space, then your linear classifier actually defines that elliptical decision boundary there, right? [00:48:11] So you're taking the data - all right, so: taking the data, mapping it to a much higher-dimensional feature space (three dimensions in this visualization, but in practice it can be a hundred trillion dimensions), and then finding a linear decision boundary in that hundred-trillion-dimensional space - which is going to be a hyperplane, like a, you know, straight line or a plane - and then, when you look at what you just did in the original feature space, you find a very nonlinear decision boundary. So this is why. [00:48:44] And again, you know, you can only visualize relatively low-dimensional feature spaces, even on a display like that, but you find that if you use an SVM kernel, [00:49:06] all right, you can learn very nonlinear decision boundaries like that - but that is a linear decision boundary in a very high-dimensional space.
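The lift in the video can be reproduced with one common choice of phi - appending the squared radius as a third coordinate (an illustrative sketch; the video's exact mapping may differ):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 200
theta = rng.uniform(0, 2 * np.pi, m)

# Blue dots inside radius 1, red dots in a ring of radius 2 to 3:
# no straight line in the 2D plane separates them.
blue = (rng.uniform(0, 1, m) * np.array([np.cos(theta), np.sin(theta)])).T
red = (rng.uniform(2, 3, m) * np.array([np.cos(theta), np.sin(theta)])).T

def lift(p):
    # phi(x1, x2) = (x1, x2, x1^2 + x2^2): append the squared radius.
    return np.c_[p, (p ** 2).sum(axis=1)]

# In 3D the flat plane x3 = 2 separates the classes; projected back
# down to 2D, that plane is the (nonlinear) circle x1^2 + x2^2 = 2.
assert lift(blue)[:, 2].max() < 2 < lift(red)[:, 2].min()
```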
And when you project it back down to, you know, 2D, you end up with a very nonlinear decision boundary. Okay, all right. [00:49:36] Oh, sure, yes - [Student question: in the high-dimensional space represented by the feature mapping Phi of X, does the data always have to be linearly separable?] So far I've been pretending that it does; I'm coming back to fix that assumption later today. [00:49:54] So, um, now: how do you make kernels, right? So here's some intuition you might have about kernels. If X and Z are similar - you know, if these two examples X and Z are close to each other, or similar to each other - then K of X, Z, which is the inner product between Phi of X and Phi of Z, right, presumably this should be large. And conversely, if X and Z are dissimilar, then K of X, Z, you know, maybe this should be smaller, right? Because, uh, the inner product of two very similar vectors that are pointing in
the same direction should be large, and the inner product of two dissimilar vectors should be small, right? So this is one guiding principle behind, you know, a lot of the kernels you see: if this is Phi of X and this is Phi of Z, the inner product is large; but if they kind of point off in random directions, the inner product will be small, right? That's how vector inner products work. [00:51:19] And so, well, let me just pull a function out of thin air, which is: K of X, Z equals e to the negative norm of X minus Z, squared, over 2 sigma squared, right? [00:51:38] So this is one example - if you think of kernels as a similarity measure of sorts - let's just make up another similarity-measure function. And this does have the property that if X and Z are very close to each other, then this would be e to the 0, which is about 1.
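This Gaussian function is easy to sanity-check as a similarity measure (illustrative sketch):

```python
import numpy as np

def k_gauss(x, z, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

x = np.array([1.0, 2.0])
assert np.isclose(k_gauss(x, x), 1.0)   # identical inputs: e^0 = 1
assert k_gauss(x, x + 10.0) < 1e-20     # far-apart inputs: nearly 0
```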
But if X and Z are very far apart, then this would be small, right? So this function actually satisfies those criteria, and the question is: is it okay to use this as a kernel function? [00:52:15] So it turns out that a function like that, K of X, Z - you can use it as a kernel function only if there exists some Phi such that K of X, Z equals Phi of X transpose Phi of Z, right? [00:52:41] So we derived the whole algorithm assuming this to be true, and it turns out that if you plug in a kernel function for which, you know, this isn't true, then all of the derivation we wrote down breaks down, and the optimization problem, you know, can have very strange solutions, right, that don't correspond to good classification - the whole thing just falls apart. [00:53:00] And so this puts some constraints on what kernel functions we could choose. For example, one thing it must satisfy is K of X comma X, which is Phi of X
transpose Phi of X: this had better be greater than or equal to 0, right? Because the inner product of a vector with itself had better be non-negative. So if K of X, X is ever less than 0, then this is not a valid kernel function, okay? Um, [00:53:28] more generally, there's a theorem that tells you when something is a valid kernel. Let me just outline that proof very briefly, which is: let x1 up to xd, you know, be any d points. And - let's see - okay, sorry about the overloading of notation: so K represents a kernel function, and I'm going to use K to represent a kernel matrix as well. [00:54:10] Sometimes it's also called the Gram matrix, but I'll call it the kernel matrix. So K ij is equal to the kernel function applied to two of those points, xi and xj, right? So you have d points; so you just apply the kernel function to every pair of those points and put them in a matrix - in a big d-by-d matrix like
that. [00:54:37] So it turns out that, given any vector z - I think you've seen something similar to this in problem set one - but given any vector z, z transpose K z, which is: sum over i, sum over j, of z_i K_ij z_j, right? If K is a valid kernel function - so if there is some feature mapping Phi - then this should equal sum over i, sum over j, of z_i Phi(x^(i)) transpose Phi(x^(j)) z_j. [00:55:32] And by a couple of other steps - let's see, this Phi(x^(i)) transpose Phi(x^(j)), I'm going to expand out that inner product - so it's sum over k of element k of Phi(x^(i)) times element k of Phi(x^(j)), with the z_i and z_j multiplied in, and then rearranging sums - actually, sorry, I'm running out of whiteboard. [00:56:14] Rearranging sums, okay: sum over k of sum over i, sum over j, of z_i Phi(x^(i))_k Phi(x^(j))_k z_j, which is sum over k of, in parentheses, sum over i of z_i Phi(x^(i))_k, squared - and therefore this must be greater than or equal to 0.
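Written out cleanly, the chain of equalities in this argument is:

```latex
\begin{aligned}
z^{\top} K z
&= \sum_i \sum_j z_i\, K_{ij}\, z_j
 = \sum_i \sum_j z_i\, \phi(x^{(i)})^{\top} \phi(x^{(j)})\, z_j \\
&= \sum_i \sum_j \sum_k z_i\, \phi(x^{(i)})_k\, \phi(x^{(j)})_k\, z_j
 = \sum_k \Big( \sum_i z_i\, \phi(x^{(i)})_k \Big)^{2} \;\ge\; 0 .
\end{aligned}
```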
And so this proves that the matrix K - [00:57:00] the kernel matrix K - is positive semi-definite, okay? And so, more generally, it turns out that this is also a sufficient condition for a function K to be a valid kernel function. So let me just write this out - this is called Mercer's theorem. [00:57:32] K is a valid kernel function - i.e., there exists Phi such that K of X, Z equals Phi of X transpose Phi of Z - if and only if, for any d points x1 up to xd, [00:58:11] the corresponding kernel matrix is positive semi-definite; so let's write this: K, greater than or equal to, 0. And I proved just one direction of - one direction of this implication, right? This proof outline here shows that if it is a valid kernel function, then the kernel matrix is positive semi-definite. So the algebra we did just now proves that direction of the proof.
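Mercer's theorem suggests a practical numerical test: build the kernel matrix on a handful of points and check that its eigenvalues are non-negative. An illustrative sketch (the "bad" similarity function is just one example of a non-kernel):

```python
import numpy as np

rng = np.random.default_rng(0)
pts = rng.standard_normal((30, 5))

def gram(k, pts):
    # Kernel (Gram) matrix: K_ij = k(x_i, x_j) over all pairs of points.
    return np.array([[k(a, b) for b in pts] for a in pts])

gauss = lambda x, z: np.exp(-np.sum((x - z) ** 2) / 2)
assert np.linalg.eigvalsh(gram(gauss, pts)).min() > -1e-8  # PSD: passes

# Negative distance satisfies K(x, x) = 0 but is NOT a valid kernel:
# its kernel matrix has trace 0, so it must have a negative eigenvalue.
bad = lambda x, z: -np.linalg.norm(x - z)
assert np.linalg.eigvalsh(gram(bad, pts)).min() < -1e-8
```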
improve the reverse dimension but [00:58:49] did improve the reverse dimension but this turns out to be a if and only if [00:58:50] this turns out to be a if and only if condition and so this gives maybe one [00:58:53] condition and so this gives maybe one test for whether or not something is a [00:58:56] test for whether or not something is a valid kernel function okay and it turns [00:59:01] valid kernel function okay and it turns out that the kernel I wrote up there [00:59:04] out that the kernel I wrote up there that one K of X C it turns out this is a [00:59:15] that one K of X C it turns out this is a valid kernel this is called the Gaussian [00:59:16] valid kernel this is called the Gaussian kernel this is a probably the most [00:59:22] kernel this is a probably the most widely use kernel [00:59:24] widely use kernel well actually did what [00:59:36] well actually the most widely-used [00:59:38] well actually the most widely-used kernels is maybe the linear kernel which [00:59:44] kernels is maybe the linear kernel which just uses K of X Z equals X transpose Z [00:59:50] just uses K of X Z equals X transpose Z and so this is using you know Phi of x [00:59:53] and so this is using you know Phi of x equals x right so no no no high [00:59:55] equals x right so no no no high dimensional features so sometimes you [00:59:57] dimensional features so sometimes you call it the linear kernel it just means [00:59:58] call it the linear kernel it just means you're not using a high dimensional [01:00:00] you're not using a high dimensional feature mapping or the future mapping is [01:00:01] feature mapping or the future mapping is just equal to the original features this [01:00:04] just equal to the original features this is this is actually pretty commonly used [01:00:06] is this is actually pretty commonly used kernel function you're not taking [01:00:09] kernel function you're not taking advantage of kernels in other words but [01:00:11] advantage of kernels in other 
words. But after the linear kernel, the Gaussian [01:00:13] kernel is probably the most widely used kernel - the one I wrote out there - and this corresponds to a feature space that is infinite-dimensional, right? [01:00:30] And this Gaussian kernel function actually corresponds to using all monomial features: so if you have, you know, x1, and also x1 x2, and x1 squared x2, and then x1 squared x5 to the 10th, and so on, up to, you know, x1 to the 10,000 and x2 to the 17th - right, whatever. [01:00:50] So this kernel corresponds to using all these polynomial features, without end, going to arbitrarily high dimensions - but giving a smaller weighting to the very, very high-dimensional ones, which is why it works. [01:01:15] Great. So, uh, for now - and then toward the end, I'll give some other examples of kernels. [01:01:20] So it turns out that the kernel trick is more general than the support vector machine. It was really popularized by the support vector machine.
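The "all monomials with decaying weights" description of the Gaussian kernel can be seen from a Taylor expansion: since ||x - z||^2 = ||x||^2 + ||z||^2 - 2 x^T z, the kernel factors as e^(-||x||^2 / 2 sigma^2) e^(-||z||^2 / 2 sigma^2) times the series sum over k of (x^T z / sigma^2)^k / k!, where each (x^T z)^k term is a degree-k polynomial kernel damped by 1/k!. A numerical check (illustrative):

```python
import numpy as np
from math import factorial

x = np.array([0.3, -0.5])
z = np.array([0.2, 0.4])
sigma = 1.0

exact = np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

# The series of exp(x^T z / sigma^2): a factorially damped sum of
# polynomial kernels (x^T z)^k, i.e. of monomial feature products.
series = sum((x @ z / sigma ** 2) ** k / factorial(k) for k in range(20))
approx = (np.exp(-x @ x / (2 * sigma ** 2))
          * np.exp(-z @ z / (2 * sigma ** 2)) * series)

assert np.isclose(exact, approx)
```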
You know, researchers - I guess Vapnik and Cortes - found that applying this kernel trick to the support vector machine makes for a very effective learning algorithm. [01:01:42] But the kernel trick is actually more general: if you have any learning algorithm that you can write in terms of inner products like this, then you can apply the kernel trick to it. And so you'll apply this to a different learning algorithm in the programming assignments as well. [01:01:58] And the way to apply the kernel trick is: take a learning algorithm, write the whole thing in terms of inner products, and then replace them with K of X, Z, for some appropriately chosen kernel function K. [01:02:12] And all of the discriminative learning algorithms we've learned so far can be written in this way, so they can apply the kernel trick: so linear regression, logistic regression,
everything in the generalized linear model family, the perceptron algorithm - all of those algorithms, you can actually apply the kernel trick to. Which means that you can apply linear regression in an infinite-dimensional feature space if you wish. [01:02:38] And later in this class we'll talk about principal components analysis, which some of you may have heard of; and when I talk about principal components analysis, it turns out that's yet another algorithm that can be written only in terms of inner products, and so there's an algorithm called kernel PCA - kernel principal components analysis. If you don't know what PCA is, don't worry about it; we'll get to it later. [01:02:54] But a lot of algorithms can be married with the kernel trick to implicitly apply the algorithm even in an infinite-dimensional feature space, without needing your computer to have infinite amounts of memory or to use an infinite amount of computation.
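As one concrete instance of kernelizing another algorithm, here is a minimal kernelized perceptron sketch (an illustration of the idea, not the version from the programming assignments): the weight vector is kept implicitly as a sum of alpha_i y_i phi(x_i), so training and prediction only ever evaluate the kernel.

```python
import numpy as np

def kernel_perceptron(X, y, k, epochs=10):
    # w is represented implicitly as sum_i alpha_i * y_i * phi(x_i),
    # so w . phi(x) = sum_i alpha_i * y_i * k(x_i, x): kernels only.
    m = len(X)
    alpha = np.zeros(m)
    K = np.array([[k(a, b) for b in X] for a in X])
    for _ in range(epochs):
        for i in range(m):
            if y[i] * np.sum(alpha * y * K[:, i]) <= 0:
                alpha[i] += 1  # mistake: fold this example into w
    return alpha

# XOR is not linearly separable in 2D, but the degree-2 polynomial
# kernel (whose phi includes the cross-term x1*x2) handles it.
X = np.array([[1., 1.], [-1., -1.], [1., -1.], [-1., 1.]])
y = np.array([1., 1., -1., -1.])
poly2 = lambda a, b: (a @ b + 1.0) ** 2
alpha = kernel_perceptron(X, y, poly2)
preds = np.sign([np.sum(alpha * y * [poly2(a, x) for a in X]) for x in X])
assert np.all(preds == y)
```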
But actually, the single place where this is most powerfully applied is the support vector machine. In practice, I guess, the kernel trick is applied all the time in support vector machines, and not that often in other algorithms. [01:03:28] All right. [01:03:37] Are there any questions? [01:03:53] All right, so the last two things I want to do today: um, one is fix the assumption that we had made that the data is linearly separable. [01:04:04] So, you know, sometimes you don't want your learning algorithm to have zero errors on the training set, right? You know, so when you take this low-dimensional data and map it to a very high-dimensional feature space, the data does become much more separable. But it turns out that if your data set is noisy, right - if your data looks like this - you maybe want it to find a decision boundary like that, and you don't want it to try so hard to separate every little example, right, as to
find a really complicated decision boundary like that. [01:04:59] Right — so sometimes, whether in the low-dimensional space or in the high-dimensional space, you don't actually want the algorithm to separate out your data perfectly. And sometimes, even in a high-dimensional feature space, your data may not be linearly separable, and you don't want the algorithm to have zero error on the training set. [01:05:20] And so there's an algorithm called the l1 norm soft margin SVM, which is a modification to the basic algorithm. The basic algorithm was: minimize this over w and b, subject to these constraints. [01:05:53] And what the l1 norm soft margin does is the following. Remember, this is the functional margin — if you divide it by the norm of w, it becomes the geometric margin — so this optimization problem was saying: let's make sure each example has functional margin greater than or equal to 1.
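The board contents aren't captured by the auto-transcript; in the standard CS229 notation, the two margins and the basic (hard-margin) optimal margin problem being referred to are:

```latex
% functional margin of example i, and its geometric counterpart
\hat{\gamma}^{(i)} = y^{(i)}\big(w^\top x^{(i)} + b\big),
\qquad
\gamma^{(i)} = \frac{\hat{\gamma}^{(i)}}{\|w\|}

% basic (hard-margin) optimal margin classifier
\min_{w,\,b}\ \tfrac{1}{2}\|w\|^2
\quad\text{s.t.}\quad y^{(i)}\big(w^\top x^{(i)} + b\big) \ge 1,
\quad i = 1,\dots,m
```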
[01:06:24] In the l1 norm soft margin SVM we're going to relax this: we're going to say that this needs to be bigger than 1 − ξi — that's the Greek letter xi — and then we're going to modify the cost function as follows, where these ξi are greater than or equal to 0. [01:06:48] So remember, if the functional margin is greater than or equal to zero, it means the algorithm has classified that example correctly: so long as this quantity is greater than zero, y(i) and this term have the same sign, either both positive or both negative — that's what it means for a product of two things to be greater than zero; both have to have the same sign. [01:07:16] So if this is bigger than zero, it means it has classified that example correctly, and the SVM is asking for it to not just classify correctly, but to classify correctly with a functional margin of at least 1.
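Written out, the relaxed problem — the l1 norm soft margin SVM — in standard CS229 notation (reconstructed, since the board isn't transcribed):

```latex
\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i
\quad\text{s.t.}\quad
y^{(i)}\big(w^\top x^{(i)} + b\big) \ge 1 - \xi_i,
\qquad \xi_i \ge 0,\quad i = 1,\dots,m
```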
[01:07:29] And if you allow ξi to be positive, then that's relaxing that constraint. But you don't want the ξi's to be too big, which is why you add to the optimization cost function a cost for making ξi too great. And so you optimize this as a function of w, b, and the ξi's. [01:08:00] If you draw a picture, it turns out that in this example, with that being the optimal decision boundary, these three examples will be equidistant from the straight line — because if they weren't, you could fiddle with the straight line to improve the margin a little bit more. It turns out these three examples have functional margin exactly equal to 1, that example over there has functional margin equal to 2, and the further-away examples have even bigger functional margins, and what
this optimization objective is saying is that that's okay. [01:08:39] Everything out here has functional margin at least 1; if an example here has functional margin a little bit less than 1, then by setting ξi to 0.5, say, the optimization is letting me get away with a functional margin a little less than 1. [01:09:05] And one other reason why you might want to use the l1 norm soft margin SVM is the following. Say you have a data set that looks like this. [01:09:20] It seems like that would be a pretty good decision boundary — we've got a lot of examples, so there's a lot of evidence — but if you have just one outlier, say over here, then technically the data set is still linearly separable. [01:09:44] If you really want to separate this data set — sorry, I seem to be killing these pens — if you want to separate out this
data set, you can actually choose that decision boundary. [01:10:01] But the basic optimal margin classifier will allow the presence of one training example to cause this dramatic swing in the position of the decision boundary, because the original optimal margin classifier optimizes for the worst-case margin. The concept of optimizing for the worst-case margin allows one example — by being the worst-case training example — to have a huge impact on your decision boundary. [01:10:27] The l1 norm soft margin SVM allows the SVM to keep the decision boundary closer to the blue line even with this one outlier, and so it makes it much more robust to outliers. [01:10:45] And then, if you go through the representer theorem derivation — representing w as a function of the alphas and so on — it turns out that the problem then simplifies to the following.
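The dual problem being written out — identical to the hard-margin dual except for the extra upper bound on the alphas discussed below — is, in standard CS229 notation:

```latex
\max_{\alpha}\ \sum_{i=1}^{m}\alpha_i
  - \tfrac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}
      y^{(i)} y^{(j)}\,\alpha_i\,\alpha_j\,
      \big\langle x^{(i)}, x^{(j)} \big\rangle
\quad\text{s.t.}\quad
0 \le \alpha_i \le C,
\qquad \sum_{i=1}^{m} \alpha_i\, y^{(i)} = 0
```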
[01:11:08] So what I'm writing out here, after the whole representer theorem calculation, the derivation — this is just what we had previously; I've not changed anything so far, this is exactly what we had. [01:11:40] And it turns out that the only change is that we end up with an additional condition on the alphas. If you go through that simplification, now that you've changed the algorithm with this extra term, then in the new form — this is called the dual form of the optimization problem — the only change is that you end up with this additional condition: the constraint that each alpha is between 0 and C. [01:12:15] And it turns out that today there are very good software packages for just solving that for you. I think once upon a time, when we were doing machine learning, you needed to worry about whether your code for inverting matrices was good enough.
[01:12:28] When code for inverting matrices was less mature, that was one more thing you had to think about; but today, linear algebra packages have gotten good enough that when you invert a matrix, it just inverts the matrix, and you don't have to worry too much about it. [01:12:43] So in the early days of SVMs, solving this problem was really hard and you did worry about your optimization packages; today there are very good numerical optimization packages that just solve this problem for you, and you can call them without worrying about the details too much. [01:12:55] All right — so this is the l1 norm soft margin SVM. This parameter C is something you need to choose; we'll talk on Wednesday about how to choose it, but it trades off how much you want to insist on getting the training examples right versus saying it's okay to misclassify an example once in a while.
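That trade-off is easy to see in a toy sketch. At the optimum the slack is ξi = max(0, 1 − y(i)(wᵀx(i) + b)), so the soft-margin primal is equivalent to minimizing ½‖w‖² + C times a sum of hinge losses, which plain subgradient descent can handle. Everything below — the data, the function name, the step size, the iteration count — is made up for illustration (real SVM solvers work on the dual), but it shows a small C shrugging off a single outlier:

```python
import numpy as np

def train_soft_margin(X, y, C=0.1, lr=0.01, iters=2000):
    """Minimize 0.5*||w||^2 + C * sum_i max(0, 1 - y_i*(w.x_i + b)),
    the hinge-loss form of the l1 soft margin, by subgradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(iters):
        margins = y * (X @ w + b)
        active = margins < 1.0  # examples violating the functional margin
        grad_w = w - C * (y[active][:, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# six clean, well-separated points plus one mislabeled outlier (made-up data)
X = np.array([[ 2.0,  0.0], [ 3.0, 1.0], [ 2.5, -0.5],
              [-2.0,  0.0], [-3.0, 1.0], [-2.5, -0.5],
              [ 4.0,  0.0]])
y = np.array([1, 1, 1, -1, -1, -1, -1])

w, b = train_soft_margin(X, y, C=0.1)
# with a small C the boundary stays near x1 = 0, so the six clean
# points are still classified correctly despite the outlier at (4, 0)
clean_preds = np.sign(X[:6] @ w + b)
```

With a much larger C, the same sketch insists on the outlier and swings the boundary — which is exactly the behavior the soft margin is designed to damp.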
[01:13:21] On Wednesday we'll discuss bias and variance, and how to choose a parameter like C. [01:13:32] All right, so the last thing I would like you to see today is really just a few examples of SVM kernels. It turns out the SVM with a polynomial kernel works quite well — this is K(x, z) = (x transpose z) to the power d; that's called the polynomial kernel — and this one is called the Gaussian kernel, and of these the most widely used one is the Gaussian kernel. [01:14:05] And it turns out — I guess in the early days of SVMs, one of the proof points was this: the machine learning field was doing a lot of work on handwritten digit classification. A digit is a matrix of pixels with values that are 0 or 1, or maybe grayscale values.
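In the usual notation these two kernels are K(x, z) = (xᵀz + c)^d (polynomial; c = 0 gives the plain form) and K(x, z) = exp(−‖x − z‖² / (2σ²)) (Gaussian, also called RBF). A minimal numpy sketch — the function names and default constants are mine, not from the lecture:

```python
import numpy as np

def polynomial_kernel(x, z, d=3, c=1.0):
    """K(x, z) = (x^T z + c)^d -- the polynomial kernel.
    With c = 0 this is the plain (x^T z)^d form."""
    return (np.dot(x, z) + c) ** d

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2)) -- the Gaussian (RBF)
    kernel, which corresponds to an infinite-dimensional feature map."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))
```

Either function can be plugged in wherever the inner product ⟨x, z⟩ appears in the dual, which is the whole point of the kernel trick.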
[01:14:25] Say you take the list of pixel intensity values and list them out — so there's 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, and so on, all the pixel intensity values — then this can be your feature x, and if you feed it to an SVM using either of these kernels, it'll do not too badly at handwritten digit classification. [01:14:50] There's a classic data set called MNIST, which is a classic benchmark in the history of machine learning, and it was a very surprising result many years ago that a support vector machine with a kernel does very well on handwritten digit classification. [01:15:08] In the past several years we've found that deep learning algorithms, mostly convolutional neural networks, do even better than the SVM; but for some time SVMs were the best algorithm, and they're very easy to use — turnkey, without a lot of parameters to fiddle with — so that's the one very nice
property about them. [01:15:26] But more generally, a lot of the most innovative work in SVMs has been in the design of kernels. So here's one example: let's say you want a protein sequence classifier. [01:15:52] Protein sequences are made up of amino acids — a lot of our bodies are made of proteins, and proteins are just sequences of amino acids, and there are 20 amino acids. But in order to simplify the description, and really not worry too much about the biology — I hope the biologists don't get mad at me — I'm going to pretend there are 26 amino acids, because there are 26 letters in the alphabet. [01:16:12] So I'm going to use the letters A through Z to denote amino acids, even though I know there are really only 20; it's just easier to talk about with 26 letters. And so a protein is a sequence of letters, because the protein in your body is made up of the
sequence of amino acids and amino [01:16:36] the sequence of amino acids and amino acids can be very variable 9 something [01:16:39] acids can be very variable 9 something very very long so if you're very short [01:16:40] very very long so if you're very short so the question is how do you represent [01:16:46] the feature X [01:16:50] so it turns out and so the goal is to be [01:16:53] so it turns out and so the goal is to be the input X and make a prediction about [01:16:57] the input X and make a prediction about this particular protein like what is the [01:16:59] this particular protein like what is the function of this protein right and so [01:17:02] function of this protein right and so well here's one way to design a feature [01:17:04] well here's one way to design a feature vector which is uh I'm going to list out [01:17:07] vector which is uh I'm going to list out all combinations of four amino acids you [01:17:14] all combinations of four amino acids you can tell this will take a while right go [01:17:17] can tell this will take a while right go down to a a a Z and then a a B a and so [01:17:23] down to a a a Z and then a a B a and so on and eventually you know there'll be a [01:17:26] on and eventually you know there'll be a be a JT TST a down to zzzzz right and [01:17:33] be a JT TST a down to zzzzz right and then I'm going to construct five x [01:17:37] according to the number of times I see [01:17:39] according to the number of times I see the sequence in the amino acid so for [01:17:42] the sequence in the amino acid so for example being a JT appears twice so I'm [01:17:47] example being a JT appears twice so I'm gonna put two there you know TST a [01:17:52] gonna put two there you know TST a whatever right a PS ones so I'm for the [01:17:56] whatever right a PS ones so I'm for the one there and there are no a is no ABS [01:17:58] one there and there are no a is no ABS no you see okay so this is a 20 to the [01:18:04] no you see okay so this is a 20 
[01:18:04] So this is a 26-to-the-4th-dimensional feature vector — with the real 20 amino acids, 20 to the 4th is 160,000 — so it's very high-dimensional and quite expensive to compute. [01:18:19] But it turns out that, using dynamic programming, given two amino acid sequences you can compute φ(x) transpose φ(z) — that is, K(x, z) — directly; there's a dynamic programming algorithm for doing this, and the details aren't important for our purposes. If any of you have taken an advanced algorithms course and learned about the Knuth-Morris-Pratt algorithm, this is quite similar to that. (Don Knuth was a Stanford professor — a professor emeritus here.) [01:18:52] And using this, you actually get a pretty decent algorithm for taking a sequence of, say, amino acids and training
training a supervised learning algorithm to make [01:19:06] a supervised learning algorithm to make a clock binary classification on [01:19:08] a clock binary classification on University premises so as your PI [01:19:11] University premises so as your PI support vector machines one of the [01:19:12] support vector machines one of the things you see is that depending on the [01:19:14] things you see is that depending on the input data you have there can be [01:19:16] input data you have there can be innovative kernels to use in order to [01:19:19] innovative kernels to use in order to measure the similarity of two amino acid [01:19:22] measure the similarity of two amino acid sequences or the similarity of two of [01:19:24] sequences or the similarity of two of whatever else and then to use that to [01:19:27] whatever else and then to use that to buy the classifier even on very strange [01:19:30] buy the classifier even on very strange shaped object which you know do not come [01:19:32] shaped object which you know do not come as a feature okay so and I think [01:19:39] as a feature okay so and I think actually another example or if the input [01:19:41] actually another example or if the input X is a histogram you know maybe you have [01:19:43] X is a histogram you know maybe you have two different countries your histograms [01:19:45] two different countries your histograms of people's demographic because it turns [01:19:47] of people's demographic because it turns out that there is a kernel that taking [01:19:50] out that there is a kernel that taking the min of the two histograms and then [01:19:51] the min of the two histograms and then summing up to compute a kernel function [01:19:53] summing up to compute a kernel function that inputs two histograms it measures [01:19:55] that inputs two histograms it measures how similar they are so there many [01:19:56] how similar they are so there many different kernel functions for many [01:19:58] different kernel 
[01:19:58] So there are many different kernel functions for the many different unique types of inputs you might want to process. Okay — so that's it for SVMs, a very useful algorithm. What we'll do on Wednesday is continue with more advice on how to use all of these learning algorithms; we'll talk about bias and variance to give you more advice on how to actually apply them. So that's great, and I look forward to seeing you on Wednesday.
================================================================================
LECTURE 008
================================================================================
Lecture 8 - Data Splits, Models & Cross-Validation | Stanford CS229: Machine Learning (Autumn 2018)
Source: https://www.youtube.com/watch?v=rjbkWSTjHzM
---
Transcript
[00:00:03] Hey guys, let's get started. So over the last several weeks you've learned a lot about many different learning algorithms: from linear regression — so that's regression — to generalized linear models, to generative algorithms like GDA and Naive Bayes, and most recently support vector machines. [00:00:22] What I'd like to do today is to start talking about advice for applying learning
algorithms — a foundational bit of the theory behind how to make good decisions about what to do, how to actually apply these algorithms. [00:00:39] So today I want to discuss bias and variance. It turns out — you know, I've built quite a lot of machine learning systems — it turns out that bias and variance is one of those concepts that's easy to understand but hard to master. You know how lots of board games, or sometimes smartphone games, say they're "easy to learn, hard to master," or something like that? [00:01:01] Bias and variance is exactly one of those things. I've had PhD students who worked with me for several years, then graduated and worked in industry for a couple of years after that, and they actually tell me that when they took machine learning at Stanford they learned bias and variance, but as they progressed over many years, their
understanding of bias and variance continued to deepen. [00:01:24] So I'm going to try to accelerate your learning of bias and variance, because I find that people who understand these concepts are much more efficient in how they develop learning algorithms and make them really work. So let's talk about this today — and it'll be a recurring theme that comes up a few times in the next several weeks as well. [00:01:46] Then we'll discuss regularization and talk about how to reduce variance in learning algorithms, talk about train/dev/test sets, and then also talk about model selection and cross-validation algorithms. [00:02:04] Oh, and a few reminders for today: problem set one is due tonight, 11:59 p.m., and if you are not yet ready to submit it today, late submissions are accepted until Saturday evening — Saturday, 11:59 p.m. —
with the details of late submissions governed by the late-day policy written on the course website. So definitely do submit your homework on time today; if for some reason you're not able to, there's the late submission option — which we don't encourage anyone to take advantage of, but it is written on the course website. [00:02:40] And problem set two will be released shortly — actually, I think it was already posted online — and it's due two weeks from now. [00:02:56] What I'm going to do today is talk about the conceptual aspects of this, and if you want to see even more of the math behind these concepts, this Friday's discussion section will be covering some of the mathematical aspects of learning theory, such as error decomposition, uniform convergence, and VC dimension. [00:03:16] You know, one interesting thing I've learned watching the evolution of machine learning over many years is that
[00:03:20] machine learning as a discipline has [00:03:22] become less mathematical over the years. [00:03:25] So I remember when, you know, machine [00:03:30] learning people used to worry about [00:03:32] computing the normal equations, where [00:03:34] theta equals (X transpose X) inverse times X transpose y, and [00:03:35] how numerically stable your [00:03:37] numerical solver for solving the normal [00:03:39] equations is when inverting a matrix and [00:03:41] solving linear equations. But because [00:03:44] numerical linear algebra has made [00:03:46] tremendous strides, now we just call [00:03:48] a linear algebra routine [00:03:50] to invert the matrix or solve the linear [00:03:51] equations, and do not worry about what [00:03:54] is numerically stable or not. But once [00:03:56] upon a time a lot of my friends in [00:03:58] machine learning were reading textbooks [00:04:00] on numerical optimization to figure out [00:04:02] whether your, you know, formula for inverting a [00:04:05] matrix or for solving a system of [00:04:07] equations was numerically stable. And so [00:04:09] one of the trends I have
seen is that, I [00:04:12] think, you know, three or four years ago, [00:04:15] to understand bias and variance there was [00:04:17] a certain mathematical theory that was [00:04:19] crucial to understand, [00:04:20] and so I used to teach that in CS 229. [00:04:23] But, we're constantly trying [00:04:26] to improve this class, right, and I [00:04:28] decided that that mathematical theory is [00:04:31] actually less crucial today if your main [00:04:33] goal is to make these algorithms work. So we [00:04:35] still teach it, but we're doing it in the [00:04:37] Friday discussion section, and that means [00:04:38] more time for the main lecture here to [00:04:41] talk more about the conceptual things I [00:04:42] think will help you build learning [00:04:44] algorithms, as well as for the newer [00:04:45] topics; we'll talk about [00:04:47] decision trees, random [00:04:49] forests, and neural networks. So, okay, [00:04:54] let's dive into bias and variance. Um, let's [00:04:58] say you have this data set, right? I'm [00:05:08] gonna draw the same data set three times. [00:05:17]
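As an aside on that earlier point about just calling a linear algebra routine: the contrast between the textbook normal equations and a library least-squares solver might be sketched like this (a toy example with made-up data, not from the lecture):

```python
import numpy as np

# Hypothetical design matrix and targets for a tiny least-squares problem.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(0, 10, 50)])  # intercept + one feature
y = 3.0 + 2.0 * X[:, 1] + rng.normal(0, 0.5, 50)

# Textbook normal equations: theta = (X^T X)^{-1} X^T y.
# Forming the explicit inverse is the numerically fragile route.
theta_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# The modern route: hand the problem to a linear algebra routine.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta_normal)  # both should land close to the true [3, 2] here
print(theta_lstsq)
```

On a well-conditioned problem like this toy one the two agree to machine precision; the point of the routine is that it stays stable on badly conditioned problems where the explicit inverse does not.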
um, let's say you have a housing price prediction problem where this is [00:05:26] the size of the house and this is the price [00:05:27] of the house. Um, you know, it looks like [00:05:31] if you fit a straight line to this data, [00:05:34] maybe it's not too bad, right? But it [00:05:40] looks like this data set seems to go [00:05:42] up and then curve downward a little bit, [00:05:44] right? And so maybe this is a slightly [00:05:47] better model. So this is if you fit a linear function, [00:05:52] theta 0 plus theta 1 x. But if you fit a [00:05:57] quadratic model, maybe that's actually [00:05:59] visually a little bit better. Or you [00:06:07] could actually fit a high-order [00:06:08] polynomial: this is one, two, three, four, [00:06:10] five, six examples, so if you fit a [00:06:12] fifth-order polynomial, say up to theta 5 x [00:06:20] to the fifth, then, you know, you can [00:06:24] actually fit a function that passes [00:06:25] through all the points perfectly, but [00:06:28] that doesn't seem like a great model for [00:06:31] this data. [00:06:32] And so, um, to name these phenomena: the [00:06:37] function,
assuming, you know, the one in [00:06:39] the middle, is what we'd like; fitting a [00:06:43] quadratic function is maybe pretty [00:06:45] good, so that's called just right. [00:06:46] However, this example on the left [00:06:52] underfits the data, as it's not [00:06:59] capturing the trend that is maybe somewhat [00:07:03] evident in the data, and we say this [00:07:05] algorithm has high bias. And the term [00:07:09] bias, you know, the term bias has [00:07:13] actually multiple meanings in the [00:07:14] English language. We as a society want to [00:07:17] avoid, you know, racial bias and gender [00:07:20] bias and discrimination against people's [00:07:22] orientation and things like that. So the [00:07:25] term bias in machine learning has a [00:07:26] completely separate meaning, and it just [00:07:29] means [00:07:31] that, um, this learning algorithm had very [00:07:35] strong preconceptions that the data could be [00:07:38] fit by a linear function. This algorithm [00:07:40] has a very strong bias, a very strong [00:07:42] preconception, that the relationship [00:07:44] between price and
the size of the house [00:07:45] is linear, and this bias turns out not to [00:07:48] be true. Okay, so this is a different [00:07:50] sense of bias [00:07:52] than the other type of undesirable bias [00:07:54] that we want to avoid as a society, which, [00:07:56] interestingly, comes up in machine [00:07:58] learning as well; in other contexts we [00:08:00] want our learning algorithms to avoid [00:08:02] those biases too. So there are different uses of [00:08:03] the term. And in contrast, for this curve [00:08:07] on the right, we say that this is [00:08:09] overfitting the data, and this algorithm [00:08:16] has high variance. And the term high [00:08:20] variance comes from this intuition that [00:08:22] you happened to get these six examples, [00:08:26] but if, you know, a friend of yours were to [00:08:29] collect [00:08:32] a slightly different [00:08:36] set of six examples, right, if a friend [00:08:38] of yours were to rerun this and collect a [00:08:40] slightly different set of housing prices, [00:08:44] you know, right, then this algorithm will
[00:08:49] fit some totally different, wildly varying function [00:08:52] on this, and so your predictions will [00:08:55] have very high variance, if you think of [00:08:57] this as varying over different random [00:08:58] draws of the data. So the variation is: [00:09:02] if a friend of yours does the same [00:09:04] experiment with a slightly different data [00:09:06] set, just due to random noise, then this [00:09:08] algorithm, fitting a fifth-order [00:09:09] polynomial, results in a totally [00:09:11] different result. So we say [00:09:14] that this algorithm has very high variance; [00:09:16] there's a lot of variability in the [00:09:18] predictions this algorithm will make. Um, [00:09:21] so one of the things we'll need to do is [00:09:24] identify this in your learning algorithm. So, [00:09:29] when we train a learning algorithm, it [00:09:30] almost never works the first time, right? [00:09:32] And so when I'm developing learning [00:09:34] algorithms, my standard workflow is often [00:09:37] to train an algorithm, often train [00:09:40] something quick and dirty, and then try [00:09:42] to understand
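The under-fit / just-right / overfit picture can be reproduced numerically; here is a sketch with six made-up points standing in for the housing data on the board:

```python
import numpy as np

# Six made-up (size, price) points that rise and then level off,
# standing in for the housing sketch in the lecture.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.0, 2.2, 2.8, 3.4, 3.5, 3.4])

train_err = {}
for degree in (1, 2, 5):
    coeffs = np.polyfit(x, y, degree)            # least-squares polynomial fit
    train_err[degree] = float(np.mean((y - np.polyval(coeffs, x)) ** 2))
    print(degree, train_err[degree])

# The degree-5 polynomial passes through all six points (training error ~0),
# yet is the worst model of the underlying trend: high variance. The straight
# line leaves visible structure unexplained: high bias. Degree 2 is "just right".
```

Note that training error alone rewards the overfit model; that is exactly why the dev/test-set ideas mentioned at the start of the lecture are needed.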
if the algorithm has a [00:09:44] problem of high bias or high variance, if [00:09:46] it's underfitting or overfitting the data, [00:09:48] and I use that insight to decide how to [00:09:51] improve the learning algorithm. Okay, and [00:09:53] I will say a lot more about how to [00:09:55] improve the learning algorithm; we'll have [00:09:57] a menu of tools that we'll talk about in [00:09:59] the next couple weeks for how to [00:10:01] reduce bias or reduce variance in [00:10:05] your learning algorithms. Um, I just [00:10:07] mentioned that the problems of bias and [00:10:11] variance also hold true for [00:10:15] classification problems. [00:10:27] So let's say this is a binary [00:10:29] classification problem. If you fit a [00:10:33] logistic regression model to this, you [00:10:37] know, a straight-line fit to the data, maybe [00:10:40] that's not great. [00:10:42] If you fit a logistic regression model [00:10:45] with a few nonlinear features, so you [00:10:49] have features x1 and x2, and instead of using x1 [00:10:53] and x2 as features you use additional [00:10:55] features x1 squared, x2 squared, x1 times [00:10:58] x2, x1 cubed, and so on; this is a phi of x,
right? [00:11:03] And you can have a small set of features [00:11:05] you choose by hand, usually quite a few [00:11:07] more features than this, or use an SVM [00:11:10] kernel and use an SVM for this problem. [00:11:12] Then, let's see, if you have too [00:11:17] many features, then you might actually [00:11:19] have a learning algorithm that fits a [00:11:21] decision boundary here that looks like [00:11:23] that, right? And this learning algorithm [00:11:29] actually gets perfect performance on the [00:11:31] training set, but this overfits. [00:11:35] Excuse me, I meant to make the colors [00:11:38] consistent; sorry, I meant to use red. [00:11:40] Thank you, you get what I mean. [00:11:42] Um, and it's only if you choose [00:11:45] somewhere in between, you know, that you [00:11:49] get something that seems to be a [00:11:51] much better fit to the data; the green [00:11:53] line seems to be a pretty good way of [00:11:55] separating positive and negative [00:11:56] examples, so that's sort of just right. [00:11:58] So, similar to, I guess I messed up the [00:12:01] colors here, well, kind of, but
similar to these colors here, the blue line [00:12:05] underfits because it's not capturing trends [00:12:06] that are pretty apparent in the data, the [00:12:09] orange line overfits, it's a much too [00:12:11] complicated hypothesis, whereas the green [00:12:13] line is just right. [00:12:17] So it turns out that in the era of, you [00:12:25] know, GPU computing and the ability to train [00:12:27] models with a lot of features, you can overfit by [00:12:31] building a big enough model. So take a [00:12:34] support vector machine: if you add enough [00:12:36] features to it, if you have a high enough, you [00:12:38] know, dimensional feature space, or if you [00:12:41] take a linear regression model or logistic [00:12:44] regression model and just add enough [00:12:45] features to it, you can often overfit the [00:12:48] data. And it turns out that one of the [00:12:53] most effective ways to prevent [00:12:55] overfitting is regularization. So let me [00:12:59] describe what that is and how you [00:13:08] can use it in today's lecture. So, [00:13:28] regularization is, it'll be one of [00:13:31] those techniques that won't take that [00:13:34] long to explain;
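To make the feature-expansion idea above concrete, here is a small sketch (the feature set and the parameter values are invented for illustration): a model that is linear in phi(x) can still have a curved decision boundary in the original (x1, x2) space.

```python
import numpy as np

def expand_features(x):
    """Hand-chosen nonlinear features in the spirit of the lecture:
    intercept, x1, x2, x1^2, x2^2, x1*x2 (the exact set is up to you)."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 ** 2, x2 ** 2, x1 * x2])

# Illustrative, made-up parameter vector: theta^T phi(x) > 0 exactly when
# x1^2 + x2^2 < 1, so the decision boundary is the unit circle, something
# no straight line in the raw (x1, x2) space can represent.
theta = np.array([1.0, 0.0, 0.0, -1.0, -1.0, 0.0])

def predict(x):
    return 1 if theta @ expand_features(x) > 0 else 0

print(predict((0.2, 0.3)))  # inside the circle: predicts 1
print(predict((2.0, 2.0)))  # outside the circle: predicts 0
```

Adding many more such terms makes the representable boundaries more and more complicated, which is exactly how this kind of model overfits.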
it'll sound deceptively [00:13:36] simple, but it's one of the techniques [00:13:38] that I use most often; I feel like I [00:13:41] use regularization in many, many models. [00:13:43] So just because it doesn't sound [00:13:45] complicated, or maybe won't [00:13:47] even take that long to explain today, [00:13:48] don't underestimate how widely used it is. [00:13:50] It's not used in every [00:13:53] single machine learning model, but it's [00:13:54] used very, very often. So here's the idea: [00:14:09] here is, let's say, linear regression, right? [00:14:27] So that's the optimization objective for [00:14:30] linear regression. If you want to add [00:14:33] regularization, you just add one extra [00:14:38] term here: lambda times the norm of theta [00:14:45] squared, right? Sometimes you write lambda [00:14:48] over two to make some of the derivations [00:14:50] come out easier. And what this does is it [00:14:54] takes your cost function for linear [00:14:56] regression, which you try to minimize, [00:14:58] minimizing the squared error fit [00:15:00] to the data, and you are adding an [00:15:03] incentive term
for the algorithm to make [00:15:06] the parameters theta smaller, okay? So [00:15:09] this is called the regularization term. [00:15:16] And it turns out that, um, let's take the [00:15:21] linear regression overfitting example. So, [00:15:28] you know, if you set lambda equals zero, [00:15:30] then it's just linear regression with [00:15:32] the fifth-order polynomial features. It [00:15:36] turns out that as you increase lambda, [00:15:39] to some intermediate value, [00:15:41] depending on the scale of the data, let's [00:15:43] say you set lambda equals one, then when you [00:15:46] solve this minimization problem, [00:15:48] this augmented problem, for the value of [00:15:50] theta, this term penalizes the parameters for [00:15:54] being too big, and it turns out that you [00:15:57] end up with a fit that looks a little [00:16:02] bit better, right? Maybe it looks like [00:16:03] that, [00:16:04] okay? And by preventing the parameters [00:16:08] theta from being too big, you're making it [00:16:10] harder for the learning algorithm to [00:16:12] overfit the data. It turns out fitting a [00:16:15] very
high-order polynomial like that may [00:16:18] result in values of theta that are very [00:16:18] large, right? And then, if you set [00:16:23] lambda to be too large, then you actually [00:16:27] end up in an underfitting regime, okay? [00:16:32] So there will usually be some optimal value [00:16:35] of lambda. When lambda equals zero, [00:16:37] you're not using any regularization, [00:16:39] so it may be overfitting. If lambda is [00:16:42] way too big, then you're forcing all the [00:16:45] parameters to be too close to zero; in [00:16:48] fact, think about it: if lambda [00:16:50] equals, you know, 10 to the 100 or some [00:16:52] ridiculously large number, then you are [00:16:55] really forcing all the thetas to be 0, [00:16:57] right? And if all the thetas are 0, then, you [00:17:00] know, you're kind of fitting this [00:17:02] straight line, right? So that's if lambda [00:17:04] equals 10 to the 100, and this is [00:17:07] a very simple function, which is the [00:17:10] function 0, right, this function h [00:17:12] theta of x equals 0, approximately [00:17:16] 0. This is a very simple function, which [00:17:19] you get if you set lambda
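The effect of dialing lambda can be sketched in closed form: minimizing the squared error plus lambda times the norm of theta squared gives theta = (X^T X + lambda I)^{-1} X^T y (ridge regression; the toy data and the feature scaling below are my own choices, not from the lecture):

```python
import numpy as np

# Toy data again: six points, expanded to fifth-order polynomial features.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.0, 2.2, 2.8, 3.4, 3.5, 3.4])
X = np.vander(x / 6.0, N=6, increasing=True)   # columns 1, x, x^2, ..., x^5 (scaled for conditioning)

def ridge_fit(X, y, lam):
    """Closed-form minimizer of ||X theta - y||^2 + lam * ||theta||^2."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

norms = {lam: float(np.linalg.norm(ridge_fit(X, y, lam))) for lam in (0.0, 1.0, 1e6)}
print(norms)

# lam = 0:   plain fifth-order fit; large parameters, interpolates the points (overfits).
# lam = 1:   intermediate; the parameters shrink and the fit smooths out.
# lam = 1e6: the thetas are driven toward 0, so h(x) is roughly the zero function (underfits).
```

The parameter norm shrinks monotonically as lambda grows, which is the dial between the much-too-complex interpolating fit and the much-too-simple h(x) = 0.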
to be very large, [00:17:21] and by dialing lambda between, you know, a [00:17:24] far-too-large value like 10 to the 100 [00:17:27] and a far-too-small value like [00:17:29] 0, you move smoothly [00:17:31] between this much-too-simple function, [00:17:34] h equals 0, and a much-too-complex [00:17:36] function, okay? So that's [00:17:46] pretty much it [00:17:51] for regularization in terms of what you [00:17:53] need to implement: if you think your [00:17:54] learning algorithm may be overfitting, add [00:17:57] this to your model and solve this [00:18:00] optimization problem, and it will help [00:18:03] relieve overfitting. More generally, if [00:18:08] you have, [00:18:13] let's see, more generally, if you have, [00:18:19] say, a logistic regression problem where [00:18:22] this is your cost function, then to add [00:18:34] regularization, I guess instead of min [00:18:37] this is a max, right? If you're applying [00:18:39] logistic regression, this was the [00:18:41] original cost function, and then you can [00:18:44] subtract a lambda term here, or lambda over 2, there's
just a different scaling of lambda, times [00:18:51] the norm of theta squared. And there's a [00:18:53] minus here because with logistic regression [00:18:54] we're maximizing rather than minimizing; [00:18:56] this could be, by the way, any of [00:18:58] the generalized linear model family as [00:18:59] well. But by subtracting lambda times the [00:19:03] norm of theta squared, this allows you to [00:19:04] also regularize a classification [00:19:06] algorithm such as logistic regression, okay? [00:19:10] Um, it turns out that, and I want to [00:19:16] make an analogy here where all the [00:19:19] math details are true, but we don't want [00:19:22] to talk through all the math details: it [00:19:23] turns out that one of the reasons the [00:19:26] support vector machine doesn't overfit [00:19:28] too badly, even though it, you know, [00:19:30] can be working in an infinite, like, you [00:19:33] know, infinite-dimensional feature space, [00:19:34] right? So why doesn't a support vector [00:19:37] machine just overfit like crazy? We [00:19:39] showed on Monday that by using kernels [00:19:42] it's sort
of using an infinite-dimensional [00:19:44] feature space, right? So why doesn't it [00:19:47] always fit these crazy complicated [00:19:49] functions and just overfit the data like [00:19:51] crazy? It turns out, and the theory is [00:19:53] complicated, it turns out that, you know, [00:19:59] the optimization objective of the [00:20:00] support vector machine was to minimize [00:20:02] the norm of w squared, and this turns out to [00:20:05] correspond to maximizing the margin, the [00:20:08] margin of the SVM. And it's [00:20:10] actually possible to prove that this has [00:20:13] a similar effect [00:20:14] as that, right? This is why the [00:20:17] support vector machine, despite working in an [00:20:18] infinite-dimensional feature space, [00:20:20] by forcing the parameters to [00:20:23] be small, makes it difficult for the support [00:20:25] vector machine to overfit the data too [00:20:28] much, okay? The theory to actually show [00:20:29] this is quite complicated; you [00:20:35] can actually show that the class of [00:20:37] classifiers where this, the norm of w, [00:20:39]
is small, cannot be too complicated, [00:20:43] cannot overfit, basically. But that's why, [00:20:46] as we said, you can work in an [00:20:48] infinite-dimensional feature space. Yeah? [00:20:58] Oh, sure; the question is, do we ever regularize per [00:21:01] element, per parameter? Um, not really, [00:21:04] and the problem with that is, you know, [00:21:06] let me give one more specific [00:21:08] example and then come back to that, right? So [00:21:13] it turns out that, um, so we talked about [00:21:15] naive Bayes as a text classification [00:21:20] algorithm. It turns out that, let's [00:21:21] see, for a classification [00:21:23] problem, say classifying spam and [00:21:26] non-spam, or classifying the sentiment, [00:21:27] positive or negative, of a [00:21:29] tweet or something, let's say you have a hundred examples [00:21:33] but you have ten-thousand-dimensional features, [00:21:35] right? So [00:21:37] let's say your features are, you [00:21:40] know, taken from the dictionary: a, aardvark, and [00:21:43] so on, and it's a one, zero, one, right? So [00:21:46] you construct your feature vectors. It [00:21:49] turns
out that if you fit logistic [00:21:51] regression to this type of data, where [00:21:53] you have 10,000 parameters and a hundred [00:21:54] examples, [00:21:56] this will probably badly [00:22:01] overfit the data. But it [00:22:03] turns out that if you use logistic [00:22:06] regression with regularization, this is [00:22:07] actually a pretty good algorithm for [00:22:10] text classification. And it will usually, [00:22:13] in terms of accuracy, you [00:22:14] know, because this is logistic regression [00:22:16] you need to implement gradient descent or so [00:22:18] to solve for good parameter values, but [00:22:20] logistic regression with regularization [00:22:23] for text classification will usually [00:22:26] outperform naive Bayes from a [00:22:28] classification-accuracy standpoint. Without regularization, logistic regression [00:22:31] will badly overfit this data. And to [00:22:35] explain a bit more, you know, imagine [00:22:38] that you have a three-dimensional [00:22:41] space where you have two examples; [00:22:44] then all you can do is fit a straight
line right for the hyperplane to [00:22:49] line right for the hyperplane to separate these two examples but so one [00:22:51] separate these two examples but so one rule of thumb for logistic regression is [00:22:55] rule of thumb for logistic regression is that if you do not use regularization [00:22:57] that if you do not use regularization it's nice if the number of examples is [00:23:00] it's nice if the number of examples is at least on the order of the number of [00:23:02] at least on the order of the number of parameters you want to fit right so this [00:23:04] parameters you want to fit right so this is if you're not using regularization [00:23:05] is if you're not using regularization it's nice if in fact I personally think [00:23:08] it's nice if in fact I personally think that I tend to use the jurors and the [00:23:10] that I tend to use the jurors and the only of the number of examples can be [00:23:12] only of the number of examples can be maybe 10x bigger than the number of [00:23:15] maybe 10x bigger than the number of examples because that's what you need to [00:23:17] examples because that's what you need to have enough information to fit good [00:23:19] have enough information to fit good choices all these parameters but that's [00:23:22] choices all these parameters but that's a good not using regularization but if [00:23:24] a good not using regularization but if you are using regularization then you [00:23:27] you are using regularization then you can fit you know even 10,000 parameters [00:23:30] can fit you know even 10,000 parameters right even with only 100 examples and [00:23:32] right even with only 100 examples and this will be a pretty decent text [00:23:35] this will be a pretty decent text classification out um the question you [00:23:39] classification out um the question you had just now why don't we regularize per [00:23:41] had just now why don't we regularize per parameter right so why don't we [00:23:44] parameter right so why 
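As a concrete illustration of the regime being described—this is an editorial sketch, not code from the lecture—here is L2-regularized logistic regression trained by plain gradient descent on synthetic data with 10,000 binary "word" features and only 100 examples. All names, the data, and the hyperparameter values are made up for illustration.

```python
import numpy as np

def train_logreg(X, y, lam=1.0, lr=0.5, iters=300):
    """Gradient descent on the L2-regularized logistic loss:
    (1/m) * sum of log-losses + (lam / (2m)) * ||theta||^2."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        # sigmoid h_theta(x); clip the logits for numerical stability
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ theta, -30.0, 30.0)))
        grad = X.T @ (p - y) / m + (lam / m) * theta  # penalty gradient
        theta -= lr * grad
    return theta

# Text-classification-like regime: far more parameters than examples.
rng = np.random.default_rng(0)
m, n = 100, 10_000
X = (rng.random((m, n)) < 0.05).astype(float)  # sparse 0/1 word indicators
w_true = rng.normal(size=n)
y = (X @ w_true > 0).astype(float)             # synthetic labels

theta = train_logreg(X, y)
train_acc = float(np.mean(((X @ theta) > 0) == y))
```

Without the `lam * theta` term this would be unpenalized maximum likelihood, which in this 10,000-parameters/100-examples setting is exactly the overfitting case discussed above.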
[00:23:39] Um, the question you had just now: why don't we regularize per parameter? Let's see—so I guess instead of lambda times the norm of theta squared, it would be a sum over j of lambda_j times theta_j squared, right? [00:23:55] The reason we don't do this is that if you have 10,000 parameters here, you end up with another 10,000 parameters—the lambda_j's—over here, and choosing all these 10,000 lambdas is as difficult as choosing all those parameters in the first place. So we don't have a good way to do this, whereas when we talk about cross-validation and model selection in a little bit, we'll talk about how to choose maybe one parameter lambda; but those techniques won't work for choosing, you know, 10,000 parameters. [00:24:38] Let's see—right, yes? Thank you. Um, yes, so in order to make sure that the different features are on a similar scale, a common pre-processing step we use in learning algorithms is to rescale the different features. So for text classification, if all
the features are zero-one, you can just leave the features alone. But for housing prediction, if feature x1 is the size of the house, which, I guess, ranges from—how big are the biggest houses? No, whatever—let's say houses go from 500 square feet to 10,000 square feet; a 10,000-square-foot house is really, really big. [00:25:11] But then feature x2 is the number of bedrooms, which probably ranges from, like—oh, I wonder; I guess some houses have a ton of bedrooms, but I think most houses have at most 5 bedrooms, right? Then these features are on very different scales, and normalizing them to all be on a similar scale—so subtract out the mean and divide by the standard deviation, to scale all of these features to be between, you know, 0 and 1, or between minus 1 and 1—would be a good pre-processing step before applying these methods.
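The normalization just described—per feature, subtract the mean and divide by the standard deviation—can be sketched in a few lines. This is an illustration, not lecture code; the two housing columns are made-up numbers.

```python
import numpy as np

def standardize(X):
    """Scale each column (feature) to zero mean and unit standard
    deviation. Also returns the statistics, so the identical scaling
    can be applied to new examples at prediction time."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)  # leave constant features alone
    return (X - mu) / sigma, mu, sigma

# Feature 0: size in square feet (500-10,000); feature 1: bedrooms (1-5).
X = np.array([[500.0, 1.0],
              [2400.0, 3.0],
              [10000.0, 5.0]])
X_scaled, mu, sigma = standardize(X)
```

Returning `mu` and `sigma` matters in practice: test examples must be scaled with the training set's statistics, not their own.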
[00:25:43] It turns out that this will also make gradient descent run faster, so it's a common pre-processing step to scale each individual feature to be on a similar range of values. [00:26:10] So let me repeat the question—it was actually two questions: why don't support vector machines overfit too badly? Is it because there's a small number of support vectors, or is it because of minimizing the penalty on w? Um, I would say the formal argument relies more on the latter. It turns out that if you look at the class of functions that separate the data with a large margin, that class has low complexity, formalized by low VC dimension, which you'll learn about in Friday's discussion section if you want to come to that. And so it turns out that the class of functions that separate the data with a large margin is a relatively simple class of functions—and by a simple class of functions, I mean one with low VC dimension, which we'll talk about this Friday. And so any
function within that class of functions is not too likely to overfit. So it is convenient that the support vector machine ends up with a relatively small number of support vectors, but you could imagine other algorithms with a very large number of support vectors; as long as the margin is large, the argument still goes through. [00:27:24] Oh, sure, yes—so, is it possible, though—yes. So, in general, models that have high bias tend to underfit, and models with high variance tend to overfit. We use the terms overfit and high variance, and underfit and high bias, almost interchangeably—they have very similar meanings, but they don't quite mean the same thing. One thing we'll see later, about two weeks from now, is that we'll talk about algorithms with high bias and high variance at the same time. [00:28:06] Actually, one way to think of high bias and high variance together: imagine your data set looks like this, and somehow your classifier has very high complexity—it's a very, very complicated function—but for some reason it's still not fitting your data well, right. That would be one way to have high bias and high variance, which does happen. [00:28:47] All right. So, to wrap up the discussion on regularization: mechanically, the way you implement regularization is by adding that penalty on the norm of the parameters—so that's what you actually implement. It turns out that there's another way to think about regularization. You remember when we talked about linear regression, we talked about minimizing squared error, and then later on we saw that linear regression was maximum likelihood estimation in a certain generalized linear model, using a Gaussian distribution as the output distribution—the Gaussian is a member of the
exponential family. [00:29:31] It turns out that a similar point of view can be taken on the regularization algorithm that we just saw. So let's say S is the training set. [00:29:53] Given a training set, you want to find the most likely value of theta, right? And so by Bayes' rule, P(theta | S) = P(S | theta) P(theta) / P(S). And if you want to pick the value of theta that's the most likely value given the data you saw, then, because the denominator is just a constant, this is argmax over theta of P(S | theta) P(theta). [00:30:41] And so if you're using logistic regression, then the first term is this likelihood, and the second term is P(theta), where the first is, you know, the logistic regression model's likelihood—or that of any generalized linear model. [00:31:15] And it turns out that if you assume P(theta) is Gaussian—so if we assume the prior probability on theta is Gaussian with mean zero and some covariance tau squared times the identity, tau^2 I; in other words, P(theta) = 1 / ((2 pi)^(n/2) |tau^2 I|^(1/2)) * exp(-(1/2) theta^T (tau^2 I)^(-1) theta), the usual Gaussian density—it turns out that if this is your prior distribution for theta, and you plug this in here, and you take logs, compute the max, and so on, then you end up with exactly the regularization technique that we found just now, okay? [00:32:21] And so, in everything we've been doing so far, we've been taking a frequentist interpretation. I guess the two main schools of statistics are the frequentist school of statistics and the Bayesian school of statistics, and there used to be, sort of, titanic academic debates about which is the right one, but I think the statisticians have gotten together and kind of made peace.
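For reference, the equivalence described a moment ago—a Gaussian prior plus taking logs recovers the L2 penalty—can be written out as follows. This is an editorial reconstruction of the board work, assuming the prior $\theta \sim \mathcal{N}(0, \tau^2 I)$; constants that do not depend on $\theta$ drop out of the argmax.

```latex
\theta_{\mathrm{MAP}}
  = \arg\max_{\theta}\; p(S \mid \theta)\, p(\theta)
  = \arg\max_{\theta}\; \Big[ \log p(S \mid \theta)
      - \tfrac{1}{2\tau^{2}} \,\lVert\theta\rVert^{2} \Big]
  = \arg\min_{\theta}\; \Big[ -\log p(S \mid \theta)
      + \lambda \,\lVert\theta\rVert^{2} \Big],
\qquad \lambda = \tfrac{1}{2\tau^{2}} .
```

So the regularization weight corresponds inversely to the prior variance: a tighter prior (small $\tau$) means heavier regularization.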
[00:32:53] And people go between these two schools more and more these days—well, maybe not all the time. In the frequentist school of statistics, we say that there is some data, and we want to find the value of theta that makes the data as likely as possible—and that's where we got maximum likelihood estimation, all right? And in the frequentist school of statistics, we view there as being some true value of theta out in the world that is unknown: there is some true value of theta that generated all these housing prices, and our goal is to estimate this true parameter. [00:33:31] In the Bayesian school of statistics, we say that theta is unknown, but before you see any data, you already have some prior beliefs about how housing prices are generated out in the world, and your prior beliefs are captured in a prior distribution, denoted by P(theta). So this is called a Gaussian prior, and— [00:33:56] if you look at this Gaussian prior—excuse me—it's quite reasonable. You're saying that before you've seen any data, on average, I think the parameters theta have mean zero, because I don't know if each theta is positive or negative, so giving them mean zero seems reasonable. And I've just assumed that my prior on theta is Gaussian—you know, we could debate whether this is the right assumption, but it's not totally unreasonable, right? [00:34:22] You could say, well, actually, for the next linear regression problem I'm going to work on next week—and I have no idea what I'm going to apply linear regression to next week—it's actually not too bad an assumption to say, you know, my prior is Gaussian. And in the Bayesian view of the world, our goal is to find the value of theta that is most likely after we've seen the data.
Okay. [00:34:55] And so this is called MAP estimation, where MAP stands for maximum a posteriori. So this is actually the MAP estimator—I take the argmax of this, right? That's the MAP, or maximum a posteriori, estimate of theta, which means: look at the data, compute the Bayesian posterior distribution on theta, and pick the value of theta that's most likely, okay? [00:35:27] And so one of the things you do in the problem set that was just released is actually show this equivalence, as well as plug in a different prior: other than the Gaussian prior, you experiment with what happens when P(theta) is the Laplace prior, and derive a different MAP estimator, okay. [00:36:04] Wait, sorry—could you say that again? Oh, I see—yes, the question is the difference between these two. Yes, so maximum likelihood here corresponds to, you know, estimating without regularization, and this MAP procedure here corresponds to having regularization. [00:36:39] It turns out that frequentist statisticians can also use regularization; it's just that they don't try to justify it through imposing a prior. So if you're a frequentist statistician, your job is to wake up and come up with an algorithm to estimate this, you know, true value of theta that's out in the world; you can come up with any procedure you want, and as part of your procedure you can add a regularization term. [00:37:02] I think a lot of these debates between frequentists and Bayesians are more philosophical. As a machine learning person, as an engineer—I don't really, you know—I think the philosophical debates are lovely, but I just like my stuff to work. So we can say frequentists can also end up with regularization; it's just that they say this is part of the algorithm they invented, rather than something derived from a
Bayesian prior. [00:37:23] All right, cool. [00:37:40] So let's talk about—continuing the discussion on regularization and choosing the degree of polynomial—let's see, let's say I plot a chart where, on the horizontal axis, I plot model complexity: how complicated is your model? So, for example, toward the right of this curve could be a very high degree polynomial. [00:38:24] And what you find is that, as you increase model complexity, your training error—if you do not regularize, right, so if you fit a linear function, then a quadratic function, then a cubic function, and so on—you find that the higher the degree of the polynomial, the better your training error, because, you know, a fifth-order polynomial will always fit the training set better than a fourth-order polynomial, if you do not regularize. But what we saw with the original picture was that the generalization error of the algorithm kind of goes down and then starts to go back up, right? [00:39:07] And so if you were to have a separate test set, and evaluate your classifier on a set of data that the algorithm hasn't seen so far—so, measure how well the algorithm generalizes to a different, novel set of data—then if you fit a linear function, this underfits; if you fit the fifth-order polynomial, this overfits; and somewhere in between, right, is just right, okay? [00:39:45] And this curve is true for regularization as well. So say you apply linear regression with 10,000 features to a very small training set. If lambda is much too big, then you will underfit; if lambda is 0, so you're not regularizing at all, then it will overfit; and there will be some intermediate value of lambda—not too big, not too small—that, you know, balances overfitting and underfitting, okay?
[00:40:26] So what I'd like to do next is describe a few different mechanistic procedures for trying to find this point in the middle, right. [00:41:11] Um, so, given a data set, what we'll often do is take your data set and split it into different subsets, and a good hygiene is to take the data and split it into train, dev, and test sets. So say you have 10,000 examples and you're trying to carry out this model selection problem: for example, let's say you're trying to decide what order polynomial you want to fit, right; or you're trying to choose the value of lambda; or you're trying to choose the value of tau, which was the bandwidth parameter in locally weighted regression that you saw on the problem set. [00:42:02] Or you're trying to choose the value of C in a support vector machine—so, remember, the SVM objective was actually this, right, subject to some other things; for the soft margin that we saw on Wednesday—or, no, on Monday—you're trying to minimize the norm of w, and then there was this additional parameter C that trades off how much you insist on classifying every training example perfectly. [00:42:31] So whichever of these decisions you're trying to make—how do you, you know, choose a polynomial degree, or choose lambda, or choose tau, or choose the parameter C, which also has this bias-variance trade-off: there will be some values of C that are too large and some values of C that are too small. [00:43:06] So here's one thing you can do. Let's see—split your training data S into a subset which I'm going to call the real training set, S_train, and some subset which we call S_dev, where dev stands for development; and then later we'll talk about a separate test set. And so what you can do is train each
model I mean um option for the degree of polynomial on s train Soviet evaluating [00:44:01] polynomial on s train Soviet evaluating a menu of models right so let's say this [00:44:03] a menu of models right so let's say this is model 1 model 2 and so on up to model [00:44:08] is model 1 model 2 and so on up to model 5 up to some number they can train each [00:44:10] 5 up to some number they can train each of these models on the first subset of [00:44:14] of these models on the first subset of the data and then get some hypothesis [00:44:20] the data and then get some hypothesis that's called H I and then measure the [00:44:29] that's called H I and then measure the error on s death which is the second [00:44:34] error on s death which is the second subset of data called the development [00:44:35] subset of data called the development set and pick the one [00:44:50] so rather than and and I want to [00:44:53] so rather than and and I want to contrast this with an alternative [00:44:56] contrast this with an alternative procedure right so the two cents of the [00:44:58] procedure right so the two cents of the day two substances they talk about tests [00:45:00] day two substances they talk about tests and data training set and development [00:45:02] and data training set and development sets and after training first of all the [00:45:06] sets and after training first of all the world second apollomon or third all [00:45:08] world second apollomon or third all polynomial on the training set evaluate [00:45:10] polynomial on the training set evaluate all of these different models on the [00:45:11] all of these different models on the separate held up development sets and [00:45:14] separate held up development sets and then pick the one with the lowest error [00:45:15] then pick the one with the lowest error on the development center okay but one [00:45:19] on the development center okay but one thing to not do would be to evaluate all [00:45:21] thing to not do would 
be to evaluate all these algorithms instead on the training [00:45:23] these algorithms instead on the training set and then pick the one with the [00:45:26] set and then pick the one with the lowest error on the training set right [00:45:28] lowest error on the training set right why not what what goes wrong when you do [00:45:29] why not what what goes wrong when you do that [00:45:38] Yeah right you just over fit I were you [00:45:40] Yeah right you just over fit I were you over it [00:45:49] yeah yep cool right so if you use this [00:45:52] yeah yep cool right so if you use this procedure you always end up picking the [00:45:54] procedure you always end up picking the fifth order polynomial right because the [00:45:56] fifth order polynomial right because the more complex our rhythm will always do [00:45:59] more complex our rhythm will always do better on the training set so if you do [00:46:00] better on the training set so if you do this this will always cause you to say [00:46:02] this this will always cause you to say let's use the fifth order polynomial or [00:46:04] let's use the fifth order polynomial or the highest possible order polynomial so [00:46:06] the highest possible order polynomial so this won't help you realize in the [00:46:08] this won't help you realize in the housing price prediction example the [00:46:10] housing price prediction example the second order polynomial is a benefit to [00:46:12] second order polynomial is a benefit to the data and that's why for this [00:46:16] the data and that's why for this procedure if you evaluate your models [00:46:21] procedure if you evaluate your models error or the separate development set [00:46:23] error or the separate development set that the album did not see during [00:46:25] that the album did not see during training this allows you to hopefully [00:46:28] training this allows you to hopefully pick a model that neither overfits no [00:46:30] pick a model that neither overfits no longer fits 
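The holdout procedure just described can be sketched in a few lines. This is a minimal illustration, not code from the course: the quadratic toy data, the 70/30 split, and numpy's `polyfit` standing in for the menu of polynomial models are all assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data, assumed for illustration: quadratic ground truth plus noise,
# loosely mirroring the housing-price example.
x = rng.uniform(0, 3, size=100)
y = 2 + 1.5 * x - 0.4 * x**2 + rng.normal(0, 0.3, size=100)

# Split S into the "real" training set S_train and a dev set S_dev.
x_train, x_dev = x[:70], x[70:]
y_train, y_dev = y[:70], y[70:]

train_errors, dev_errors = {}, {}
for degree in range(1, 6):                          # the menu: model 1 .. model 5
    coeffs = np.polyfit(x_train, y_train, degree)   # fit h_i on S_train only
    for errs, xs, ys in ((train_errors, x_train, y_train),
                         (dev_errors, x_dev, y_dev)):
        errs[degree] = float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

# Training error can only go down as the degree grows, so picking on it
# always selects the most complex model; the dev error is what to minimize.
best_degree = min(dev_errors, key=dev_errors.get)
```

Printing `train_errors` shows the monotone decrease the lecture warns about, while `dev_errors` typically turns back up once the polynomial starts to overfit.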
And in this example, hopefully you'd find that it's the second-order polynomial, the one that's just right in between, that actually does best on your development set. Okay.

[00:46:51] Now, if you are publishing an academic paper on machine learning, then this procedure has looked at the training set as well as the development set, right? It has tuned the parameters to the training set, and it has tuned the decision about the degree of polynomial to the dev set. So if you want to publish a paper and say "my algorithm achieves 90% accuracy on this data set," it's not valid to report the result on the dev set, because the algorithm has already been optimized to that data; in particular, the information about the best degree of polynomial was derived from the dev set, from the development set. So if you're publishing a paper, or you want to report an unbiased result, evaluate the algorithm on a separate test set, S_test, and report that error. If you're publishing a paper, it's good hygiene to report the error on a completely separate test set that you did not in any way, shape, or form look at during the development of your model, during the training procedure or during dev.

[00:48:24] [Student question: is dev versus test really any different?] It depends on the size of the data set. Actually, let me give an example. Let's say you're trying to fit a degree of polynomial and you want to choose it by dev error. So you fit polynomials of each degree, first, second, third, and so on, and after fitting all of these, let's say the squared errors, just using round numbers for illustrative purposes, come out to 10, 5.1, 5.0, 4.9, 7, 10. If you're using the dev error to pick the best hypothesis, to pick the best classifier, you would pick the one that gets you 4.9 squared error. But did you really earn that 4.9 squared error, or did you just get lucky? Because there is some noise, and so maybe all of these actually have error close to 5.0, but some are just a bit higher and some a bit lower, and you got a little bit lucky that on the dev set this one did better. Which is why your dev set error is a biased estimate, right? Whereas if there were a very large test set, maybe the true numbers, your actual expected squared errors, are 10, 5, 5, 5, 7, 10; it's just that because of a little bit of noise you got lucky and reported 4.9. And so this would be a bad thing to do in an academic paper, because what you earned was an error of 5.0; you didn't earn an error of 4.9. You're overfitting a little bit to the dev set: you chose the thing that looked best on the dev set, but your algorithm didn't actually achieve that error, it's just noise. Okay, so reporting the dev error isn't really a valid, unbiased procedure, and it's now in some circles considered good practice to report the test error instead.

[00:50:55] Question. [00:51:27] Yeah, so, as the questioner said, yes, you're right: one of the problems with some of the machine learning benchmarks that people have worked on for a long time is this unavoidable amount of overfitting to the test sets, because everyone has worked for years trying to publish the best numbers on the same test set.
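The "did you earn 4.9 or just get lucky" point can be checked numerically. A small simulation, with the model count and noise level made up for illustration: several models whose true expected error is exactly 5.0, each measured with dev-set noise, where we always report the best-looking number.

```python
import numpy as np

rng = np.random.default_rng(0)

true_error = 5.0      # assume every model's true expected squared error is 5.0
n_models, n_trials = 4, 10_000
noise_sd = 0.1        # dev-set measurement noise, an assumed value

# Each trial: one noisy dev-error estimate per model; report the minimum.
estimates = true_error + noise_sd * rng.standard_normal((n_trials, n_models))
reported = estimates.min(axis=1)

# Each individual estimate is unbiased, but the minimum is biased low:
# on average you "report" noticeably less than the 5.0 you actually earned.
print(round(float(estimates.mean()), 2), round(float(reported.mean()), 2))
```

The more models you compare on the same dev set, the larger this downward bias gets, which is exactly why the reported dev error is not an unbiased estimate.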
So the academic community in machine learning does have some amount of overfitting to the standard benchmarks that people have worked on for a long time, and this is an unfortunate result. When the test set is very, very large, the amount of overfitting is probably smaller, but when the test set is not big enough, this overfitting effect can sometimes cause even research papers to publish results that are probably overfit to the data set. And I think there's actually one standard academic benchmark, the data set called CIFAR, that's quite small, and there's actually a recent research paper analyzing results on CIFAR and arguing that some fraction of the progress that was made was perhaps researchers unintentionally overfitting to this data set. Okay.

[00:52:28] Oh, by the way, one thing I do when I'm building, you know, production machine learning systems, when I'm shipping a product, right, like building a speech recognition system: I just want to make it work, and if I'm not trying to publish a paper and not trying to make some claim, sometimes I don't bother with a test set. That means I sometimes don't know the true error of the system, but I'm very conscious of that: if I don't have a lot of data, sometimes I'll decide to just not have a test set, and it means I just don't try to report a test set number. I can report a dev set number, which I know is biased, and I just don't report a test set number. Don't do this if you're publishing an academic paper; it's not good if you're publishing a paper and making claims to the outside world. But if all you're doing is building a product and not writing a paper, this is actually okay. [00:53:18] Yeah? Yeah. Okay, good.
[00:53:32] Let me get to that. Good. So the next topic on the train/dev/test split is: how do you decide how much data should go into each of these three subsets? Let me tell you the historical perspective and then a modern perspective. Historically, the rule of thumb was: you take your training set S, and one split that you see a lot of people refer to is 70% train, 30% test. That's one common rule of thumb that you just hear a lot, maybe for when you don't have a dev set because you're not doing model selection, because you've already picked the algorithm. Or people use 60% train, 20% dev, 20% test. These are rules of thumb that people use, and they're decent rules of thumb when you don't have a massive data set: if you have a hundred examples, maybe a thousand examples, maybe several thousand examples, I think these rules of thumb are perfectly fine.

[00:54:48] What I'm seeing is that as you move to machine learning problems with really, really giant data sets, the percentage of data you send to dev and test is shrinking. Here's what I mean. Let's say you have 10 million examples: decent-sized, not giant, but a reasonable size. The splits above are actually pretty good rules of thumb for a small data set; if you have 5,000 examples they're perfectly fine to use. But if you have 10 million examples, then you'd have 6 million train, 2 million dev, 2 million test, and the question is: do you really need two million examples to estimate the performance of your final classifier? Sometimes you do. If you're working on online advertising, which I have done, and you're trying to increase your ad click-through rate by 0.1 percent, and it turns out increasing ad click-through rates by 0.1 percent, which I've done multiple times, is very lucrative, then you actually need a very large data set to measure these very, very small improvements. To increase an ad click-through rate by 0.1 percent you might have a lot of projects, say 10 projects, each of which increases the click-through rate by 0.01 percent. So to measure these very small differences, where algorithm A does 0.01 percent better than algorithm B, you need a lot of data to tease out that very small difference. If you're in the business of teasing out these very small differences, you actually need very large test sets. But if you are comparing different algorithms and one algorithm is, you know, 2 percent better, or even 1 percent better than the other, then a thousand examples may be enough for you to distinguish between these much larger differences.

[00:56:44] So my recommendation for choosing the dev and test sets is: choose them to be big enough that you have enough data to make meaningful comparisons between different algorithms. If you suspect your algorithms will differ in performance by 0.01 percent, you just need a lot of data to distinguish that. If you have 100 examples, then if one algorithm has 90 percent accuracy and another has 90.01 percent accuracy, unless you have at least a thousand examples, and maybe ten thousand or more, you just can't see this very small difference; with a hundred examples you just can't measure it.
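The back-of-the-envelope reasoning behind those numbers: the standard error of an accuracy estimate on n examples is roughly sqrt(p(1-p)/n), so you need n large enough that the gap you care about spans several standard errors. The helper below, including the z = 2 threshold, is my own rough sketch, not a formula from the lecture, and a real comparison of two classifiers on the same test set would use a paired test.

```python
import math

def rough_test_size(p, delta, z=2.0):
    """Rough test-set size so that an accuracy gap `delta` around base
    accuracy `p` is about z standard errors of a single estimate."""
    per_example_sd = math.sqrt(p * (1 - p))   # std dev of a single 0/1 outcome
    return math.ceil((z * per_example_sd / delta) ** 2)

# A 2% gap (90% vs 92%) is visible with on the order of a thousand examples...
print(rough_test_size(0.90, 0.02))
# ...but a 0.01% gap (90% vs 90.01%) needs tens of millions.
print(rough_test_size(0.90, 0.0001))
```

This is why ad click-through work, where wins come in 0.01 percent increments, demands enormous dev and test sets while a 1 to 2 percent comparison does not.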
So my advice is: choose your dev and test sets to be big enough that you can see the differences in performance of the algorithms that you roughly expect, and then you don't need to make your dev and test sets much larger than that; I would usually just put the data you don't need in dev and test back into the training set. So when you're working with a very large data set, say a million or ten million or a hundred million examples, what you see is that the percentage of data that goes into dev and test tends to be much smaller. You might see, for example, maybe 90 percent train, 5 percent dev, and 5 percent test, or even smaller, even 1 percent and 1 percent, depending on how much data you really need to measure, to the level of accuracy you need, the differences in performance of your algorithms. Okay. All right.
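To make the shrinking-percentage point concrete, here is a tiny helper of my own; the fractions are the rules of thumb from the lecture.

```python
def split_counts(n, train_frac, dev_frac):
    """Absolute subset sizes for a train/dev/test split; the remainder
    after train and dev goes to test."""
    n_train = round(n * train_frac)
    n_dev = round(n * dev_frac)
    return n_train, n_dev, n - n_train - n_dev

# The classic 60/20/20 split on 10 million examples puts 2 million in dev
# and 2 million in test, far more than most comparisons need...
print(split_counts(10_000_000, 0.60, 0.20))   # (6000000, 2000000, 2000000)
# ...while 98/1/1 still leaves a 100,000-example dev set and test set.
print(split_counts(10_000_000, 0.98, 0.01))   # (9800000, 100000, 100000)
```

At 100 examples the same 1 percent fraction would leave a single dev example, which is why the percentages only shrink once the absolute sizes stay large enough.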
[00:58:21] Um, just to give this whole procedure a name: what we just did here between the train and dev sets is called holdout cross-validation. Sometimes, to distinguish it from other cross-validation procedures we'll talk about in a minute, this is called simple holdout cross-validation; we'll talk about some other cross-validation procedures in a second. And the dev set is sometimes also called the cross-validation set. So sometimes you hear people say, you know, "we're going to use a cross-validation set," and that means roughly the same thing as a dev set. In the normal workflow of developing a learning algorithm, when you're given a data set, I would split it into a training set and a dev set. Oh, and I used to say "cross-validation set," but cross-validation is just a mouthful, so, I think motivated by reducing the number of syllables, because we use the phrase so often, more and more people just say "dev set"; it means roughly the same thing.

[00:59:42] So when I'm building a machine learning system, I'll often take the data, split it into train and dev, and, if you need a test set, also a test set, and then keep on fitting the parameters to the training set and evaluating the performance of the algorithm on the dev set. I'll use that to come up with new features, choose the model size, choose the regularization parameter lambda, really try out lots of different things, and spend, you know, several days or weeks optimizing the performance on the dev set. Then, when you want to know how well your algorithm is performing, you evaluate the model on the test set. The thing to be careful not to do is to make any decisions about your model using the test set, because then you're starting to fit the model to the test set and it's no longer an unbiased estimate.

[01:00:36] One thing that is actually okay to do: if you have a team that's working on a problem, and every week they measure the performance on the test set and report it out on a chart, that's actually okay. You can evaluate the model multiple times on the test set; you can give out a weekly report saying, this week, for our online advertising system, we have this result on the test set; one week later, this result on the test set; and so on. It's actually okay to evaluate your algorithm repeatedly on the test set. What's not okay is to use those evaluations to make any decisions about your learning algorithm. For example, if one day you notice that your model is doing worse this week than last week on the test set, and you use that to revert back to an older model, then you've just made a decision based on the test set, and your test set is no longer unbiased. But if all you do is report the results, and you don't make any decisions based on test set performance, such as whether to revert to an earlier model, then it is actually legitimate, it's actually okay, to keep on using the same test set to track your team's performance over time. Okay. All right. Good.

[01:01:50] So with very large data sets, this is the procedure, you know, for defining the train, dev, and test sets, and this procedure can be used to choose the degree of polynomial; it can also be used to choose the regularization parameter lambda, or the parameter C, or the parameter tau from locally weighted regression. Now, what if you have a very small data set?
[01:02:14] Now, what if you have a very small data set? It turns out that — and I'm going to leave out the test set for now; let's just set it aside and not worry about it — let's say you have 100 examples. If you split this into, you know, 70 in the training set S_train and 30 in S_dev, then you train your algorithm on 70 examples instead of a hundred examples.

[01:02:52] And so, I've actually worked on a few healthcare problems — most of my PhD students, including Anand, are doing a lot of work on machine learning applied to healthcare — and we're actually working on a few data sets in healthcare where, you know, every training example corresponds to some patient that sometimes had an unfortunate disease, or where every example corresponded to injecting a patient with a drug and seeing what happened to the patient. Sometimes there's literally a lot of blood and pain that goes into collecting every example, and if you have a hundred examples, then to hold out 30 of them for the purpose of model selection, using only 70 of your 100 examples — it seems like you're wasting a lot of data that was collected through a lot of, you know, literal pain. So is there a way to do model selection — say, to choose the degree of polynomial — without, quote, "wasting" so much of the data?

[01:03:52] There is a procedure that you should use only if you have a small data set, only if you're worried about its size. Oh, and the other disadvantage of the simple split is that you evaluate your model on only 30 examples, and that seems really small, right — can you find more data to evaluate your models as well? So there's a procedure that you should use only if you have a small data set, called k-fold cross-validation, or k-fold CV, and this is in contrast to simple cross-validation.
[01:04:27] Here's the idea. Let's say this is your training set S: you know, (x^(1), y^(1)) down to, say, (x^(100), y^(100)). What we're going to do is take the training set and divide it into K pieces. For the purpose of illustration I'm going to use K = 5 — just to make the writing on the board simpler — though K = 10 is typical. So you take your data set and divide it into five different subsets; in this example you would have 20 examples in each — 100 examples divided into 5 subsets is 20 examples per subset. And what you do is, for i = 1 to K: train — that is, fit parameters — on K − 1 of the pieces, then test on the remaining one piece, and then you average.

[01:05:54] So in other words, when K equals five, we're going to loop through five times. In the first iteration we hold out the last one fifth of the data, train on the rest, and test on that held-out fifth. Then in the second iteration through this for loop, we train on pieces one, two, three, and five, and test on piece number four, and we get a number. And then you hold out the third piece, train on the others, test on that, and so on. So you're doing this five times, where each time you leave out one fifth of the data, train on the remaining four fifths, and evaluate the model on that held-out one fifth. Okay?

[01:06:38] And so if you're trying to choose the degree of polynomial, what you would do is, for d = 1 to 5: you run this procedure for a first-order polynomial — you fit a linear regression model five times, each time on four fifths of the data, and test on the remaining one fifth — and you repeat this whole procedure for the quadratic function, repeat it for the cubic function, and so on. And — sorry — for each of these models you then average the five test errors you got. After doing this for every order of polynomial from one to five, you pick the degree of polynomial that did best according to this metric. Maybe you find that a second-order polynomial does best.

[01:07:47] And now you actually end up with five classifiers, right — five classifiers, each one fit on four fifths of the data. So there's a final, optional step, which is to refit the model on all 100% of the data. If you want, you could keep the five classifiers around and output their predictions, but then you're keeping five model files around. It may be a bit more common, now that you've chosen to use a second-order polynomial, to just refit the model once on all 100% of the data. Okay?
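The whole loop just described — split into K pieces, fit on K − 1, test on the held-out piece, average the K errors, pick the best degree, and optionally refit on everything — fits in a short sketch. This is my own minimal numpy illustration, not the lecture's code; the noisy quadratic toy data set is an assumption.

```python
import numpy as np

def kfold_cv_error(x, y, degree, k=5, seed=0):
    """Estimate the test MSE of a degree-`degree` polynomial by k-fold CV:
    shuffle, split into k folds, and for each fold fit on the other k-1
    folds and score on the held-out fold; return the average error."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))           # shuffle before splitting
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
        pred = np.polyval(coeffs, x[test_idx])
        errors.append(np.mean((pred - y[test_idx]) ** 2))
    return float(np.mean(errors))

# Toy data: quadratic ground truth plus noise (an assumption for the sketch).
rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 100)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(0.0, 1.0, size=100)

# Model selection: pick the degree with the lowest CV error, then
# (optional final step) refit that one model on all 100% of the data.
cv_errors = {d: kfold_cv_error(x, y, d) for d in range(1, 6)}
best_degree = min(cv_errors, key=cv_errors.get)
final_model = np.polyfit(x, y, best_degree)
```

Passing `k=10` gives the more typical choice; the structure is unchanged.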
[01:08:24] And so the advantage of k-fold cross-validation is that instead of leaving out 30% of your data for your dev set, on each iteration you're only leaving out one over K of your data. I used K = 5 for illustration, but in practice K = 10 is by far the most common choice; I've sometimes seen people use K = 20, but quite rarely. If you use K = 10, then on each iteration you're leaving out just one tenth of the data — 10% — rather than 30% of the data. And so, compared to simple cross-validation, this procedure makes more efficient use of the data, because you're holding out, you know, only 10% of the data on each iteration. The disadvantage is that it's computationally more expensive — you're now fitting each model ten times instead of just once. Okay? But when you have a small data set, this is actually a better procedure than simple cross-validation: if you don't mind the computational expense of fitting each model ten times, this actually lets you get away with holding out less data.

[01:09:44] And then there's one even more extreme version of this, which you should use if you have very, very small data sets. Sometimes you might have an even smaller data set — you know, if you're doing a class project with twenty examples, that's small even by today's machine learning standards. So there's an extreme version of k-fold cross-validation called leave-one-out cross-validation, which is: you set K equal to m. In other words, here's your training set — maybe twenty examples — and you're going to divide it into as many pieces as you have training examples.
[01:10:22] What you do is leave out one example, train on the other nineteen, and test on the one example you held out; then leave out a second example, train on the other nineteen, and test on the one example you held out; and do that twenty times, and then you average over the twenty outcomes to evaluate how good different orders of polynomial are. [01:10:42] The huge downside of this is that it's computationally very, very expensive, because now you need to train your algorithm m times — so you kind of never do this unless m is really small. I personally pretty much never use this procedure unless m is a hundred or less. You know, if your model isn't too complicated, you can afford to fit a linear regression model a hundred times — it's not too bad — so if m is less than 100, you could consider this procedure.
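Leave-one-out cross-validation is just the K = m special case, and the sketch is correspondingly simple. Again, this is an illustrative sketch rather than lecture code; the 20-example linear toy data set is an assumption.

```python
import numpy as np

def loocv_error(x, y, degree):
    """Leave-one-out CV: m tiny training runs, one held-out example each,
    averaging the m squared errors at the end."""
    m = len(x)
    errs = []
    for i in range(m):
        mask = np.arange(m) != i            # train on the other m-1 examples
        coeffs = np.polyfit(x[mask], y[mask], degree)
        pred = np.polyval(coeffs, x[i:i + 1])
        errs.append((pred[0] - y[i]) ** 2)  # test on the one held-out example
    return float(np.mean(errs))

# Twenty examples -- small even by today's standards, as in the lecture.
rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 20)
y = 3.0 * x + rng.normal(0.0, 0.3, size=20)

errors = {d: loocv_error(x, y, d) for d in (1, 2, 3)}
best = min(errors, key=errors.get)
```

Note the m separate fits in the loop — exactly the computational cost being warned about.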
[01:11:09] But if m is a thousand, fitting a model a thousand times seems like quite a lot, and I'd usually use k-fold cross-validation instead. But if you do have twenty examples, then, you know, I would probably use this procedure, and somewhere between twenty and fifty is maybe where I'd switch over from leave-one-out to k-fold cross-validation.

[01:11:46] [Student question] Oh yeah, right — so since you have K estimates, say ten estimates if you're using 10-fold cross-validation, can you measure the variance on those ten estimates? It turns out that those ten estimates are correlated, because for any pair of the ten classifiers, eight of the nine folds each one trained on are shared. There were some very interesting theory results — some research papers written by Michael Kearns, actually, from quite a long time ago — trying to understand how correlated these ten estimates are.
[01:12:24] From a theoretical point of view, as far as I know, the latest results show that this is no worse an estimate than the training error — but maybe what that's telling us is that, in practice, you could measure the variance, but we don't really trust that estimate of variance, because we think all ten estimates are at least somewhat correlated.

[01:12:53] [Student question] Would I find myself using k-fold cross-validation when debugging? Um, if you have a very small training set, then maybe yes, but with deep learning algorithms it depends on the details, right? Sometimes it takes so long to train that training your network twenty times, you know, seems like a pain — unless you have enough data, or unless your neural network is quite small. So it's rarely done with deep learning algorithms. But frankly, if you have so little data — if you have 20 training examples — there are other techniques that you probably need
to use to boost performance, such as transfer learning, or just more hand-engineering of input features, or something else.

[01:13:53] [Student question] Oh — sorry, thank you for asking that. This averaging step? No, I meant averaging the test errors. So here you will have trained ten classifiers, and when you evaluate each one on the left-out one tenth of the data, you get a number, right? You're looping ten times: hold out one part, train on the others, test on the part you left out. And so that would give you a number — say, when you test on this held-out part, the squared error was 5.0; then you do it again and the squared error was 5.7, then 2.8. So by "average" I meant average those numbers, and the average of those numbers is your estimate of the error of, you know, a third-order polynomial for this problem. So this loop gives you K real numbers, and this step is averaging those K numbers to estimate how well a particular degree of polynomial does. Okay? I should leave time for questions — go ahead.

[01:15:12] [Student question] Sure — if you're measuring something other than squared error, say an F1 score, would you do something other than averaging? Yes — averaging F1 scores is more complicated. I think we'll talk about that: this Friday we'll talk about learning theory, and next Friday we're talking about performance evaluation metrics, so I'll talk about F1 score then. All right.

[01:15:35] [Student question] Oh sure — how do you sample the data into these sets? So for the purposes of this class, assuming all your data comes from the same distribution, I would just use a random shuffle. Again, in the era of machine learning and big data,
[01:15:50] there's one other interesting trend, which just wasn't true ten years ago, which is that we're increasingly trying to train and test on different distributions. We're trying to, you know, train on data collected in one context and apply it to a totally different context — say, train on speech collected on your cell phone, because we have all that data, and apply it to a smart speaker, where the data was collected on a different microphone than your cell phone, or something. So if you are doing that, the way you set up your train, dev, and test sets is a bit more complicated. I wasn't going to talk about that in this class, but if you want to learn more — at the start of this course I mentioned I was working on this book, Machine Learning Yearning. That book is finished, and if you go to its website you can get a copy of it for free; it talks about that. I also talk about this more in CS230, which goes more into what to do when the training and test sets come from different distributions — you can also read all about it in Machine Learning Yearning. But yes, randomly shuffling would be a good default if you think your training and test distributions are not too different.

[01:17:01] All right, just one last thing I want to cover real quick, which is feature selection. So sometimes you have a lot of features. Let's take text classification: you might have ten thousand features, because there's one for each of ten thousand words. But you might suspect that many of the features are not important — you know, whether the word "the" appears in an email or not doesn't really tell you whether it's spam.
[01:17:48] Words like "the", "a", and "of" are called stop words — they don't tell you much about the content of the email. So if you have a lot of features, sometimes one way to reduce overfitting is to try to find a small subset of the features that are most useful for your task, right? And so, this takes judgment. There are some problems, like computer vision, where you have a lot of features, corresponding to there being a lot of pixels in every image, but probably every pixel is somewhat relevant, so you don't want to select a subset of pixels for most computer vision tasks. But there are some other problems where you may have a lot of features and you suspect that the way to prevent overfitting is to find a small subset of the most relevant features for your task. So feature selection is a special case of model selection that applies when you suspect that, even though you have ten thousand features,
maybe only 50 of them are highly relevant, right?

[01:18:44] And so, one example: if you are measuring a lot of things going on in a truck in order to figure out if the truck is about to break down — preventive maintenance — you might measure hundreds of variables, or many hundreds of variables, but you might secretly suspect that there are only a few things that, you know, predict when the truck is about to break down, for good preventive maintenance. If you suspect that's the case, then feature selection would be a reasonable approach to try. And so I'll just write out one algorithm, which is: start with script F equal to the empty set of features, and then repeat: (1) try adding each feature i to F, and see which single feature addition most improves the dev set performance; (2) go ahead and commit to adding that feature.

[01:20:18] So let me illustrate this with pictures. Let's say you have five features, x_1 through x_5 — in practice it's more like x_1 through x_500, or up through 10,000, but I'll just use five. You start off with an empty set of features and, you know, train a classifier with no features, so the model is h(x) = theta_0, right, with no features. This won't be a very good model — I think it would just predict the average of the y's, so it's not really a model — but see how well it does on your dev set. So this is step one. In the second iteration, you then take each of these features and add it to the empty set: you try the empty set plus x_1, the empty set plus x_2, and so on up to the empty set plus x_5, and for each of these you fit the corresponding model — for that last one you'd fit h(x) = theta_0 + theta_1 x_5. So you try adding one feature to your model and see which model most improves your performance on the dev set. And let's say you find that adding feature two is the best choice; so now you set the set of features to be {x_2}. For the next step, you then consider starting with x_2 and adding x_1, or x_3, or x_4, or x_5: if your model is already using the feature x_2, what additional feature most helps your algorithm? You fit four models, see which one does best — let's say it's x_4 — and now you commit to using the features x_2 and x_4. And you kind of keep on doing this: keep on adding features greedily, one at a time, seeing which single feature addition most improves your algorithm's performance, and you can keep iterating until adding more features hurts performance, and then pick whichever feature subset allowed you to get the best possible dev set performance.
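The greedy loop just illustrated — grow F from the empty set, commit to whichever single feature most improves dev-set error, stop once no addition helps — can be sketched directly. This is a minimal numpy sketch of my own, with a synthetic data set in which only features 1 and 3 (0-indexed) truly matter, echoing the x_2/x_4 story; all of that is an assumption for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Five candidate features; only features 1 and 3 carry signal.
m = 200
X = rng.normal(size=(m, 5))
y = 2.0 * X[:, 1] - 3.0 * X[:, 3] + rng.normal(0.0, 0.1, size=m)

X_tr, y_tr, X_dev, y_dev = X[:140], y[:140], X[140:], y[140:]

def dev_error(feats):
    """Fit least squares on the chosen feature subset (plus intercept)
    and return dev-set MSE; the empty subset fits just the intercept."""
    A_tr = np.column_stack([np.ones(len(X_tr))] + [X_tr[:, j] for j in feats])
    A_dev = np.column_stack([np.ones(len(X_dev))] + [X_dev[:, j] for j in feats])
    theta, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    return float(np.mean((A_dev @ theta - y_dev) ** 2))

# Forward search: start with F = {} and greedily add the single feature
# whose addition most improves dev-set performance; stop when none helps.
selected, best_err = [], dev_error([])
while True:
    candidates = [j for j in range(5) if j not in selected]
    if not candidates:
        break
    errs = {j: dev_error(selected + [j]) for j in candidates}
    j_best = min(errs, key=errs.get)
    if errs[j_best] >= best_err:        # adding any feature now hurts: stop
        break
    selected.append(j_best)
    best_err = errs[j_best]
```

Backward search, mentioned next, is the mirror image: start with all five features selected and greedily remove one at a time.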
Okay, so this is a special case of model selection called forward search: you start with an empty set of features and add features one at a time. There's also a procedure called backward search, which you can read about, where you start with all the features and remove features one at a time. This would be a reasonable feature selection algorithm; its disadvantage is that it's quite computationally expensive, but it can help you select a decent set of features. [01:22:58] Okay, we're running a little bit late, so let's break. Oh, I think I'm meant to be on the road next week, so we'll have Rafael teach decision trees next week, and he can also talk about neural networks. Okay, so let's break for today, and maybe we'll see some of you at the Friday discussion.

================================================================================
LECTURE 009
================================================================================

Lecture 9 - 
Approx/Estimation Error & ERM | Stanford CS229: Machine Learning (Autumn 2018)
Source: https://www.youtube.com/watch?v=iVOxMcumR4A
---
Transcript

[00:00:03] Okay, welcome everyone. Today we'll be going over learning theory. This used to be taught in the main lectures in previous offerings; this year we're covering it as a Friday section. However, some of the concepts we're covering today are important in the sense that they deepen your understanding of how machine learning works: what assumptions you're making, why things generalize, and so forth. [00:00:38] So here's the agenda for today. We'll quickly start off with framing the learning problem, then go deep into the bias-variance tradeoff and spend some time there, and then look at another way of decomposing the error, as an approximation error and an estimation error.
We'll see what empirical risk minimization is, and then we'll spend some time on uniform convergence and VC dimension. So let's jump right in. [00:01:15] The assumptions under which we're going to operate for this lecture, and in fact for most of the algorithms we cover in this course, are two main assumptions. One is that there exists a data distribution D from which the (x, y) pairs are sampled. This makes sense in the supervised learning setting, where you're expected to learn a mapping from x to y, but the assumption actually holds more generally, even in the unsupervised setting. The main assumption is that there is a data-generating distribution, and that the examples we have in our training set and the ones we encounter when we test are all coming from the same distribution.
That's the core assumption; without it, coming up with any theory would be much harder. So the assumption is that there is some data-generating process, and we have a few samples from that process, which become our training set. That's a finite number, though in principle you could draw infinitely many samples from the data-generating process, and the examples we encounter at test time are also samples from the same process. [00:03:01] The second assumption is that all these samples are sampled independently. With these two assumptions, we can picture learning like this: we have a set of (x, y) pairs, which we call S; these are just (x1, y1), ..., (xm, ym), so we have m samples drawn from the data-generating process, and we feed this into a learning algorithm.
[00:04:02] The output of the learning algorithm is what we call a hypothesis. A hypothesis is a function which accepts a new input x and makes a prediction about y for that x. The hypothesis is sometimes also written in the form theta hat: if we restrict ourselves to a class of hypotheses, for example all possible logistic regression models of dimension n, then obtaining the parameters is equivalent to obtaining the hypothesis function itself. [00:04:45] A key thing to note here is that S is a random variable, while the learning algorithm is a deterministic function. And what happens when you feed a random variable through a deterministic function? You get a random variable. Exactly: so the hypothesis that we get is also a random variable. [00:05:27] Now, all random variables have a distribution associated with them.
The distribution associated with the data is the data distribution, capital D; the learning algorithm is just a fixed deterministic function; and the parameters we obtain have a certain distribution as well. In a more statistical setting we would call this an estimator: if you take some advanced statistics courses, what you'll come across as an estimator is what we here call a learning algorithm. [00:06:11] The distribution of theta hat is also called the sampling distribution. And what's implied in this process is that there exists some theta star, or H star if you prefer to view it that way, which is in some sense the true parameter, the parameter we wish were the output of the learning algorithm. But of course we never know what theta star is, and what we get out of the learning algorithm is just going to be a sample from a random variable.
[00:07:09] Now, a thing to note is that theta star (or H star) is not random; it's just an unknown constant. When we say it's not random, we mean there is no probability distribution associated with it: it's just a constant which we don't know. That's the assumption under which we operate. [00:07:36] Now let's look at some properties of this theta hat. All the entities we estimate are generally decorated with a hat on top, which indicates something we estimated, and anything with a star is the true or right answer, which we don't have access to in general. Any questions so far? Yeah. [00:08:12] So the question is what theta looks like, say in the case of linear or logistic regression: in linear regression it generally happens to be a vector.
It could also be a scalar, or a matrix; it's just an entity that we estimate. And sometimes theta star can be so generic that it need not even be parameterized: it's just some function that you estimate. So it could be a vector, a scalar, or a matrix; it could be anything. [00:08:52] So, in the lecture we saw this diagram when we were talking about bias and variance in the case of regression: this fit is underfit, this one is overfit, and this one is just right. The concepts of underfitting and overfitting are closely related to bias and variance. This is how you would view it from the data: this axis is x, this is y, this is your data, and from the data point of view these are the kinds of fits you might get from different algorithms.
However, to get a more formal view of what bias and variance are, it's more useful to look at the parameter view. [00:10:35] So let's imagine we have four different learning algorithms, and here is the parameter space: theta_1 and theta_2, imagining you have just two parameters, since that's easy to visualize. These correspond to algorithm A, algorithm B, C, and D, and there is a true theta star, which is unknown. [00:11:30] Now let's imagine we run through this process of sampling m examples, running them through the algorithm, and obtaining a theta hat; then we start over with a new sample from D, run it through the algorithm, and get a different theta hat. And the theta hat is going to be different for different learning algorithms.
[00:11:56] So first we sample some data, that's our training set; we run it through algorithm A, and let's say this is the parameter we got; then through algorithm B, and this is the parameter we got; and through C here, and through D over here. And we're going to repeat this: the second sample may land here, and so on, and we repeat this process over and over. The key is that the number of samples per run, m, is fixed; but each time we repeat the process we get a different point over here. [00:12:52] So each dot corresponds to a sample of size m, and the number of points is the number of times we repeated the experiment. And what we see is that these dots are samples from the sampling distribution. Now the concept of bias and variance is visible right here.
[00:13:23] If we were to classify these four plots, we would call this bias and variance: these two are algorithms that have low bias, these two have high bias; these two have low variance, these two have high variance. So what does this mean? Bias is asking whether the sampling distribution is centered around the true, unknown parameter, and variance is measuring how dispersed the sampling distribution is. Formally speaking, this is bias and this is variance, and it becomes pretty clear when we see it in the parameter view instead of the data view. Essentially, bias and variance are just properties of the first and second moments of your sampling distribution: you're asking whether the first moment, the mean, is centered around the true parameter.
And the second moment, the variance, is literally the variance of the bias-variance tradeoff. [00:14:51] Yeah, so this is a diagram where I'm using only two thetas just so it fits on a whiteboard. You would imagine something that has high variance, for example this one, to be of a much higher dimension, not just two; but it would still be spread out, it would still have high variance; the points would live in a higher-dimensional space but be more spread out. [00:15:26] So the question was: over here we actually had more thetas, while here, with the higher-variance plots, we have the same number of thetas. So yes, you could imagine this to be higher-dimensional; and also, different algorithms can have different bias and variance even though they have the same number of parameters.
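The "parameter view" can be reproduced numerically by literally repeating the experiment. A small sketch with assumed ingredients (none of this is from the lecture's board): the data are Gaussian around a true parameter theta* = 2, and the two "learning algorithms" are two estimators of the mean, one unbiased and one deliberately shrunk toward zero.

```python
import random
import statistics

THETA_STAR = 2.0

def sampling_distribution(estimator, m=50, trials=2000, seed=0):
    """Draw a fresh size-m training sample, run the deterministic estimator,
    and repeat: the collected theta-hats are draws from the sampling
    distribution of that estimator."""
    rng = random.Random(seed)
    return [estimator([rng.gauss(THETA_STAR, 1.0) for _ in range(m)])
            for _ in range(trials)]

def sample_mean(xs):      # unbiased estimator of the mean
    return statistics.fmean(xs)

def shrunk_mean(xs):      # biased toward 0, but lower variance
    return 0.5 * statistics.fmean(xs)

for name, est in [("sample mean", sample_mean), ("shrunk mean", shrunk_mean)]:
    draws = sampling_distribution(est)
    bias = statistics.fmean(draws) - THETA_STAR  # first moment vs. theta*
    var = statistics.variance(draws)             # spread of the sampling dist.
    print(f"{name}: bias {bias:+.3f}, variance {var:.4f}")
```

The first estimator's cloud of dots is centered on theta* with spread about 1/m; the shrunk one is tighter but centered at 1.0, i.e. lower variance bought at the price of bias.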
For example, if you add regularization, the variance comes down; we'll go over that. [00:16:03] A few observations we want to make: as we increase the size of the data we feed in each time, that is, if we take a bigger sample every time we learn, the variance of theta hat becomes smaller. If we repeat the same process with a larger number of examples, all of these clusters become more tightly concentrated; so the spread is a function of how many examples we have in each iteration. [00:16:47] As m tends to infinity, the variance tends to zero: if you were to collect an infinite number of samples and run them through the algorithm, you would get some particular theta hat, and if you were to repeat that, again with an infinite number of examples, you would always keep getting the same theta hat.
[00:17:16] Now, the rate at which the variance goes to zero as you increase m is what's called the statistical efficiency: it's a measure of how efficient your algorithm is at squeezing information out of a given amount of data. And if theta hat tends to theta star as m tends to infinity, you call such algorithms consistent. [00:18:04] And if the expected value of theta hat is equal to theta star for all m, so that no matter how big your sample size is you always end up with a sampling distribution centered around the true parameter, then your estimator is called an unbiased estimator.
[00:18:30] Yes, so efficiency is the rate at which the variance drops to zero as m tends to infinity. For example, you may have one algorithm where the variance goes as 1/m^2 and another where it goes as e^(-m); the variance can drop at different rates relative to m, and that's what efficiency captures. [00:19:08] Yeah, so the question is what it means for theta hat to approach theta star. Here's one thing to be clear about: theta star is a number, a constant, but theta hat is a random variable. What we are saying is that as m tends to infinity, the distribution of theta hat converges towards being a constant, and that constant is theta star. That means at smaller values of m your algorithm might be centered elsewhere, but as you get more and more data, your sampling distribution's variance reduces and it eventually gets centered around the true theta star.
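The claim that the sampling distribution collapses onto a constant as m grows is easy to check by simulation. A sketch under assumed conditions: Gaussian data around a true parameter theta* = 2, with the sample mean as the learning algorithm; its variance should fall off like 1/m, which is exactly the rate the lecture calls its efficiency.

```python
import random
import statistics

THETA_STAR = 2.0

def theta_hat(m, rng):
    """One run of the 'learning algorithm': draw m samples, return the estimate."""
    return statistics.fmean(rng.gauss(THETA_STAR, 1.0) for _ in range(m))

rng = random.Random(1)
for m in (10, 100, 1000):
    draws = [theta_hat(m, rng) for _ in range(2000)]
    print(f"m={m}: mean {statistics.fmean(draws):.3f}, "
          f"variance {statistics.variance(draws):.5f}")  # roughly 1/m
```

The mean of the draws stays near theta* at every m (the estimator is unbiased), while the variance shrinks toward zero, which is the consistency picture described above.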
[00:20:11] Informally speaking, if your algorithm has high bias, it means that no matter how much data or evidence you provide, it keeps away from theta star: you cannot change its mind. No matter how much data you feed it, it's never going to center itself around theta star; it's biased away from the true parameter. And variance you can think of as your algorithm being highly distracted by the noise in the data, easily getting swayed far away depending on that noise; such algorithms you would call high-variance, because they can easily be swayed by noise in the data. And as we're seeing here, bias and variance are kind of independent of each other: an algorithm can have any combination of bias and variance; there is no correlation between bias and variance.
[00:21:17] So how do we fight variance? First let's look at how we can address variance. Yes, the question: bias and variance are properties of the algorithm at a given sample size m. These plots were for a fixed size m, and for that fixed data size this algorithm has high bias and low variance, this one has high variance and high bias, and so on. Yeah, you can think of it as assuming a fixed data size. [00:22:08] So, fighting variance: one way to address a high-variance situation is simply to increase the amount of data you have, which naturally reduces the variance of your algorithm. Yes, that is true: you don't know upfront whether you're in a high-bias or high-variance scenario. One way to test that is by looking at your training performance versus your test performance.
kind of test that is by looking at your training [00:22:57] is by looking at your training performance versus test performance we [00:23:00] performance versus test performance we go over that arm in fact we're going to [00:23:03] go over that arm in fact we're going to go into you know much more detail in the [00:23:06] go into you know much more detail in the main lectures of how do you identify [00:23:07] main lectures of how do you identify bias and variance here we're just going [00:23:09] bias and variance here we're just going over the concepts of what our bias and [00:23:11] over the concepts of what our bias and what our variance so one way to address [00:23:16] what our variance so one way to address variances you just get more data right [00:23:18] variances you just get more data right as you get more data the your sampling [00:23:21] as you get more data the your sampling distributions kind of tend to get more [00:23:23] distributions kind of tend to get more concentrated the other way is what's [00:23:27] concentrated the other way is what's called as regularization [00:23:32] so when you when you had regularization [00:23:35] so when you when you had regularization like l2 regularization or l1 [00:23:37] like l2 regularization or l1 regularization what we are effectively [00:23:41] regularization what we are effectively doing is let's say we have an algorithm [00:23:45] doing is let's say we have an algorithm with high variance maybe low bias no [00:23:53] with high variance maybe low bias no bias high variance and you add [00:23:58] bias high variance and you add regularization right what you end up [00:24:01] regularization right what you end up with is an algorithm that has maybe a [00:24:10] with is an algorithm that has maybe a small bias you increase the bias by [00:24:13] small bias you increase the bias by adding regularization but low variance [00:24:19] so if what you care about is your [00:24:22] so if what you care about is your 
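As a rough illustration of the trade just described (my own sketch, not from the lecture; the model and data are made up), the snippet below fits an unregularized least-squares slope and an L2-regularized ("ridge") slope on many fresh noisy samples, then compares how far each estimator's average lands from the true parameter (bias) and how much the estimates spread (variance).

```python
import numpy as np

rng = np.random.default_rng(0)
TRUE_THETA = 2.0     # the "theta star" of the lecture
m, lam, trials = 20, 5.0, 2000

ols, ridge = [], []
for _ in range(trials):
    x = rng.uniform(0.0, 1.0, m)
    y = TRUE_THETA * x + rng.normal(0.0, 1.0, m)   # fresh noisy sample each trial
    ols.append(x @ y / (x @ x))                    # least-squares slope, no regularization
    ridge.append(x @ y / (x @ x + lam))            # L2-regularized slope (closed form)

ols, ridge = np.array(ols), np.array(ridge)
print(f"OLS:   mean={ols.mean():.3f}  std={ols.std():.3f}")
print(f"ridge: mean={ridge.mean():.3f}  std={ridge.std():.3f}")
# The ridge estimates cluster more tightly (lower variance) but their
# center is pulled below TRUE_THETA (higher bias), as described above.
```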
[00:24:22] So if what you care about is your predictive accuracy, you're probably better off trading your high variance for some bias and reducing your variance to a large extent. [00:24:38] Yeah, we are going to look into that next. [00:24:54] So in order to get a better understanding of this, think of this as the space of hypotheses. Let's assume there exists a hypothesis, call it g, which is the best possible hypothesis you can think of. By best possible hypothesis I mean: if you were to take this hypothesis and take the expected value of the loss with respect to the data-generating distribution, across an infinite amount of data, you would have the lowest error with it. So this is the best possible hypothesis. [00:25:51] And then there is this class of hypotheses; let's call the class H. This, for example, can be the set of all logistic regression hypotheses, or the set of all SVMs. So this is a class of hypotheses, and what we end up with when we take a finite amount of data is some member over here; let me call it h hat. [00:26:28] There is also some hypothesis in this class, let me call it h star, which is the best-in-class hypothesis: within the set of all logistic regression functions there exists some model which would give you the lowest error if you were to test it on the full data distribution. The best possible hypothesis may not be inside your hypothesis class; it's conceptually something that can lie outside the class. [00:27:07] So g is the best possible hypothesis, h star is the best in class H, and h hat is the one you learned from finite data. [00:27:49] We also introduce some new notation.
[00:27:53] Epsilon of h: we will call this the risk, or generalization error, and it is defined to be the expectation over (x, y) sampled from D of the error indicator. You sample examples from the data-generating process, run them through the hypothesis, and check whether the output matches the label: if it doesn't match you get a 1, and if it matches you get a 0. So on average this is, roughly speaking, the fraction of all examples on which you make a mistake. [00:28:54] Here we are thinking about this from a classification point of view, checking whether the class you output matches the true class or not. You can also extend this to the regression setting, though that's a little harder to analyze; the generalization to the regression setting holds, but we'll stick to classification for now. [00:29:21] And we have epsilon hat sub S of h, and this is called the empirical risk, or empirical error. The difference here is that the first is an infinite process: you're sampling from D forever and calculating the long-term average. Whereas here you have a finite sample that's given to you, and you measure the fraction of those examples on which you make an error. [00:30:15] All right, before we go further, there was a question about how adding regularization reduces your variance; actually, let me get back to that in a bit. [00:30:39] So epsilon of g is called the Bayes error. This essentially means: if you take the best possible hypothesis, what is the rate at which you make errors? And that can be nonzero.
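To make the two quantities concrete, here is a small sketch (my own illustration, not from the lecture; the distribution and hypothesis are made up) that computes the empirical risk on a finite sample and approximates the risk by a very large sample standing in for the infinite expectation.

```python
import numpy as np

rng = np.random.default_rng(1)

def h(x):
    """A fixed hypothesis: predict class 1 when x > 0."""
    return (x > 0).astype(int)

def sample(n):
    """Toy data-generating distribution D: the label follows x, with some noise."""
    x = rng.normal(0.0, 1.0, n)
    y = ((x + rng.normal(0.0, 0.5, n)) > 0).astype(int)
    return x, y

# Empirical risk: fraction of mistakes on a finite sample S of size m = 50.
x_s, y_s = sample(50)
emp_risk = np.mean(h(x_s) != y_s)

# Risk (generalization error): an expectation over D, approximated here
# by a million draws standing in for the infinite sampling process.
x_big, y_big = sample(1_000_000)
risk = np.mean(h(x_big) != y_big)

print(f"empirical risk (m=50): {emp_risk:.3f}")
print(f"risk (approx.):        {risk:.3f}")
```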
[00:30:56] Even the best possible hypothesis can still make some mistakes, and this is also called the irreducible error. For example, if your data-generating process spits out examples where the same x has different y's in two different examples, then no learning algorithm can do well in such cases; that's just one kind of irreducible error, and there can be other kinds of irreducible error as well. [00:31:42] Epsilon of h star minus epsilon of g is called the approximation error. This essentially means: what is the price we are paying for limiting ourselves to some class? It's the difference between the best possible error that you can get overall and the best possible error you can get from within H. So this is an attribute of the class: what is the cost you pay for restricting yourself to a class? [00:32:21] And then you have epsilon of h hat minus epsilon of h star, and this we call the estimation error: given the data that we got, the m examples, and the h hat that our estimator produced from them, what is the error due to estimation? [00:33:06] So the error of g is the Bayes error, the gap between that and the best in class is the approximation error, and the gap between the best in class and the hypothesis you end up with is the estimation error. [00:33:24] And it's easy to see that epsilon of h hat is actually equal to the estimation error plus the approximation error plus the Bayes error: if you just add them up, the intermediate terms cancel out and you're left with epsilon of h hat. So it's useful to think about your generalization error as made up of different components.
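Written out in symbols, the decomposition just described is a telescoping sum (standard notation, with epsilon as defined above):

```latex
\varepsilon(\hat{h})
 \;=\; \underbrace{\varepsilon(\hat{h}) - \varepsilon(h^{*})}_{\text{estimation error}}
 \;+\; \underbrace{\varepsilon(h^{*}) - \varepsilon(g)}_{\text{approximation error}}
 \;+\; \underbrace{\varepsilon(g)}_{\text{Bayes error}}
```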
[00:34:13] There is some error which you just cannot reduce, no matter what hypothesis you pick and no matter how much training data you have; there's no way to get rid of the irreducible error. Then you make some decisions, say that you're going to limit yourself to neural networks or logistic regression or whatever, and thereby you're defining a class of all possible models, and that has a cost itself: that's your approximation error. And then you are working with limited data (this part is generally due to data), and with the limited data that you have, and possibly due to some nuances of your algorithm, you also have an estimation error. [00:34:47] We can further see that the estimation error can be broken down into estimation variance and estimation bias, and you can therefore write the generalization error as estimation variance, plus estimation bias, plus approximation error, plus Bayes error. What we commonly call the variance is the estimation variance; what we call the bias is the estimation bias together with the approximation error; and the Bayes error is just irreducible. [00:35:34] So sometimes you see the bias-variance decomposition and sometimes you see the estimation-approximation error decomposition; they are somewhat related, but not exactly the same. [00:35:47] The bias is basically trying to capture why h hat is far from g: why did our hypothesis stay away from the true hypothesis? That could be because your class is too small, or it could be due to other reasons, such as, as we'll see, maybe regularization that keeps you away from certain hypotheses. [00:36:21] And the variance is generally due to, almost always due to, having small data, though it could be due to other reasons as well. But these are two different ways of decomposing your error.
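The two notions can be seen numerically with a small Monte Carlo sketch (my own illustration, not from the lecture; the target function and degrees are made up): fit a small hypothesis class (lines) and a bigger one (degree-5 polynomials) on many fresh noisy samples, then measure, at one test point, the systematic offset of the average prediction from the truth (bias) and the spread of the predictions (variance).

```python
import numpy as np

rng = np.random.default_rng(2)

def true_f(x):
    return np.sin(2 * np.pi * x)      # stands in for the "true hypothesis" g

def fit_and_predict(degree, x0, trials=500, m=15):
    """Fit a degree-`degree` polynomial to fresh noisy samples; predict at x0."""
    preds = []
    for _ in range(trials):
        x = rng.uniform(0, 1, m)
        y = true_f(x) + rng.normal(0, 0.3, m)
        coef = np.polyfit(x, y, degree)
        preds.append(np.polyval(coef, x0))
    preds = np.array(preds)
    bias = preds.mean() - true_f(x0)  # systematic offset from the truth
    var = preds.var()                 # spread caused by the random sample
    return bias, var

b1, v1 = fit_and_predict(degree=1, x0=0.25)  # small class: high bias, low variance
b5, v5 = fit_and_predict(degree=5, x0=0.25)  # bigger class: low bias, higher variance
print(f"degree 1: bias={b1:+.3f}  variance={v1:.3f}")
print(f"degree 5: bias={b5:+.3f}  variance={v5:.3f}")
```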
[00:36:37] So now, if you have high bias, how do you fight high bias? [00:36:50] Any guesses? Mm-hmm, yeah, exactly. So one way is to just make your H bigger, and you can also try different algorithms after making your H bigger. [00:37:18] What this generally means: what we saw there was that regularization reduces your variance by paying a small cost in bias, and over here, let's say your algorithm has high bias and some variance, and you make H bigger, you make your class bigger. This generally results in something which reduces your bias but also increases your variance. [00:38:25] So with this picture you can also see how variance comes into the picture: just by having a bigger class, there is a higher probability that the hypothesis you estimate can vary a lot. If you reduce the space of hypotheses, you may be increasing your bias, because you may be moving away from g, but you're also effectively reducing your variance. [00:38:55] So that's one of the trade-offs you observe: any step you take toward reducing bias, for example by making H bigger, also makes it possible for your h hat to land in a wider space, and that increases your variance. And if you take a step toward reducing your variance, maybe by making your class smaller, you may end up making it smaller in a way that moves away from g, and thereby increase your bias. [00:39:26] So, to come back to the question somebody asked before, how does adding regularization decrease the variance? By adding regularization you're effectively shrinking the class of hypotheses that you have: you start penalizing those hypotheses whose theta is very large, and in a way you're shrinking the class of hypotheses. If you shrink the class of hypotheses, your variance is reduced, because there's much less wiggle room for your estimator to place your h hat; and if you shrink it by going away from g, you also introduce bias. That's the bias-variance tradeoff. Any questions on this so far? [00:40:31] Yeah, you probably want to think of that diagram as a generalized version of this one: here we have fixed data and, say, theta 1 and theta 2, and because you could parameterize the hypotheses with a few parameters, you can plot them in parameter space.
But that diagram is more general, like a bag of hypotheses. In any case, in both of those diagrams a point is one hypothesis; here it's parameterized, there it's not parameterized. [00:41:16] Yes: so the question is, what if we shrink it towards h star? The thing is, we don't know where h star is. If we knew it, we wouldn't even need to learn anything; we could just go straight there. [00:41:44] So the question is: when you add regularization, are we sure that the bias goes up? No, we don't know, but this is the common scenario of what happens: when you add regularization you reduce the variance for sure, and you're very likely going to introduce some bias in that process. [00:42:08] If you add regularization you're shrinking your hypothesis space in some way, so you're kind of moving away from the true g, so you're adding a little bit of bias; you're very likely to add some bias in that process. [00:42:26] So I would encourage you, after this lecture, to think about this a little more slowly. It takes a while to internalize the concepts of bias and variance, and it's not very intuitive, but thinking about it more definitely helps. [00:42:45] All right, any other questions before we move on? So, an example of a hypothesis class: an example would be the set of all logistic regression models. When you do gradient descent on your logistic regression class, you are implicitly restricting yourself to the set of all possible logistic regression models; that's implicit. [00:43:22] So the h is the output of the learning algorithm: you feed an input to your algorithm. This box is not the model; this is the learning algorithm, like gradient descent, for example, and the output of it is the parameters that you learned, that you converge to. So you probably don't want to think of this as the model that you learned, but as the training process, and the output of the training process is the model that you learn, and that is a point in your class of hypotheses. [00:44:07] Yes: so you fix the class of models, you say I'm only going to learn logistic regression models; for different samples of data that you feed it as its training set, it's going to learn a different theta hat, but those all have to be within the class of hypotheses. [00:44:28] All right, so let's move on. [00:44:56] So next we come across this concept called empirical risk minimization.
[00:45:38] The empirical risk minimizer is a learning algorithm; it is one of those boxes that we drew, the box we drew earlier labeled "learning algorithm". [00:46:04] The diagram that we drew earlier, based on which we have reasoned about everything so far, didn't actually tell you what happens inside the box. It could be doing gradient descent, it could do something else, it could be some smart programmer who's written a whole bunch of if-elses and just returns a theta; it could be anything. And no matter what kind of algorithm is used, the bias-variance theory still holds. [00:46:32] Now we are going to look at a very specific type of learning algorithm called the empirical risk minimizer: you feed data into your algorithm and you get out, not h star, but h hat, equal to the minimizer of the empirical risk over the class. [00:47:10] So what does ERM, empirical risk minimization, do? It's what we've been doing so far in the course: we try to find the hypothesis in a class of hypotheses that minimizes the average training error. [00:47:38] So, for example, this is trying to minimize the training error; from a classification perspective, this is minimizing the training error, or increasing the training accuracy, which is different from what logistic regression actually did, where we were doing maximum likelihood, minimizing the negative log-likelihood. It can be shown that losses like the logistic loss can be well approximated by ERM, and this theory should hold nonetheless. [00:48:09] So we are limiting ourselves to the class of algorithms which work by minimizing the training loss, as opposed to something that, say, returns a constant all the time or does something else. If we limit ourselves to empirical risk minimizers, then we can come up with more theoretical results, for example uniform convergence.
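ERM is easiest to see on a finite hypothesis class. Here is a small sketch (my own illustration, not from the lecture; the class of threshold classifiers and the data are made up): evaluate the empirical risk of every hypothesis in the class on the training set and return the one with the lowest training error, the "h hat" above.

```python
import numpy as np

rng = np.random.default_rng(3)

# A small finite hypothesis class H: threshold classifiers h_t(x) = 1{x > t}.
thresholds = np.linspace(-2, 2, 41)

def empirical_risk(t, x, y):
    """Fraction of training mistakes made by hypothesis h_t."""
    return np.mean((x > t).astype(int) != y)

# Training set drawn from a toy distribution whose true threshold is 0.
m = 100
x = rng.normal(0, 1, m)
y = ((x + rng.normal(0, 0.3, m)) > 0).astype(int)

# Empirical risk minimization: pick the hypothesis in H with the
# lowest training error.
risks = [empirical_risk(t, x, y) for t in thresholds]
h_hat = thresholds[int(np.argmin(risks))]
print(f"ERM picks threshold {h_hat:+.2f} with training error {min(risks):.3f}")
```

Note that ERM says nothing about *how* the minimizer is found; here it is brute-force search, but gradient descent on a surrogate loss plays the same role for logistic regression.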
[00:49:02] So we are limiting ourselves to empirical risk minimizers, and we're starting off with uniform convergence. There are two central questions that we are interested in. [00:49:36] One question is: if we do empirical risk minimization, that is, if we just reduce the training loss, what does that say about the generalization error? That is basically ε̂(h) versus ε(h): consider some hypothesis; it gives you some amount of training error; what does that say about its generalization error? That's one central question we want to consider. [00:50:14] And the second one is: how does the generalization error of our learned hypothesis compare to the best possible generalization error in that class? Note we're only talking about h* and not g here; h* is the
best in class so these are [00:50:38] is is the best in class so these are these are two central questions that we [00:50:39] these are two central questions that we want to we want to explore and for this [00:50:43] want to we want to explore and for this we're going to use two tools right so [00:50:49] we're going to use two tools right so one is called the Union bound right [00:50:55] one is called the Union bound right what's the Union bound if we have [00:51:00] what's the Union bound if we have see different events - okay then [00:51:07] see different events - okay then this need not be independent then the [00:51:16] this need not be independent then the probability of if this looks trivial it [00:51:40] probability of if this looks trivial it is trivial it's it's it's probably one [00:51:41] is trivial it's it's it's probably one of the axioms in in in your undergrad [00:51:45] of the axioms in in in your undergrad probability class the the probability of [00:51:48] probability class the the probability of any one of these events happening is [00:51:50] any one of these events happening is less than or equal to the sum of the [00:51:53] less than or equal to the sum of the probabilities of of each of them [00:51:55] probabilities of of each of them happening right and then we have a [00:51:59] happening right and then we have a second tool is called the halflings [00:52:08] second tool is called the halflings inequality we're only going to state the [00:52:18] inequality we're only going to state the inequality here there is a supplemental [00:52:21] inequality here there is a supplemental notes on the website that actually [00:52:23] notes on the website that actually proves the tufting inequality you can go [00:52:25] proves the tufting inequality you can go through that but here we are only going [00:52:29] through that but here we are only going to state the result in fact throughout [00:52:31] to state the result in fact throughout this session you 
[00:52:35] So let Z₁, Z₂, …, Z_m be sampled from some Bernoulli distribution with parameter φ, and let φ̂ be the average of the Zᵢ, φ̂ = (1/m) Σᵢ Zᵢ, and let there be a γ > 0, which we call the margin. [00:53:17] The Hoeffding inequality basically says that the probability that the absolute difference between the estimated parameter φ̂ and the true parameter φ is greater than some margin can be bounded by two times the exponential of minus two gamma squared m: P(|φ̂ − φ| > γ) ≤ 2 exp(−2γ²m). Not very obvious, but you can show this. [00:53:55] What it is basically saying is: there is some parameter between 0 and 1 of a Bernoulli distribution; the fact that it is between 0 and 1 means it's bounded, and that's a key requirement for the Hoeffding inequality. And now we take samples from this Bernoulli distribution, and the estimator for this is basically
the average of the samples: each of the Zᵢ is either a 0 or a 1, sampled with probability φ, and the estimator is basically just the average of your samples, right? [00:54:39] And the probability that the absolute difference between the estimated value and the true value becomes greater than some margin γ is bounded by this expression. So there are a lot of things happening here; let's slowly think through this. γ is the margin, and |φ̂ − φ| is basically the deviation, or the error: it's the absolute value of how far away your estimated value is from the true one, and you would like it to be close. [00:55:21] So you probably want your φ̂ and φ to differ by not more than, say, 0.001, in which case, if the absolute difference between the estimated and the true parameter being greater
than 0.001 is the margin that you're interested in, then the Hoeffding inequality says that if you were to repeat this process over and over and over, the fraction of times φ̂ is going to be farther than 0.001 from the true parameter is going to be less than this expression, which is a function of m. [00:56:02] And you can kind of believe it, because as m increases this bound becomes smaller, which means the probability of your estimate deviating by more than a certain margin only decreases as you increase m. So this is Hoeffding's inequality, and we're going to use this. Questions? [00:56:35] [Student question] So the question is: is h* the limit of ĥ as m goes to infinity? It is h* in the limit as m goes to infinity if ĥ is a consistent estimator; we went over the concept of consistency: given infinite data, will you eventually get to the right answer?
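[Editor's aside: the Hoeffding bound just stated can be checked by repeating the estimation experiment many times. The values of φ, γ, and m below are arbitrary illustrative choices.]

```python
import numpy as np

rng = np.random.default_rng(2)

phi, gamma, m = 0.5, 0.1, 200   # true parameter, margin, sample size
reps = 20_000                   # how many times we repeat the experiment

# Each row: m Bernoulli(phi) draws; phi_hat is their average.
z = rng.binomial(1, phi, size=(reps, m))
phi_hat = z.mean(axis=1)

# Observed frequency of the "bad" event |phi_hat - phi| > gamma ...
freq = float(np.mean(np.abs(phi_hat - phi) > gamma))
# ... versus the Hoeffding bound 2 * exp(-2 * gamma^2 * m).
bound = 2 * np.exp(-2 * gamma**2 * m)
print(freq, "<=", bound)
```

[As the lecture says, increasing m shrinks the bound, and the observed deviation frequency shrinks along with it.]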
[00:56:57] And if your estimator is not consistent, then it need not be; so in general ĥ need not converge to h* as you get an infinite amount of data. So now we want to use these tools, tool 1 and tool 2, to answer our central questions. Any other questions? [00:57:24] [Student question] Yes, this is a more limited version of Hoeffding's inequality, and yes: if we limit ourselves to a Bernoulli variable which has some parameter φ, and you take samples from it and you construct an estimator which is the average of the samples, of the zeros and ones, then this inequality holds; this inequality is called the Hoeffding inequality. [00:58:11] [Student question] So in general, there is this class of algorithms called maximum likelihood estimators, and a pure maximum likelihood estimator is generally consistent. If you include regularization, then it need not
be consistent, though I'm not very sure about that. [00:59:14] So basically, repeating for the mic, what he responded was: if you have an algorithm like a neural net, which is non-convex, you may actually not end up with the same result even if you increase the number of samples. Though I would probably think of the non-convexity as part of an estimation bias, because you could in theory always find the global minimum of a neural network; it's just that there's some bias in our estimator, in that we are using gradient descent and cannot solve it exactly. [00:59:54] Okay, so now let's use these two tools, and for that we're going to start with this diagram. So over here on this axis we have hypotheses, and here we have error. [01:00:34] There's actually one curve which I'm trying to make thick, and it probably looks like
multiple curves, but it's just one curve, and this we will call the generalization risk, or the generalization error, of every possible hypothesis in our class. So pick one hypothesis; that's going to be somewhere on this axis; calculate the generalization error, not the empirical error but the generalization error, and that's the height of that curve. [01:01:15] And we also have something like this dotted line, which corresponds to ε̂(h). Now let's sample a set of m examples, calculate the empirical error of all the hypotheses in our class, and plot that as a curve. Any questions on what these two curves are? [01:01:56] [Student question] Yeah, it need not be, and in fact this is very likely not even a nice line like this; you're just thinking of all possible hypotheses, and it need not be convex. This is just to get some intuition on some of these ideas. Yes, so
the black line, the thick black line, is the generalization error of all your hypotheses, right? And let's say you sample some data; let's call it S. On that sample you have a training error for all possible hypotheses; we haven't learnt anything yet. So this is the generalization error, and this is the empirical error for the given S. [01:02:47] Now, in order to apply Hoeffding's inequality here, let's consider some hᵢ. This is some hypothesis; we start with some fixed hypothesis, so think of this as starting with some parameter. [01:03:20] And the height of this line up to the thick black curve is basically the generalization error of hᵢ, so let me call this ε(hᵢ). And the height to
the dotted curve, up to here, is ε̂(hᵢ); I'm going to ignore the S subscript for now, and this corresponds to the sample that we obtained. [01:04:21] Now, one thing you can check is that the expected value of ε̂(hᵢ) equals ε(hᵢ), where the expectation is with respect to the data, the sample. What this means is: for one particular sample, this is the empirical error you got; take another set of m samples, and that curve might look some other way, and the height of the dotted line would be different. [01:05:03] In general, on average, if you average across all possible training samples that you can get, the expected value of the height to the dotted line is going to be the height to the thick line; that's just a fact. Now here, if you apply Hoeffding's inequality, you basically get: the probability of the absolute difference between the empirical error
versus the generalization error being greater than γ is less than or equal to 2 exp(−2γ²m): P(|ε̂(hᵢ) − ε(hᵢ)| > γ) ≤ 2 exp(−2γ²m). This is basically the Hoeffding inequality we have right here, except in place of φ and φ̂ we have the true generalization error and the empirical error. Any questions on this so far? [01:06:02] So what we are saying is, essentially, that the gap between the generalization error and the empirical error being greater than some margin γ is going to be bounded by this expression. So loosely speaking, what this means is: as we increase the size m, if we plot the set of all dotted lines for a larger m, they are going to be more concentrated around the black line. Does that make sense? [01:06:51] So take a moment and think about it. This dotted line corresponds to an S of some particular size m. We could take another sample of a fixed set of examples, and that might
look something like this, and take another sample of size m, and that might look something like this. [01:07:17] Now consider the set of all deviations from the black line to every possible dotted line, along the vertical line at hᵢ. This gap is greater than some margin γ with probability less than this term over here. So it essentially means that if you start plotting dotted lines with a bigger m, where the set of all those dotted lines corresponds to a bigger m, they are going to be much more tightly concentrated around the true generalization error of that h. [01:08:00] Does that make sense? You're basically applying Hoeffding's inequality to this gap over here; that's basically what you're doing. [Student comment] No, that's good, but there's a problem here: the problem is that we started with some fixed hypothesis and then averaged across all possible data that you could sample.
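[Editor's aside: the "dotted lines hug the black line as m grows" picture can be simulated directly. A minimal sketch, assuming one fixed hypothesis whose true 0-1 error is eps = 0.25 (a made-up number); each training error is then an average of m Bernoulli(eps) loss indicators.]

```python
import numpy as np

rng = np.random.default_rng(3)
eps = 0.25  # assumed true generalization error of one fixed hypothesis h_i

def worst_gap(m, datasets=2_000):
    """Largest |eps_hat - eps| seen over many resampled datasets of size m."""
    eps_hat = rng.binomial(1, eps, size=(datasets, m)).mean(axis=1)
    return float(np.abs(eps_hat - eps).max())

# Dotted lines drawn for the larger m sit much closer to the thick black line.
print("m=50:  ", worst_gap(50))
print("m=5000:", worst_gap(5000))
```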
[01:08:25] But in practice this is useless, because in practice we start with some data and run the empirical risk minimizer to find the lowest-training-error h for that particular data. Which means that h and the data that you have are not really independent: you chose the h to minimize the empirical risk for the particular data that you were given in the first place. [01:08:56] So to fix this, what we want to do is basically extend this result that we got to account for all h. Now, if we want to get a probabilistic bound on the gap between the generalization error and the empirical error for all h, what's that bound going to look like? [01:09:34] This is basically called uniform convergence. This result is called uniform convergence because we
because we are trying to we are trying to see how [01:09:41] are trying to we are trying to see how the risk curve converges uniformly to [01:09:45] the risk curve converges uniformly to the generalization risk how the [01:09:47] the generalization risk how the empirical risk curve uniformly converges [01:09:50] empirical risk curve uniformly converges to the generalization risk curve and and [01:09:52] to the generalization risk curve and and it's that that's called uniform [01:09:55] it's that that's called uniform convergence which you can apply to [01:09:56] convergence which you can apply to functions in general but here we are [01:09:57] functions in general but here we are applying to the risk curves across our [01:10:00] applying to the risk curves across our hypotheses and we can show I'm gonna [01:10:04] hypotheses and we can show I'm gonna just skip the math so this we showed [01:10:09] just skip the math so this we showed using halflings inequality and you can [01:10:12] using halflings inequality and you can apply the Union bound for unioning [01:10:15] apply the Union bound for unioning across all age except we can first we're [01:10:20] across all age except we can first we're going to limit ourselves to correct so [01:10:25] going to limit ourselves to correct so let me start over so we got this bound [01:10:28] let me start over so we got this bound for a fixed edge right but we are [01:10:31] for a fixed edge right but we are interested in getting the bound for any [01:10:34] interested in getting the bound for any possible edge right so that's our next [01:10:36] possible edge right so that's our next step right and the way we're going to [01:10:39] step right and the way we're going to going to extend this point wise result [01:10:41] going to extend this point wise result to across all of them is going to look [01:10:44] to across all of them is going to look different for two possible cases one is [01:10:46] different for two possible cases 
the case of a finite hypothesis class, and the other case is going to be the case of an infinite hypothesis class. So what does it look like? [01:11:06] So let's first consider finite hypothesis classes. First, we are going to assume that the class H has a finite number of hypotheses. The result by itself is not very useful, but it's going to be a building block for the other case. So let's assume that the number of hypotheses in this class is some number k. [01:11:45] We can show (I'm not going to go over the derivation, but I'm just going to write out the result; it's pretty intuitive) that, basically, if we apply the union bound over all k hypotheses, we end up just multiplying by a factor of k. So what we get is: the probability that there exists some hypothesis h in H such that the empirical error minus the generalization error, in absolute value, is greater
than γ is less than or equal to k times the probability for any single one, which is equal to k · 2 exp(−2γ²m): P(∃h ∈ H : |ε̂(h) − ε(h)| > γ) ≤ 2k exp(−2γ²m). [01:12:49] And then we flip it over, we negate it, and we get the probability that for all hypotheses in our class, |ε̂(h) − ε(h)| < γ, and this is going to be greater than or equal to 1 − 2k exp(−2γ²m). So with probability at least 1 minus this expression, which we can call δ, for all hypotheses the deviation is going to be less than some γ. [01:13:54] This is just Hoeffding's inequality plus the union bound, and negating the two sides gives you this; you can go through this slowly later from the notes, which go over it in more detail. Now basically, let δ = 2k exp(−2γ²m); we now have a relation between δ, γ, and m.
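[Editor's aside: the finite-class bound can also be checked numerically. This sketch uses my own toy setup, not the lecture's: k threshold classifiers on uniform inputs with true label 1{x > 0.5}, so each hypothesis's generalization error is known exactly, ε(h_t) = |t − 0.5|.]

```python
import numpy as np

rng = np.random.default_rng(4)

k, m, gamma, reps = 20, 300, 0.12, 2_000
thresholds = np.linspace(0.05, 0.95, k)   # the k hypotheses h_t(x) = 1{x > t}
gen_err = np.abs(thresholds - 0.5)        # exact eps(h_t) for uniform x

failures = 0
for _ in range(reps):
    x = rng.uniform(0.0, 1.0, m)
    y = (x > 0.5).astype(int)
    preds = (x[None, :] > thresholds[:, None]).astype(int)  # k x m predictions
    emp_err = (preds != y).mean(axis=1)                     # eps_hat(h_t)
    # "Bad" dataset: SOME hypothesis deviates by more than gamma.
    failures += int(np.any(np.abs(emp_err - gen_err) > gamma))

print(failures / reps, "<=", 2 * k * np.exp(-2 * gamma**2 * m))
```

[The observed rate of bad datasets should sit below 2k exp(−2γ²m), typically far below, since the union bound ignores how correlated the k deviations are.]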
[01:14:38] Here δ is like the probability of error, and by error I mean that the empirical risk and the generalization risk are farther apart than some margin; γ is called the margin of error, and m is your sample size. [01:15:00] So what this basically tells you is: if your algorithm is the empirical risk minimizer (it could have been any kind of algorithm, but if it is the kind that minimizes the training error), then you can get a relation between the margin of error, the probability of error, and the sample size. [01:15:27] So what we can do with this relation is basically fix any two and solve for the third, and that gives us some actionable results. For example, you can choose any two and solve for the third; I'm only going to go
over one one one of those so [01:16:04] going to go over one one one of those so let's fix fix gamma and Delta to be [01:16:14] let's fix fix gamma and Delta to be greater than 0 and we solve for M and we [01:16:19] greater than 0 and we solve for M and we get em to be a too many good one over to [01:16:25] get em to be a too many good one over to gamma square Delta so what this means is [01:16:34] gamma square Delta so what this means is with probability at least 1 minus Delta [01:16:37] with probability at least 1 minus Delta which means probably at least 99% 99.9% [01:16:41] which means probably at least 99% 99.9% for example the probability at least 1 [01:16:45] for example the probability at least 1 minus Delta the margin of error between [01:16:49] minus Delta the margin of error between the empirical risk and the true [01:16:52] the empirical risk and the true generalization risk is going to be less [01:16:54] generalization risk is going to be less than gamma as long as your training size [01:16:59] than gamma as long as your training size is bigger than this expression [01:17:01] is bigger than this expression all right that's something actionable [01:17:03] all right that's something actionable for us right now theory can be useful so [01:17:06] for us right now theory can be useful so this is also called the sample [01:17:08] this is also called the sample complexity dessert [01:17:13] right [01:17:14] right and basically what this means is as you [01:17:17] and basically what this means is as you increase em and you sample different [01:17:20] increase em and you sample different sets of data sets your dotted lines are [01:17:25] sets of data sets your dotted lines are going to get closer and closer to to the [01:17:29] going to get closer and closer to to the thick line which means minimizing you're [01:17:32] thick line which means minimizing you're minimizing on the dotted line will also [01:17:36] minimizing on the dotted line will also get you 
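As a quick sanity check on this relation, here is a small sketch (not from the lecture) that solves δ = 2k·exp(−2γ²m) for m; the specific values of k, γ, and δ are hypothetical, chosen only for illustration:

```python
import math

def sample_complexity(gamma: float, delta: float, k: int) -> float:
    """Smallest m (up to rounding) satisfying delta = 2k * exp(-2 * gamma**2 * m):
    with probability >= 1 - delta, all k hypotheses have
    |empirical risk - generalization risk| <= gamma."""
    return math.log(2 * k / delta) / (2 * gamma ** 2)

# Hypothetical numbers: k = 10,000 hypotheses, margin 0.05, confidence 99.9%
m = sample_complexity(gamma=0.05, delta=0.001, k=10_000)
print(math.ceil(m))  # number of training examples needed
```

Note that m grows only logarithmically in the number of hypotheses k, but quadratically as the margin γ shrinks.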
[01:17:38] So this is basically telling [01:17:40] you how minimizing the empirical risk gets you closer to [01:17:47] generalization. Okay, so we [01:17:53] started off with two questions relating [01:17:55] the empirical risk to the generalization [01:17:57] risk; now let's explore the second [01:18:00] question: how does the generalization [01:18:03] error of our minimizer compare with the best [01:18:13] possible in class? So let's look at this [01:18:16] diagram again. Let's say we started with [01:18:19] this dotted curve, and the [01:18:21] minimizer of that — sorry, the diagram is a [01:18:28] little off — this is h-hat, and it has a particular [01:18:45] generalization error. That is the [01:18:48] point: let's assume we got this data [01:18:51] set, we ran the empirical risk minimizer, [01:18:54] and we obtained this hypothesis; when [01:18:57] we deploy this in the real [01:18:58] world, its error is [01:19:01] going to be so much. Now how does this compare to the [01:19:05] performance of the minimizer of the best possible — [01:19:16] this is h-star, the best-in-class? Now [01:19:23] we want to get a relation between this [01:19:25] error level and that error level: [01:19:28] we got one bound that relates this to [01:19:32] this, and now we want something that [01:19:33] relates this to this. How do we do [01:19:37] that? It's pretty straightforward. [01:19:46] The generalization error of h-hat — that's [01:19:49] this dot over here — is [01:19:52] less than or equal to the empirical risk of h-hat [01:19:58] plus gamma: we got a result, using [01:20:04] Hoeffding and the union bound, that the gap [01:20:07] between the dotted line and the [01:20:09] thick black line is always less than [01:20:11] gamma — and it's the absolute value, [01:20:14] so we can write it this way as [01:20:17] well. So basically we [01:20:22] started from the thick black line and [01:20:25] dropped down to the dotted line. And this [01:20:30] is going to be less than or equal to the empirical [01:20:34] error of h-star plus gamma. Why is that? [01:20:45] Because the empirical [01:20:49] error of h-hat, by definition, is less [01:20:51] than or equal to the empirical error of [01:20:53] any other hypothesis, including the [01:20:54] best-in-class — [01:20:56] because this is the training error, not [01:20:58] the generalization error. [01:21:06] So we dropped from the [01:21:09] generalization error to the training error, and we said [01:21:12] this training error is [01:21:15] always going to be less than or equal to the [01:21:20] empirical error of the best-in-class — you [01:21:23] can see that the best-in-class was higher [01:21:24] on this particular training set. [01:21:26] And this gap, again, is also [01:21:31] bounded, because we proved uniform [01:21:32] convergence — the gap between the [01:21:34] dotted line and thick line is bounded by [01:21:36] gamma for any h — and this is [01:21:40] therefore at most the generalization error of h-star plus 2 gamma, because we [01:21:50] added the extra margin.
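Written out, the chain of inequalities on the board is (using ε for generalization risk and ε̂ for empirical risk):

```latex
\begin{aligned}
\varepsilon(\hat h) &\le \hat\varepsilon(\hat h) + \gamma
  &&\text{(uniform convergence, applied to } \hat h\text{)} \\
&\le \hat\varepsilon(h^\ast) + \gamma
  &&\text{(}\hat h\text{ minimizes the empirical risk)} \\
&\le \varepsilon(h^\ast) + 2\gamma
  &&\text{(uniform convergence, applied to } h^\ast\text{)}
\end{aligned}
```

with all three steps holding simultaneously with probability at least 1 − δ.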
[01:21:59] So we wanted a relation between our hypothesis's generalization error and the [01:22:00] generalization error of the [01:22:03] best-in-class hypothesis: we dropped [01:22:07] from the generalization error to the empirical error of our hypothesis, [01:22:09] related that to the empirical error of [01:22:11] the best-in-class, and then bounded [01:22:14] the gap between those two. So we got a [01:22:16] bound relating the generalization error of our hypothesis to [01:22:20] the best-in-class generalization error. Any [01:22:23] questions on this? So the result [01:22:33] basically says: with probability 1 minus [01:22:37] delta, and for training set size m, the [01:22:47] generalization error of the hypothesis [01:22:50] from the empirical risk minimizer is [01:22:52] going to be within the best-in-class [01:22:58] generalization error plus 2·sqrt((1/(2m)) · log(2k/δ)). [01:23:10] You can [01:23:15] get this from the earlier expression: [01:23:23] if you set it equal to [01:23:25] delta and solve for gamma, you will get [01:23:27] this. Any questions? [01:23:33] I think we are already over time.
[01:23:41] The case for infinite classes is an [01:23:44] extension to this — maybe I'll just write [01:23:46] the result. There is a concept called [01:23:48] VC dimension, [01:23:49] which is a pretty simple concept, but [01:23:53] we won't be going over it today. You can think of the VC [01:24:00] dimension as [01:24:02] trying to assign a size to an [01:24:07] infinite-size hypothesis [01:24:09] class: for a finite hypothesis class [01:24:11] we had the cardinality — the size [01:24:12] of the hypothesis class — and the VC dimension of some [01:24:16] hypothesis class is going to be some [01:24:19] number which is [01:24:22] like the size of that [01:24:25] class; it's basically telling you [01:24:26] how expressive it is. There are [01:24:34] very nice geometric interpretations of VC [01:24:37] dimension, and using it you can get a [01:24:39] similar bound — but now it's [01:24:44] not for finite classes anymore — in some big-O [01:24:53] form. [01:25:20] So in place of this margin we ended up [01:25:24] with a different margin that is a [01:25:26] function of the VC dimension, and the [01:25:32] key takeaway from this is that the [01:25:37] number of data examples — the sample [01:25:40] complexity — that you want is generally [01:25:43] on the order of the VC dimension to [01:25:46] get good results. That's basically the [01:25:48] main result there. With [01:25:52] that, I guess we will break for [01:25:55] the day and we'll take more questions.

================================================================================
LECTURE 010
================================================================================
Lecture 10 - Decision Trees and Ensemble Methods | Stanford CS229: Machine Learning (Autumn 2018)
Source: https://www.youtube.com/watch?v=wr9gUr-eWdA
---
Transcript

[00:00:03] Hello everyone, my name is Rafael [00:00:08] Townsend, I'm one of the head TAs for this [00:00:10] class. This week Andrew is travelling and my [00:00:13] advisor is still dealing with medical [00:00:15] issues, so I'm going to be giving today's [00:00:17] lecture. You heard from my wonderful [00:00:19] co-head TA a couple of weeks ago, and [00:00:22] today we're going to be going over decision [00:00:24] trees and various ensemble methods.
[00:00:28] So these might seem a bit like disparate [00:00:29] topics at first, but really decision [00:00:31] trees are sort of a classical example of a [00:00:33] model class to use with various ensemble [00:00:36] methods — we're going to get into [00:00:37] why in a bit. But just to give you [00:00:40] guys an overview of the outline: we're going [00:00:42] to be refreshing on decision trees, [00:00:43] then we're going to go over [00:00:45] ensemble methods in general, and then go [00:00:46] specifically into bagging, random forests, [00:00:48] and boosting. Okay, so let's get started: [00:00:53] first let's cover decision trees. [00:01:05] So last week [00:01:07] Andrew was covering SVMs, which are [00:01:09] one of the classical linear models, [00:01:11] and that brought to a close a lot of the [00:01:13] discussion of linear models; [00:01:15] today we're going to be getting to decision [00:01:17] trees, which is really one of our first [00:01:18] examples of a non-linear model. To [00:01:21] motivate these, let me give you [00:01:23] an example. Okay, so I'm Canadian, I really [00:01:27] like to ski, so I'm going to motivate it [00:01:29] using that. Pretend you have a [00:01:31] classifier that, given a time and a [00:01:34] location, tells you whether or not you [00:01:36] can ski — it's a binary classifier [00:01:38] saying yes or no. So you can [00:01:40] imagine a graph like this, and on the [00:01:44] x-axis we're going to have time in months, [00:01:47] counting from the start — so starting [00:01:53] at 1 for January through 12 for December — and [00:01:57] on the y-axis we're going to use [00:01:59] latitude in degrees. For [00:02:03] those of you who might have forgotten [00:02:04] what latitude is: at [00:02:07] positive 90 degrees you're at the North [00:02:09] Pole, at negative 90 degrees you're at [00:02:11] the South Pole, [00:02:17] and zero is the equator — it's your [00:02:20] location along the north-south axis. [00:02:23] So given this, if you recall, [00:02:27] winter in the northern hemisphere [00:02:29] generally happens in the early months of [00:02:31] the year, so you might see that you can [00:02:33] ski in these early months over here and [00:02:35] have some positive data points, and then [00:02:37] again in the later months; and in [00:02:42] the middle you can't really ski. Whereas [00:02:46] in the southern hemisphere it's [00:02:47] basically flipped: you cannot ski [00:02:50] in the early months, you can ski during [00:02:53] the July/August [00:02:55] time period, and then you can't ski in [00:02:58] the later months. And the equator in [00:03:01] general is just not great for skiing — [00:03:02] there's a reason I don't live there — so [00:03:04] you just have a bunch of negatives [00:03:05] there. Okay, so when you look at a data [00:03:10] set like this, you've got these [00:03:12] separate regions that you're looking at, [00:03:14] and you want to isolate [00:03:15] out those regions of positive examples. [00:03:16] If you had a linear classifier, you'd [00:03:19] be hard-pressed to come up with [00:03:21] any decision boundary that would [00:03:22] separate this reasonably.
[00:03:24] Now you could think, okay, maybe you have an SVM or [00:03:26] something, and you come up with a kernel that [00:03:28] could perhaps project this into [00:03:30] a higher-dimensional feature space that would make [00:03:31] it linearly separable; but it turns out [00:03:33] that with decision trees you have a very [00:03:35] natural way to do this. So to [00:03:39] make clear exactly what we want to do [00:03:41] with decision trees: we want to [00:03:43] partition the space into individual [00:03:45] regions — we want to isolate [00:03:47] out things like the positive examples. [00:03:49] In general this problem of coming up with [00:03:53] the optimal regions is fairly intractable, but the way we do it [00:03:56] with decision trees is in a [00:03:58] greedy, top-down, recursive [00:04:09] manner — this is recursive [00:04:14] partitioning. [00:04:22] It's top-down [00:04:24] because we're starting with the overall [00:04:25] region and we want to slowly partition [00:04:27] it up, and it's greedy because [00:04:29] at each step we want to pick the best [00:04:31] partition possible. [00:04:33] Okay, so let's actually try and work out [00:04:36] intuitively what a decision tree would [00:04:38] do. What we do is we start with [00:04:40] the overall space, and the tree is [00:04:42] basically going to play twenty questions [00:04:44] with this space. So for example, [00:04:47] one question it might ask — if we have [00:04:50] the data coming in like this — is: is the [00:04:53] latitude greater than thirty degrees? [00:04:56] That would involve [00:05:00] cutting the space like this, for example, [00:05:02] and then we'd have a yes or a no. [00:05:08] So starting from the most [00:05:11] general space, we have now partitioned [00:05:13] the overall space into two separate [00:05:15] spaces using this question. And [00:05:18] this is where the recursive part comes [00:05:20] in: now that you've [00:05:22] split the space into two, you can then [00:05:25] treat each individual space as a [00:05:27] new problem, and ask a new question about it.
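The "twenty questions" the tree asks can be pictured as nested threshold tests. Here is a tiny hand-built sketch for the ski example; the thresholds and month ranges are illustrative assumptions for this sketch, not values from the lecture:

```python
def can_ski(month: int, latitude: float) -> bool:
    """A tiny hand-built decision tree for the ski example: each level
    asks one threshold question about one feature. The thresholds and
    month ranges below are illustrative guesses, not fitted values."""
    if latitude > 30:                    # well into the northern hemisphere
        return month < 3 or month > 11   # northern winter: Jan, Feb, Dec
    if latitude < -30:                   # well into the southern hemisphere
        return 6 <= month <= 9           # southern winter: roughly Jun-Sep
    return False                         # near the equator: no skiing
```

Each `if` corresponds to one split node of the tree, and each `return` is a leaf predicting the majority class of its region.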
[00:05:29] So for example, now that you've asked [00:05:31] this latitude-greater-than-30 question, [00:05:33] you could then ask something like: is the month [00:05:37] less than, say, March? That would give you a yes or no, [00:05:44] and what that works out to, effectively, [00:05:46] is that now you've taken this upper [00:05:48] space here and divided it up into these [00:05:53] two separate regions. And so [00:05:56] you can imagine how, through asking [00:05:58] these recursive questions over and over [00:06:00] again, you could start splitting up the [00:06:01] entire space into your individual [00:06:03] regions. Okay, so to make [00:06:13] this a little bit more formal, what we're [00:06:16] looking for is a split function. So you have a region — [00:06:22] let's call that region R_p, [00:06:26] in this case p for parent — and we're [00:06:29] looking for [00:06:33] a split s_p, which you can [00:06:46] write as a function of (j, t), [00:06:53] where j is the feature [00:06:55] number and t is the threshold you're [00:06:57] using. You can write this [00:06:59] out formally as [00:07:00] outputting a tuple of two sets: on the one hand [00:07:03] you have the set of x where x_j — [00:07:08] the j-th feature of x — is less than the [00:07:11] threshold t, with x an element of R_p [00:07:15] (since we're only partitioning that [00:07:16] parent region); and the second set is [00:07:20] the same thing, except it's [00:07:24] those that are greater than or equal to t. [00:07:33] And so we can refer to these [00:07:34] as R_1 and R_2. Any questions so far? [00:07:48] Okay, so we've now defined how we [00:07:51] would do this: we're trying to [00:07:52] greedily pick splits that [00:07:54] partition our input space, and the [00:07:57] splits are defined by which [00:07:58] feature you're looking at and the [00:08:00] threshold you're applying to that [00:08:02] feature. A natural question to [00:08:05] ask now is: how do you choose these [00:08:08] splits?
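A minimal sketch of the split function s_p(j, t) described above, assuming a region is represented as a list of (feature vector, label) pairs; the data set and variable names are made up for illustration:

```python
def split(region, j, t):
    """The split s_p(j, t): partition the parent region R_p into
    R_1 = {x in R_p : x_j < t} and R_2 = {x in R_p : x_j >= t}.
    A region is a list of (feature vector, label) pairs."""
    r1 = [(x, y) for x, y in region if x[j] < t]
    r2 = [(x, y) for x, y in region if x[j] >= t]
    return r1, r2

# Hypothetical ski data: features are (month, latitude), label 1 = can ski
data = [([1, 45], 1), ([7, 45], 0), ([7, -45], 1), ([1, 0], 0)]
r1, r2 = split(data, j=1, t=30)   # threshold the latitude feature at 30
```

Note that the split only looks at one feature and one threshold — that restriction is what makes the greedy search over candidate splits tractable.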
[00:08:25] All right, so I gave this [00:08:27] intuitive explanation that really what [00:08:28] you're trying to do is isolate out the regions of positives and [00:08:30] negatives in this case, and so what's [00:08:33] useful is to define a loss on a region: [00:08:37] define your loss L on R. [00:08:54] For now let's define our loss as [00:08:56] something fairly obvious — your misclassification [00:08:58] loss, which is how many [00:09:00] examples in your region you get wrong. [00:09:02] So assuming that you have C [00:09:10] classes total, you can define p-hat_c to [00:09:26] be the proportion of examples in R that belong to class c. [00:09:52] Okay, and so now that we've got this [00:09:55] definition, where p-hat_c [00:09:57] tells us the proportion of examples [00:09:59] in each class, you can [00:10:01] define the loss of any region — let's call it the misclassification [00:10:09] loss — as just L_misclass(R) = 1 − max over c of p-hat_c. [00:10:19] And the reasoning behind this is [00:10:22] basically that for any [00:10:24] region you've subdivided, generally what [00:10:26] you'll want to do is predict the most [00:10:28] common class there, which is just the class with [00:10:30] maximum p-hat_c, and then all [00:10:33] the remaining probability just gets [00:10:35] counted as misclassification error. [00:10:37] And so then once we have a [00:10:44] loss defined, we want to pick the split [00:10:48] that decreases the loss as much as [00:10:51] possible. So recall I defined this [00:10:53] parent region R_p and the two [00:10:55] children regions R_1 and R_2; you [00:10:58] basically want to reduce the loss as [00:11:01] much as possible, so you want to [00:11:04] maximize the decrease L(R_p) − [00:11:14] (L(R_1) + L(R_2)) — this [00:11:23] here is your parent loss and this is your children's [00:11:33] loss. And what you're optimizing over [00:11:42] in this case is the (j, t) that we [00:11:44] defined over there, since the split is [00:11:47] really what defines our two [00:11:49] children regions.
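Putting the last two pieces together, here is a sketch of the misclassification loss and the greedy search over (j, t). This is illustrative, not the lecture's code, and it uses unweighted child losses as written on the board; a common refinement weights each child's loss by its share of the examples:

```python
from collections import Counter

def misclassification_loss(labels):
    """L_misclass(R) = 1 - max_c p_hat_c: predict the region's majority
    class; the rest of the probability mass is misclassified."""
    if not labels:
        return 0.0
    return 1.0 - max(Counter(labels).values()) / len(labels)

def best_split(region, num_features):
    """Greedy split search over (j, t). The parent loss L(R_p) is fixed,
    so maximizing the decrease is the same as minimizing the children's
    loss L(R_1) + L(R_2)."""
    best = None
    for j in range(num_features):
        for t in sorted({x[j] for x, _ in region}):
            labels1 = [y for x, y in region if x[j] < t]
            labels2 = [y for x, y in region if x[j] >= t]
            score = misclassification_loss(labels1) + misclassification_loss(labels2)
            if best is None or score < best[0]:
                best = (score, j, t)
    return best  # (children's loss, feature index j, threshold t)
```

Trying every observed feature value as a threshold is enough, since any threshold between two consecutive values induces the same partition.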
[00:11:52] What you'll notice is that the loss of the [00:11:53] parent doesn't really matter in this [00:11:55] case, because that's already fixed, so [00:11:57] really all you're trying to do is [00:11:58] minimize the sum of the losses of [00:12:01] your children. [00:12:03] Okay, so let's move to this next board. [00:12:27] So I started to define this misclassification [00:12:28] loss; let's get a little [00:12:29] bit into why misclassification [00:12:31] loss isn't actually the [00:12:32] right loss to use for this problem. [00:12:51] And so for a simple example — [00:12:54] I've drawn out a tree [00:12:56] like this — let's pretend that we [00:12:58] have a setup here where we're [00:13:03] coming into a decision node, and at this [00:13:05] point we have 900 positives and 100 [00:13:08] negatives. So this is a [00:13:11] misclassification loss of 100 in this case, [00:13:14] because you'd predict the most common [00:13:15] class and end up with 100 misclassified [00:13:17] examples. And so this would be your [00:13:21] region R_p right now. [00:13:24] Then you can split it into two [00:13:26] other regions, say R_1 and R_2, and [00:13:35] say that what you've achieved now is [00:13:37] 700 positives and 100 negatives on [00:13:40] this side versus 200 positives and zero [00:13:46] negatives on this side. Now this [00:13:51] seems like a pretty good split, since [00:13:52] you're sorting out some more examples. [00:13:54] But what you can see is that if you just [00:13:56] drew the same thing again — R_p with [00:14:00] 900 positives and 100 negatives, split — [00:14:08] and say in this case you instead got 400 [00:14:12] positives and 100 negatives over here, and [00:14:18] 500 positives and zero negatives over there — most people would argue [00:14:23] that this right decision boundary is [00:14:25] better than the left one, because you're [00:14:27] basically isolating out even more [00:14:28] positives in this case. However, if you're [00:14:31] just looking at your misclassification [00:14:33] loss — calling the left pair R_1 and R_2 and the [00:14:37] right pair R_1′ and R_2′ — [00:14:39] your loss of R_1 plus R_2 in [00:14:48] the left case is
just one hundred plus [00:14:50] this left case is just one hundred plus zero okay so just one hundred and then [00:14:54] zero okay so just one hundred and then on the right side here it's actually [00:14:56] on the right side here it's actually still just the same alright and in fact [00:15:05] still just the same alright and in fact if you look at the original loss of your [00:15:07] if you look at the original loss of your parent it's also just a hundred right so [00:15:10] parent it's also just a hundred right so you haven't really according to this [00:15:12] you haven't really according to this lost metric changed anything at all and [00:15:14] lost metric changed anything at all and so that sort of brings up one problem [00:15:16] so that sort of brings up one problem with the Mis classification loss is that [00:15:17] with the Mis classification loss is that it's not really sensitive enough okay [00:15:21] it's not really sensitive enough okay so like instead what we can do is we can [00:15:24] so like instead what we can do is we can define this cross-entropy loss okay [00:15:44] which will define as L cross [00:15:51] let me just write this out here [00:16:00] and so really what you're doing is [00:16:02] and so really what you're doing is you're just summing over the classes and [00:16:04] you're just summing over the classes and it's the probability the proportion of [00:16:06] it's the probability the proportion of elements in that class times the log of [00:16:08] elements in that class times the log of the proportion in that class and how you [00:16:10] the proportion in that class and how you can think of this is it's sort of a this [00:16:12] can think of this is it's sort of a this concept that we borrow from information [00:16:14] concept that we borrow from information theory which is sort of like the number [00:16:16] theory which is sort of like the number of bits you need to communicate to tell [00:16:19] of bits you need to communicate to 
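To make the insensitivity concrete, here is a small sketch (my own illustration, not code from the lecture) that evaluates both candidate splits from the example above under the two losses; the counts are the ones on the board:

```python
import math

def misclassification_loss(pos, neg):
    # Predict the majority class; the minority count is the number of mistakes.
    return min(pos, neg)

def cross_entropy_loss(pos, neg):
    # L_cross = -sum_c p_c * log2(p_c), treating 0 * log(0) as 0.
    total = pos + neg
    return sum(-c / total * math.log2(c / total) for c in (pos, neg) if c > 0)

parent = (900, 100)
split_left = [(700, 100), (200, 0)]    # R1, R2
split_right = [(400, 100), (500, 0)]   # R1', R2'

for name, split in [("left", split_left), ("right", split_right)]:
    mis = sum(misclassification_loss(p, n) for p, n in split)
    # Weight each child's cross-entropy by its share of the 1000 examples.
    ce = sum((p + n) / 1000 * cross_entropy_loss(p, n) for p, n in split)
    print(name, mis, round(ce, 4))
```

Both splits come out at 100 misclassified examples, exactly like the parent, while the weighted cross-entropy is lower for the right split, matching the intuition that it is the better one.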
[00:16:24] And so that sounds like a mouthful, but really you can think of it intuitively: if someone already knows the probabilities, say it's a hundred percent chance that it's one class, then you don't need to communicate anything to tell them exactly which class it is, because it's obvious that it's that one class. Versus if you have a fairly even split, then you need to communicate a lot more information to tell someone exactly what class you're in. Any questions so far? Yep?

[00:17:05] [Student asks about R1 and R2 versus the parent region.] Yeah, so for that case there, I'll try and reach up there, but say R_P was your start region, the overall region. Then R1 would be all the points above this latitude-30 line, and R2 would be all the points below the latitude-30 line. Yep?

[00:17:39] [Student question.] Yeah, so the question is: when you're trying to minimize this loss here, is it the same as maximizing the children's loss? And it turns out it doesn't really matter which way you put it. Basically you're either trying to minimize the loss of the children or maximize the gain in information. [00:18:18] Yeah, you're right, that should actually be a max. Let me fix that really quick: you start with your parent loss and then you're subtracting out your children's loss, so the higher this quantity is, the better. Yeah, so you really want to maximize this guy. Makes sense, everyone? Thanks for that.
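The bits intuition can be checked with a couple of lines (a minimal sketch of my own, not from the lecture): a certain class costs zero bits, an even split costs a full bit, and a 90/10 split sits in between:

```python
import math

def entropy_bits(p):
    # Average number of bits to communicate the class when P(positive) = p.
    return sum(-q * math.log2(q) for q in (p, 1 - p) if q > 0)

print(entropy_bits(1.0))   # one class is certain: nothing to communicate
print(entropy_bits(0.5))   # even split: one full bit per example
print(entropy_bits(0.9))   # mostly one class: somewhere in between
```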
[00:19:02] Okay, so I've sort of given this hand-wavy... oh sure, what's up? [Student question.] So the question is, for the cross-entropy loss, is it log base 2 or log base c? It's log base 2, okay? Yep? Or sorry, I didn't quite hear that. [00:19:42] Okay, so the question is: what is the proportion that's correct versus incorrect for these two examples we've worked through here? So basically, what we're starting with is 900 positives and 100 negatives. So you can imagine, if you just stopped at this point, you would just classify everything as positive, and so you'd get the 100 negatives incorrect, because this is 900 positives and 100 negatives. So if you just stopped here and tried to classify given this whole region R_P, you would end up getting 10% of your examples wrong. In this case we're not talking about percentages, though, we're talking about the absolute number of examples that we've gotten wrong; you can also definitely talk in terms of percentages instead.

[00:20:29] And then down here, once you've split it, now you've got these two sub-regions, and on this left one here you still have more positives than negatives, so you're still going to classify positive in this leaf, and you're still going to classify positive in this leaf too, because the positives are still the majority class there. And in this case, where you have zero negatives, you're not going to make any errors in your classification, versus in this case you're still going to make 100 errors. And so what I'm saying is that at this level, if we just look above this line, you're making 100 mistakes, and then below this line you're still making 100 mistakes. So the loss in this case is not very informative.

[00:21:13] So this p̂, okay, I'm being a little bit loose with the notation here, but the p̂ in this case is a proportion. Basically it's a question of whether you're normalizing the whole thing or not.

[00:21:35] Okay, so I've given a bit of a hand-wavy explanation as to why misclassification loss versus cross-entropy loss might be better or worse. We can actually get a fairly good intuition for why this is the case by looking at it from a sort of geometric perspective. So pretend now that you have this plot. What you're plotting here is: pretend you have a binary classification problem, so it's just, is it the positive class or the negative class? And so you can represent p̂ as the proportion of positives in your set, and what you've got plotted up here is
your loss. [00:22:19] For cross-entropy loss, your curve is going to end up looking like this strictly concave curve. And what you can do is look at where your children versus your parent would fall on this curve. So say that you have two children: you have one up here, let's call this L(R1), and you have one down here, L(R2). And say that you have an equal number of examples in both R1 and R2, so they're equally weighted. When you're looking at the overall loss between the two, that's really just the average of the two, so you can draw a line between these two points, and the midpoint turns out to be the average of your two losses. So this is (L(R1) + L(R2)) / 2; that's what this guy is. And what you can notice is that in fact the loss of the parent node is actually just this point projected upwards onto the curve, so this would be your L(R_parent), and this difference right here is sort of your change in loss. Does this make sense? Any questions?

[00:24:04] Okay, so just to recap: say we have two children regions, and they have different probabilities of positive examples occurring. One would fall on this point on the curve, and say the other one falls on this point on the curve. Then the average of the two losses falls on the midpoint between these two original losses, and if you look at the parent, it's really just halfway between on the x-axis, and you can project upwards for that as well, and you end up with the loss of the parent. What's up?

[00:24:44] [Student question.] Okay, so what we're looking at here is the cross-entropy loss. You've got this function here, this L_cross, and that's in terms of the p̂_c's.
[00:24:55] And in this case here we're just assuming that we have two classes. And so what we're doing is we're just modifying the p̂_c, changing it along the x-axis, and then we're looking at what the response of the overall loss function is on the y-axis. So what I just did here is, this curve just represents, for any p̂_c, what the cross-entropy loss would look like. And so we can come back to this example. If we look at this parent here, this guy has a 10%... right, it's sort of like p̂ for this guy is 0.1, it's 10% basically. Or, I guess, no, in this case it would be 0.9, sorry. And then versus here, in these two cases, your p̂ in this case is 1, since you've got them all right, and then in this case it's 0.8. And so you can sort of see, since these are equal, there's the same number of examples in both of these, the p̂ of the parent is just the average of the p̂'s of the children. And so that's how we can sort of take this L(R_parent): this L(R_parent) is just halfway, if we projected this down. Let me just erase this a little bit here. [00:26:13] If we projected this down like this, we'd see that this point here is the midpoint. But then when you're actually averaging the two losses after you've done the split, you're just taking the average loss: you're just summing L(R1) plus L(R2), and if you're taking the average then you're dividing by 2, and what you can do is just draw the line and take the
midpoint of this line instead. [00:26:52] Yeah, yeah, exactly. So really, it's a good point: the question was, if you have an uneven split, what would that look like on this curve? At this point I've been making the math easy by saying there's an even split, but really, if there was a slightly uneven split, the average would just be some other point along this line that you've drawn. And as you can see, the whole thing is strictly concave, so any point along that line is going to lie below the original loss curve for the parent. So basically, as long as you're not picking the exact same points on the probability curve, and thereby not making any gain at all in your split, you're going to gain some amount of information through this split, okay?
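For the uneven case, the same check works with size weights (again my own sketch, not lecture code; the counts are from the left-hand example, which splits 800/200):

```python
import math

def ce_loss(p):
    # Binary cross-entropy curve: L(p) = -p*log2(p) - (1-p)*log2(1-p).
    return sum(-q * math.log2(q) for q in (p, 1 - p) if q > 0)

def information_gain(child1, child2):
    # Parent loss minus the size-weighted children losses: the quantity to maximize.
    (p1, n1), (p2, n2) = child1, child2
    t1, t2 = p1 + n1, p2 + n2
    total = t1 + t2
    parent = ce_loss((p1 + p2) / total)
    children = (t1 / total) * ce_loss(p1 / t1) + (t2 / total) * ce_loss(p2 / t2)
    return parent - children

print(information_gain((700, 100), (200, 0)))   # uneven sizes, still a positive gain
print(information_gain((450, 50), (450, 50)))   # same proportions: essentially zero gain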
[00:27:54] Now, this was the cross-entropy loss. If instead we look at the misclassification loss over here, let's draw this one instead. [00:28:32] Well, we can see in this case, if you draw it, that it's in fact really this pyramid kind of shape: it's just linear, and then it flips over once you start classifying the other side. And if you did the same argument here, where you had L(R1) and L(R2) and then you drew a line between them, that line is basically just still the loss curve, and so in this case your midpoint would be the same point as your parent. So your loss of R_parent in this case would equal your loss of R1 plus loss of R2, divided by 2. And so in this case, even though according to the cross-entropy formulation you do have a gain in information, and intuitively we do see a gain in information, over here for the misclassification loss, since it's not very sensitive, if you end up with points on the same side of the curve then you actually don't see any sort of information gain under this kind of representation.
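The same chord construction shows the failure numerically (my own sketch, not lecture code): the misclassification curve is min(p, 1 - p), which is linear on each side of p = 0.5, so two children on the same side give a chord that lies on the curve itself:

```python
def mis_loss(p):
    # Misclassification rate when predicting the majority class.
    return min(p, 1 - p)

p1, p2 = 1.0, 0.8                      # both children on the p > 0.5 side
parent_loss = mis_loss((p1 + p2) / 2)  # curve value at the parent proportion
chord_midpoint = (mis_loss(p1) + mis_loss(p2)) / 2

# The chord coincides with the locally linear curve: no measurable gain.
print(parent_loss, chord_midpoint)
```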
[00:29:37] And so there's actually a couple of options here. I presented the cross-entropy loss; there's also the Gini loss, which is another one, which people just write out as the sum over your classes of p̂_c times (1 - p̂_c):

L_gini = Σ_c p̂_c (1 - p̂_c)

And it turns out that this curve also looks very similar to this original cross-entropy curve, and what you'll see is that actually most curves that are successfully used for decision splits look basically like this strictly concave function, okay? [00:30:13] So that covers a lot of the criteria we use for splits. Let's look at some extensions for decision trees. [00:30:38] I'm gonna keep this guy. [00:30:58] Okay, so far I've been talking about decision trees for classification. You could also imagine having decision trees for regression, and people generally call these regression trees. [00:31:11] So taking the ski example again, let's pretend that instead of now predicting whether or not you
can ski, you're predicting the amount of snowfall you would expect in that area around that time. And so let's just say it's inches of snowfall per day or something. And maybe you have some values up here, some high values because it's winter over there; it's mostly zeros over here because you're in summer; then you have some more high values over here; and then you have zeros along the equator, and again zeros in the southern hemisphere during our winter, like this. And you can sort of see how you'd do just the exact same thing: you still want to isolate out regions and sort of increase the purity of those regions, so you could still create your trees like this, split out like this, for example. And what you do when you get to one of your leaves is, instead of just predicting a majority class, what you
can do is predict the mean of the values left. So for a region R_m you're predicting ŷ_m, which is the sum over all the indices in R_m of (y_i - ŷ_m), and you want the squared loss, and then, I guess in this case, you want to normalize by the overall cardinality of R_m, or how many points you have. And so in this case basically all you've done is you've switched your loss function... or, no, sorry, that's wrong; I got a little bit ahead of myself. This is actually just the mean value, which would just be this in this case: you're just summing all the values within your region, so in this case 7, 9, 8, 10, and then just taking the average of that:

ŷ_m = (1 / |R_m|) Σ_{i ∈ R_m} y_i

But then what I was starting to write out there was actually really the loss that you would use in this case, which is your squared loss. [00:33:42] So we'll just call that L_squared, which in this case would be equal to:

L_squared = Σ_{i ∈ R_m} (y_i - ŷ_m)² / |R_m|

And that's what I started to write over there. But in this case you have your mean prediction, and then your loss is how far off your mean prediction is from the actual values in this region. Yep?

[00:34:33] So that's a really good question. The question was: how do you actually search for your splits? How do you actually solve the optimization problem of finding these splits? And it turns out that you can actually basically brute-force it very efficiently. I'm going to get into sort of the details of how you do that shortly, but it turns out that you can just go through everything fairly quickly. I'll get into that; I think that's in a couple of sections from now. Any other questions? Okay, so this is for regression trees.
It turns out that another useful extension, one that you don't really get for other learning algorithms, is that you can also deal with categorical variables fairly easily. [00:35:32] Basically, for this case you could imagine that instead of having your latitude in degrees, you just have three categories: this is the northern hemisphere, this is the equator, and this is the southern hemisphere. [00:35:53] And then, instead of the sort of initial question we had before, "was latitude greater than thirty," your question could instead be: is the location in {northern hemisphere}? [00:36:15] And you could have basically any sort of subset; you can ask a question about any subset of the categories you're looking at, in this case {northern}. This question would still split out this
top part from these bottom pieces here. [00:36:28] One thing to be careful about, though, is that if you have q categories, then you're basically considering every single possible subset of those categories, and that's 2 to the q possible splits. [00:36:53] So in general you don't want to deal with too many categories, because it quickly becomes intractable to look through that many possible splits. It turns out that in certain very specific cases you can still deal with a lot of categories. One such case is binary classification, where (the math is a little bit complicated for this one) you can sort your categories by the fraction of positive examples in each category, take them in that sorted order, and search through it linearly, and it turns out that that yields optimal splits.
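A minimal sketch of the subset enumeration; the hemisphere categories mirror the example above. Since a subset and its complement describe the same split, q categories actually give 2^(q-1) - 1 distinct questions, which is still exponential in q:

```python
from itertools import combinations

categories = ("northern", "equator", "southern")  # the q = 3 example above

def candidate_splits(cats):
    """All distinct 'is the location in S?' questions.

    A subset and its complement induce the same split, so we keep only
    one of each pair: 2**(q-1) - 1 questions for q categories."""
    seen, splits = set(), []
    for r in range(1, len(cats)):
        for subset in combinations(cats, r):
            key = frozenset(subset)
            if key not in seen and (frozenset(cats) - key) not in seen:
                seen.add(key)
                splits.append(subset)
    return splits

print(candidate_splits(categories))
# [('northern',), ('equator',), ('southern',)]
```

With q = 3 this is cheap, but at q = 30 there would be over half a billion candidate questions, which is why the sorted-order trick for binary classification matters.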
[00:37:36] So decision trees: we can use them for regression, and we can also use them with categorical variables. [00:37:40] One thing that I've not gotten into is that you can imagine that, in the limit, if you grew your tree without ever stopping, you could end up just having a separate region for every single data point that you have, and you could consider that pretty clearly overfitting if you ran it all the way to that completion. [00:37:58] So you can sort of see that decision trees are fairly high-variance models, and one thing that we're interested in doing is regularizing these high-variance models. [00:38:24] And generally, people have solved this problem through a number of heuristics. [00:38:29] So one such heuristic is that if you hit a certain minimum leaf size, you stop splitting that leaf. [00:38:41] For example, in this case, if you only have four examples left in this leaf, then you just stop.
[00:38:47] Another one is that you can enforce a maximum depth, and sort of a related one in this case is a maximum number of nodes. [00:39:09] And then a fourth, a very tempting one to use, I've got to say, is a minimum decrease in loss. [00:39:24] And I say this one's tempting because it's generally not actually a good idea to use this minimum decrease in loss. You can see why by thinking about it this way: if you have any sort of higher-order interactions between your variables, you might have to ask one question that is not very optimal, that doesn't give you that much of a decrease in loss, and then your follow-up question, combined with that first question, might give you a much bigger decrease. [00:39:48] And you can sort of see that in this case, where the initial latitude question doesn't really give us that much of a gain (we still split some positives and negatives), but the combination of the latitude
question plus the time question really nails down what we want. [00:40:01] And if we were looking at it purely from the minimum-decrease-in-loss perspective, we might stop too early and miss that entirely. [00:40:09] So a better way to do this kind of loss-based check is: you grow out your full tree, and then you prune it backwards instead. You grow out the whole thing, and then you check which nodes to prune out. [00:40:21] Pruning: how you generally do this is you have a validation set, and you evaluate what your misclassification error on that validation set would be for each leaf that you might remove. So you would use misclassification error, in this case, with the validation set. [00:41:00] Any questions? Yep? [00:41:09] The minimum decrease in loss, yes, of course. So you'll recall that before, I was talking about this R_p, this loss of the parent, versus
the loss of R_1 plus the loss of R_2, right? Or, I had written out a maximization, basically. [00:41:25] Oh, to be clear, the question is: can you explain a little bit more clearly what this minimum decrease in loss means? So you have your loss of R_1 and R_2 versus your loss of the parent, that is, before the split. [00:41:37] So before the split you have your loss, L(R_p), the loss of the parent, and after the split you have L(R_1) plus L(R_2). [00:42:00] And if the decrease from your parent's loss to your children's loss is not great enough, you might be tempted to say: okay, that question didn't really gain us anything, and therefore we will not actually use that question. But what I'm saying is that sometimes you have to ask multiple questions, sort of suboptimal questions first, to get to the really good questions.
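The parent-versus-children comparison can be sketched with a hypothetical XOR-style interaction, using misclassification count as the loss for simplicity. It shows a first question whose gain is zero on its own, even though it sets up a perfect second split:

```python
def misclass_loss(labels):
    """Points a majority-vote leaf gets wrong: count of the minority class."""
    pos = sum(labels)
    return min(pos, len(labels) - pos)

def split_gain(parent, left, right):
    """Decrease in loss: L(R_p) - (L(R_1) + L(R_2))."""
    return misclass_loss(parent) - (misclass_loss(left) + misclass_loss(right))

# Hypothetical data with an interaction: label = (x1 > 0) XOR (x2 > 0).
pts = [(-1, -1, 0), (-1, 1, 1), (1, -1, 1), (1, 1, 0)]
labels = [lab for _, _, lab in pts]

left = [lab for x1, _, lab in pts if x1 <= 0]   # first question: x1 <= 0?
right = [lab for x1, _, lab in pts if x1 > 0]
print(split_gain(labels, left, right))  # 0: the question looks useless alone

# Following up with "x2 <= 0?" inside each child makes every leaf pure, so a
# minimum-decrease-in-loss stopping rule would have quit too early here.
```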
Especially if you have interaction between your variables, if there's some amount of correlation between your variables. [00:42:37] Okay, so we talked about regularization. I said that we would get to runtime; let's actually just go back up here and cover that really quickly. [00:43:27] Okay, so it'll be useful to define a couple of numbers at this point. Say you have n examples, you have f features, and finally, say the depth of your tree is d. So you have n examples that you trained on, each has f features, and your resulting tree is of depth d. [00:44:06] So at test time, your runtime is basically just your depth: it's just O(d). [00:44:18] And typically, though not in all cases, d is at most around the log of your number of examples. You can sort of think about this as: if you have a fairly balanced tree, you'll end up sort of evenly splitting out all
the examples, recursively doing these binary splits, and so the depth ends up being around the log of that n. [00:44:44] Okay, so at test time you've generally got it pretty quick. At train time, [00:44:55] consider each point: if you return back to this example, you'll see that each point, once you've done a split, only belongs to the left or the right of that split afterwards. Sort of like this point right here: once you've split here, it will only ever be part of this region; it will never be considered on the other side, the right-hand side, of that split. [00:45:18] All right, so if your tree is of depth d, each point is part of O(d) nodes. [00:45:39] And then at each node, you can actually work out that the cost of evaluating that point at train time is just proportional to the number of features, f. [00:46:04] And I won't get too much into the details of why this is, but you can consider that if you're doing binary
features, for example, where each feature is just a yes-or-no of some sort, then if you have f features total, you only have to consider f possible splits, and so that's why the cost in that case would be f. [00:46:23] And if it was instead a quantitative feature, I mentioned briefly that you could sort the feature values and then scan through them linearly, and that also ends up being asymptotically O(f) to do. [00:46:35] Okay: so each point is in at most O(d) nodes, the cost of a point at each node is O(f), and you have n points total, so the total cost is really just O(nfd). [00:47:00] And it turns out that this is actually surprisingly fast, especially if you consider that n times f is just the size of your original design matrix, or your data matrix; your data matrix is of size n times f.
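The sort-then-scan idea for one quantitative feature can be sketched as follows, again using misclassification count as the loss and made-up data; after the initial sort, each candidate threshold is evaluated in O(1) by updating class counts incrementally:

```python
def best_threshold(xs, ys):
    """Best single split 'x > t?' for one quantitative feature.

    Sort once, then sweep candidate thresholds between consecutive values,
    maintaining left/right class counts incrementally.
    Returns (threshold, misclassification_count_after_split)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    total_pos = sum(ys)
    left_pos = left_n = 0
    best = (None, float("inf"))
    for k, i in enumerate(order[:-1]):
        left_pos += ys[i]
        left_n += 1
        right_pos, right_n = total_pos - left_pos, len(xs) - left_n
        loss = min(left_pos, left_n - left_pos) + min(right_pos, right_n - right_pos)
        if loss < best[1]:
            best = ((xs[i] + xs[order[k + 1]]) / 2, loss)
    return best

# Made-up latitudes: negatives below 30 degrees, positives above.
print(best_threshold([10, 20, 25, 35, 40, 50], [0, 0, 0, 1, 1, 1]))  # (30.0, 0)
```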
[00:47:28] And then your runtime is just going through the data matrix at most depth times, and since depth is generally bounded by the log of n, you have a generally fairly fast training time as well. [00:47:41] Any questions about runtime? [00:47:50] Okay. So I've been talking a lot about the good sides of decision trees; they seem pretty nice so far. However, there are a number of downsides too, and one big one is that they don't have additive structure. So let me explain a little bit what that means. [00:48:29] Okay, so let's say now we have an example where you have just two features again, x1 and x2, and say you define a line running through the middle, defined by x1 = x2, and all the points above this line are positive and all the points below it are negative. [00:48:54] Now, a simple linear model like logistic regression would have no issue with this kind of setup, but
for a decision tree, basically, you'd have to ask a lot of questions to even somewhat approximate this line. What you could try is, you're going to say, okay, let's split this way, then something like this, and so on. [00:49:18] And even there, you've asked a lot of questions and you've only gotten a very rough approximation of the actual line that you've drawn in this case. [00:49:27] And so decision trees do have a lot of issues with these kinds of structures, where the features are interacting additively with one another.
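This can be made concrete with a small made-up grid labeled by the diagonal rule: the linear rule itself is perfect, while the best single axis-aligned question (one decision-tree split) falls well short, and a tree only closes the gap by stacking many splits into a staircase:

```python
# Made-up integer grid, labeled positive when x1 > x2 (the diagonal rule).
pts = [(x1, x2) for x1 in range(6) for x2 in range(6) if x1 != x2]
labels = [1 if x1 > x2 else 0 for x1, x2 in pts]

def stump_accuracy(axis, thr):
    """Accuracy of one axis-aligned question 'x[axis] > thr?', allowing
    either labeling of the two leaves."""
    preds = [1 if p[axis] > thr else 0 for p in pts]
    correct = sum(p == y for p, y in zip(preds, labels))
    return max(correct, len(pts) - correct) / len(pts)

best_stump = max(stump_accuracy(a, t) for a in (0, 1) for t in range(6))
linear = sum((x1 > x2) == y for (x1, x2), y in zip(pts, labels)) / len(pts)
print(best_stump, linear)  # the single split trails the linear rule's 1.0
```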
[00:49:40] Okay, so to recap, since we've covered a number of different things about decision trees: there are a number of pluses and minuses to decision trees. [00:49:51] So on the plus side, and I actually think this is an important point, they're pretty easy to explain. If you're explaining what a decision tree is to a non-technical person, it's fairly obvious: okay, you have this tree, and you're just playing twenty questions with your data, letting it come up with one question at a time. [00:50:08] They're also interpretable: you can just draw out the tree, especially for shorter trees, to see exactly what it's doing. [00:50:21] They can deal with categorical variables, [00:50:29] and they're generally pretty fast. [00:50:35] However, on the negative side: one that I alluded to was that they're fairly high-variance models, and so are oftentimes prone to overfitting your data. [00:50:51] They're bad at additive structure. [00:51:00] And then finally, in large part because of those first two, they generally have fairly low predictive accuracy. [00:51:16] I know what you guys are thinking: I just spent all this time talking about decision trees, and now I tell you they actually sort of suck. So why did I actually cover decision trees? And the answer is that, in fact, you can make decision trees a
lot better through ensembling, and a lot of methods, for example the leading methods on Kaggle these days, are actually built on ensembles of decision trees. And they really provide an ideal sort of model framework through which we can examine a lot of these different ensemble methods. [00:51:44] Any questions about decision trees before I move on? [00:51:55] I don't think that's strictly... okay, so the question is: for the cross-entropy loss, does the log need to be base 2? And the answer is, I'm pretty sure that's not really relevant in this case. I'm not a hundred percent sure about that, but I'm pretty sure any base works. [00:52:11] Cross-entropy loss actually initially came out of information theory, where you have computer bits and you're transmitting bits, and so it's useful to think in terms of the bits of information that you can transmit, which is why it originally came up as log base 2 in the initial formulation.
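The hunch here is right: changing the log base only rescales the loss by a constant, since log_b(x) = ln(x) / ln(b), so it can never change which split minimizes the cross-entropy. A quick sketch with made-up leaf distributions:

```python
import math

def cross_entropy(ps, base):
    """H(p) = -sum over classes of p_c * log_base(p_c), for p_c > 0."""
    return -sum(p * math.log(p, base) for p in ps if p > 0)

candidates = [[0.5, 0.5], [0.9, 0.1], [0.25, 0.75]]  # made-up leaf distributions
h_bits = [cross_entropy(p, 2) for p in candidates]
h_nats = [cross_entropy(p, math.e) for p in candidates]

# Base 2 is just the natural-log version divided by ln(2) ...
assert all(abs(b - n / math.log(2)) < 1e-12 for b, n in zip(h_bits, h_nats))
# ... so the ranking of candidate splits, and hence the argmin, is unchanged.
assert sorted(range(3), key=h_bits.__getitem__) == sorted(range(3), key=h_nats.__getitem__)
print("any base gives the same best split")
```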
[00:52:53] Okay, so now let's talk about ensembling. [00:53:06] So why does ensembling help? At some level, you can sort of think back to your basic statistics. Say you have X_i's, which are random variables [00:53:52] that are independent, identically distributed. Probably a lot of you are familiar with this already; we call this i.i.d. [00:54:11] Okay, now say that the variance of one of these variables is sigma squared. [00:54:21] Then what you can show is that the variance of the mean of n of these random variables, written Var((1/n) * sum over i of X_i), is equal to sigma squared over n. [00:54:44] And so each independent variable you factor in is decreasing the variance, and so the thought is that if you can factor in a number of independent sources, you can steadily decrease your variance.
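A quick Monte Carlo check of Var(mean) = sigma^2 / n; all the constants here are made up for illustration:

```python
import random

random.seed(0)
sigma2, n = 4.0, 25          # per-variable variance and number of variables
trials = 20000               # number of simulated means

means = [sum(random.gauss(0, sigma2 ** 0.5) for _ in range(n)) / n
         for _ in range(trials)]
mu = sum(means) / trials
var_of_mean = sum((m - mu) ** 2 for m in means) / trials

print(var_of_mean)  # close to sigma2 / n = 0.16
```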
[00:55:02] Okay. I'll say, though, that this is a little bit simplistic as a way of looking at this, because really all these different things you're factoring together have some amount of correlation with each other, and so this independence assumption is oftentimes not correct. [00:55:23] So suppose instead you drop the independence assumption; [00:55:40] now your variables are just identically distributed, i.d. [00:55:53] And say we can characterize what the correlation between any two X_i's is, and we write that down as rho. [00:56:13] Then you can actually write out the variance of your mean as rho times sigma squared, plus (1 minus rho) over n, times sigma squared. [00:56:38] And so you can sort of see that if they're fully correlated, rho equals one, then the second term will drop to zero and you'll just have sigma squared again, because averaging a bunch of fully correlated variables is just going to give you the original variable's variance.
[00:56:51] Versus if they're completely decorrelated, rho equals zero, then the first term drops to zero and you just end up with sigma squared over n, which gives you the initial independent, identically distributed equation. [00:56:59] And so in this case, really, the name of the game is: you want to factor in as many different models as possible, to increase this n, which drives the second term down; and then on the other hand, you also want to make sure those models are as decorrelated as possible, so that rho goes down and the first term goes down as well. Okay. [00:57:35] And so this gives rise to a number of different ways to ensemble. [00:57:49] And one way you could think about doing this is you just use different algorithms. [00:58:03] This is actually what a lot of people on Kaggle, for example, will do: they'll just take, say, a random forest and an SVM, average them all together, and, you know, that actually works pretty well.
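The formula Var(mean) = rho * sigma^2 + ((1 - rho) / n) * sigma^2 can be sanity-checked by simulating equicorrelated variables; the construction below (a shared component plus an independent one) and all the constants are just illustrative:

```python
import random

random.seed(1)
rho, sigma2, n = 0.3, 1.0, 10   # correlation, per-variable variance, ensemble size

def correlated_mean():
    """Mean of n unit-variance variables with pairwise correlation rho,
    built as sqrt(rho)*shared + sqrt(1-rho)*independent."""
    shared = random.gauss(0, 1)
    xs = [rho ** 0.5 * shared + (1 - rho) ** 0.5 * random.gauss(0, 1)
          for _ in range(n)]
    return sum(xs) / n

trials = 20000
ms = [correlated_mean() for _ in range(trials)]
mu = sum(ms) / trials
var = sum((m - mu) ** 2 for m in ms) / trials

print(var)  # close to rho*sigma2 + (1 - rho)/n * sigma2 = 0.37
```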
[00:58:15] But then you sort of have to spend your time implementing all these separate algorithms, which is oftentimes not the most efficient use of your time. [00:58:22] Another one that people would like to do is just use different training sets. [00:58:39] And again, in this case, you probably spent a lot of effort collecting your initial training set; you don't want your machine learning person to just come and recommend that you go collect a whole second training set, or something like that, to improve your performance. That's generally not the most helpful recommendation. [00:58:56] And so what we're going to cover now are these two other methods that we use to ensemble. One of them is called bagging, which is sort of trying to approximate having different training sets; I'll get into that quickly. And then you also have boosting. [00:59:19] And just so that you have a little bit of context,
we're gonna be [00:59:22] little bit of context we're gonna be using decision trees to talk of a lot [00:59:23] using decision trees to talk of a lot about these models and so bagging you [00:59:26] about these models and so bagging you might have heard of random force that's [00:59:29] might have heard of random force that's a variant of bagging for decision trees [00:59:32] a variant of bagging for decision trees and then for boosting you might have [00:59:36] and then for boosting you might have heard of things like add a boost or XG [00:59:42] heard of things like add a boost or XG boost which are variants of boosting for [00:59:46] boost which are variants of boosting for decision trees okay so that sort of [00:59:53] decision trees okay so that sort of covers that a high level would want to [00:59:55] covers that a high level would want to do these first two are very nice because [00:59:57] do these first two are very nice because they're sort of would give us a much [00:59:58] they're sort of would give us a much more like independently correlated or [01:00:01] more like independently correlated or less correlated variables but generally [01:00:03] less correlated variables but generally we're we end up doing these latter two [01:00:06] we're we end up doing these latter two because we don't want to collect new [01:00:07] because we don't want to collect new training sets or train entirely new [01:00:08] training sets or train entirely new algorithms okay so let's cover bagging [01:00:12] algorithms okay so let's cover bagging first [01:00:21] okay so bagging really stands for this [01:00:24] okay so bagging really stands for this thing it's called bootstrap aggregation [01:00:26] thing it's called bootstrap aggregation okay and so first let's just break down [01:00:42] okay and so first let's just break down this term so bootstrap what that is is [01:00:44] this term so bootstrap what that is is this typically this method use and [01:00:45] this typically 
this method use and statistics to measure the uncertainty of [01:00:48] statistics to measure the uncertainty of your estimate okay and so what what is [01:00:52] your estimate okay and so what what is useful to define in this case for when [01:00:54] useful to define in this case for when you're talking about bagging is you can [01:00:56] you're talking about bagging is you can say that you have a true population P [01:01:06] say that you have a true population P okay and your training set training set [01:01:15] s is sampled from P you just are drawing [01:01:19] s is sampled from P you just are drawing a bunch of examples from P and that's [01:01:21] a bunch of examples from P and that's what forms your training set and so [01:01:24] what forms your training set and so ideally like for example this different [01:01:26] ideally like for example this different training sets approach what you do with [01:01:28] training sets approach what you do with you just draw s1 s2 s3 s4 and then train [01:01:31] you just draw s1 s2 s3 s4 and then train your model and each one [01:01:31] your model and each one separately unfortunately you generally [01:01:34] separately unfortunately you generally don't have the time to do that and so [01:01:36] don't have the time to do that and so what that what bootstrapping does is you [01:01:39] what that what bootstrapping does is you assume basically that your population is [01:01:44] assume basically that your population is your training sample okay so you assume [01:01:48] your training sample okay so you assume that your population is your training [01:01:50] that your population is your training sample and so now that you have this s [01:01:53] sample and so now that you have this s is approximating your P then you can [01:01:55] is approximating your P then you can draw new samples from your population by [01:01:58] draw new samples from your population by just drawing samples from s instead okay [01:02:01] just drawing samples 
from s instead okay so you have bootstrap samples is what [01:02:05] so you have bootstrap samples is what they're called z samples from s and so [01:02:15] they're called z samples from s and so how that works is you basically just [01:02:16] how that works is you basically just take your train your your training [01:02:19] take your train your your training sample okay say it's of like cardinality [01:02:21] sample okay say it's of like cardinality n or something and you're just sample n [01:02:23] n or something and you're just sample n times from s and this is important you [01:02:26] times from s and this is important you do it with replacement because they're [01:02:28] do it with replacement because they're pretending that this is a population and [01:02:30] pretending that this is a population and so doing it with replacement sort of [01:02:32] so doing it with replacement sort of makes it of something hold that you're [01:02:34] makes it of something hold that you're sampling from it as a population okay so [01:02:40] sampling from it as a population okay so that's bootstrapping so you generate all [01:02:42] that's bootstrapping so you generate all these different bootstrap samples Z on [01:02:44] these different bootstrap samples Z on your from your training set and what you [01:02:47] your from your training set and what you can do is you can take your model and [01:02:49] can do is you can take your model and train it on all these separate bootstrap [01:02:51] train it on all these separate bootstrap samples and then you can sort of look at [01:02:53] samples and then you can sort of look at the variability in the predictions that [01:02:55] the variability in the predictions that your model ends up making based on these [01:02:57] your model ends up making based on these different bootstrap samples and that [01:02:59] different bootstrap samples and that gives you sort of a measure of [01:03:00] gives you sort of a measure of uncertainty I'm not going 
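The bootstrap procedure just described can be sketched in a few lines; the function names below are my own, not the lecture's. It draws samples of size n from S with replacement, computes a statistic on each, and uses the spread of those estimates as the uncertainty measure. It also checks, numerically, the fact quoted a bit later in the lecture that each bootstrap sample contains roughly two-thirds of the distinct points of S.

```python
import random

def bootstrap_samples(s, m, seed=0):
    """Draw m bootstrap samples: each is len(s) draws from s WITH
    replacement, treating the training set s as if it were the population P."""
    rng = random.Random(seed)
    n = len(s)
    return [[s[rng.randrange(n)] for _ in range(n)] for _ in range(m)]

def bootstrap_uncertainty(s, statistic, m=2000):
    """Classic bootstrap use: the spread of a statistic across bootstrap
    samples estimates the uncertainty of that statistic computed on s."""
    ests = [statistic(z) for z in bootstrap_samples(s, m)]
    mean = sum(ests) / m
    sd = (sum((e - mean) ** 2 for e in ests) / m) ** 0.5
    return mean, sd

s = [2.1, 3.4, 1.8, 5.0, 2.7, 4.2, 3.9, 2.5]
est, se = bootstrap_uncertainty(s, lambda z: sum(z) / len(z))
print(f"mean of s ~ {est:.2f}, bootstrap standard error ~ {se:.2f}")

# each bootstrap sample keeps about 1 - (1 - 1/n)^n of the distinct points
zs = bootstrap_samples(list(range(1000)), 50)
frac = sum(len(set(z)) for z in zs) / (50 * 1000)
print(f"fraction of distinct points per sample ~ {frac:.2f}")  # ~ 0.63
```

The "about two-thirds" figure that comes up when discussing how far rho can be driven down is exactly this 1 - (1 - 1/n)^n quantity, whose large-n limit is 1 - 1/e, roughly 0.632.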
I won't go into too much detail on that, because it's not actually what we're going to use bootstrapping for. What we want to use bootstrapping for is to aggregate: at a very high level, we're going to take a bunch of bootstrap samples, train a separate model on each, and then average their outputs. Okay, so let's make that a little bit more formal.

[01:03:46] You have bootstrap samples Z_1 through Z_M, say, where capital M is just how many bootstrap samples you're going to take. You train a model G_m on each Z_m, and then all you're doing is defining a new meta-model (I'm not putting a subscript on this one, to show that it's the meta-model), G(x) = (1/M) * sum_{m=1}^{M} G_m(x): the sum of your individual models' predictions divided by the total number of models you have. This just writes out what I was describing up there: for bagging, you take bootstrap samples, train separate models, and aggregate them all together.

[01:05:08] And if we do a little bit of analysis on this from the bias-variance perspective, we can see why this kind of thing might work. [01:05:27] Recall the equation up there: the variance of the mean is rho * sigma^2 + ((1 - rho)/n) * sigma^2. Let me write that out here; in this case our n is really just the number of bootstrap samples, so we'll use capital M: Var = rho * sigma^2 + ((1 - rho)/M) * sigma^2. And what you're doing by taking these bootstrap samples is de-correlating the models you're training: the bootstrapping is driving rho down, and by driving it down you're making this first term
smaller and smaller. Then your question might be: okay, what about this second term here? It turns out that you can basically take as many bootstrap samples as you want: increasing M drives this second term down. And one nice thing about bootstrapping is that increasing the number of bootstrapped models you train doesn't actually cause you to overfit any more than you were overfitting beforehand, because all you're doing is driving down this term; more M just means less variance. So taking more and more bootstrap samples generally only improves performance, and what people will generally do is train more and more models until they see that their error stops going down, because that means they've basically eliminated this term over here.
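To make the aggregation concrete, here is a minimal, self-contained sketch of bagging; the helper names and the depth-one regression stump used as the base learner are my own illustration, not the lecture's notation. It trains one model per bootstrap sample and averages the predictions, i.e. the meta-model G(x) = (1/M) * sum_m G_m(x) defined above:

```python
import random

def bootstrap(data, rng):
    """One bootstrap sample: len(data) draws with replacement."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]

def train_stump(data):
    """Depth-one regression tree on pairs (x, y): pick the threshold
    minimizing squared error, predict the mean of y on each side."""
    best = None
    for t in sorted({x for x, _ in data}):
        left = [y for x, y in data if x <= t]
        right = [y for x, y in data if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - (lm if x <= t else rm)) ** 2 for x, y in data)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def train_bagged(data, m, seed=0):
    """Bagging: train one model G_m per bootstrap sample Z_m and return
    the meta-model G(x) = (1/M) * sum_m G_m(x)."""
    rng = random.Random(seed)
    models = [train_stump(bootstrap(data, rng)) for _ in range(m)]
    return lambda x: sum(g(x) for g in models) / m

data = [(x, 1.0 if x > 5 else 0.0) for x in range(11)]
G = train_bagged(data, m=25)
print(G(8.0), G(2.0))  # close to 1.0 and close to 0.0
```

Raising m here only averages in more bootstrapped models, which, as noted above, lowers variance without overfitting any further.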
So this seems kind of nice, right? You're decreasing the variance; where's the trade-off coming in? Oh, there's a question.

[01:07:32] Yeah, there's definitely a bound, though I'm not going to define one formally right now. The question is: can you define a bound on how much you decrease rho by? There's definitely a lower bound on how far you can decrease rho. It basically comes down to the fact that your bootstrap samples are still fairly highly correlated with one another, because they're all drawn from the same sample set S; each Z is going to contain about two-thirds of S, so your Z's are still going to be fairly highly correlated with each other. And though I don't have a formal equation to write down for exactly how much that bounds rho by, you can sort of see intuitively that there is a bound there, and that you can't just magically decrease rho all the way down to zero and achieve zero variance.

[01:08:30] So, I was saying that you decrease variance; this seems very nice. One issue that comes up with bootstrapping is that you're actually slightly increasing the bias of your models when you do this, and the reason is the subsampling I was just talking about: each of your Z's is now about two-thirds of the original S, so you're training on less data, and your models become slightly less complex, which increases your bias in this case. Yes?

[01:09:14] Yeah, for sure. So the question is: can you explain the difference between a random variable and an algorithm in this case? At a very high level, you can think of an algorithm as a classifier, as a function
[01:09:28] as a classifier that as a function that's taking in some data and making a [01:09:30] that's taking in some data and making a prediction right and if you sort of see [01:09:34] prediction right and if you sort of see those that whole set up as sort of like [01:09:36] those that whole set up as sort of like probably the algorithm is giving some [01:09:38] probably the algorithm is giving some sort of output in the problem holistic [01:09:39] sort of output in the problem holistic perspective you can sort of see the [01:09:41] perspective you can sort of see the algorithm as like a random variable in a [01:09:44] algorithm as like a random variable in a case in this case sort of like you're [01:09:46] case in this case sort of like you're basically considering sort of the space [01:09:49] basically considering sort of the space of possible predictions that your [01:09:51] of possible predictions that your algorithm can make and that you can sort [01:09:53] algorithm can make and that you can sort of see as a distribution of possible [01:09:55] of see as a distribution of possible predictions and that you can approximate [01:09:58] predictions and that you can approximate that as a random variable I mean it is a [01:09:59] that as a random variable I mean it is a random variable at some level because [01:10:01] random variable at some level because it's sort of like based on what training [01:10:04] it's sort of like based on what training sample you end up with your predictions [01:10:06] sample you end up with your predictions of your output model are going to change [01:10:08] of your output model are going to change and so since you're sampling sort of [01:10:10] and so since you're sampling sort of these random samples from your [01:10:12] these random samples from your population set you can consider your [01:10:15] population set you can consider your algorithm as sort of based on that [01:10:16] algorithm as sort of based on that random sample and 
therefore random [01:10:18] random sample and therefore random variable itself okay so yeah your [01:10:24] variable itself okay so yeah your bicycle increased because of random [01:10:31] bicycle increased because of random subsampling [01:10:39] but generally the decrease in variance [01:10:42] but generally the decrease in variance that you get from doing this it's much [01:10:44] that you get from doing this it's much larger than the slight increase in bias [01:10:46] larger than the slight increase in bias you get from from doing this random life [01:10:49] you get from from doing this random life subsampling so in a lot of cases bagging [01:10:51] subsampling so in a lot of cases bagging is quite nice [01:11:08] okay so I've talked a bit about buying [01:11:11] okay so I've talked a bit about buying about bagging let's talk about decision [01:11:13] about bagging let's talk about decision trees plus bagging now okay so you [01:11:25] trees plus bagging now okay so you recall that decision trees are high [01:11:29] recall that decision trees are high variance low bias and this right here [01:11:40] variance low bias and this right here sort of explains why they're pretty good [01:11:42] sort of explains why they're pretty good fit for bagging okay because bagging [01:11:44] fit for bagging okay because bagging what you're doing is you're decreasing [01:11:45] what you're doing is you're decreasing the variance of your models for a slight [01:11:48] the variance of your models for a slight increase in bias and since most of your [01:11:50] increase in bias and since most of your error from your decision trees is coming [01:11:52] error from your decision trees is coming from the high variance side of things by [01:11:55] from the high variance side of things by sort of driving down that variance you [01:11:57] sort of driving down that variance you get a lot more benefit than for a model [01:11:59] get a lot more benefit than for a model that would be on the 
reverse: high bias and low variance. All right, so this makes decision trees an ideal fit for bagging.

[01:12:24] Okay, so that was decision trees plus bagging. I said that random forests are sort of a version of decision trees plus bagging, and what I've described here is actually almost a random forest at this point. The one key piece we're still missing is that random forests introduce even more randomization into each individual decision tree. The idea, as in that question from before, is that you can only drive this rho down so far through pure bootstrapping; but if you can further de-correlate your different random variables, you can drive that variance down even further. And the way random forests do that is, at each split, you consider only a fraction of your total features. [01:13:47] So, for that ski example, maybe for the first split I only let it look at latitude, and then for the second split I only let it look at the time of year. This might seem a little bit unintuitive at first, but you can get the intuition in two ways. One is that you're decreasing rho. The other: say you have a classification example with one very strong predictor that gets you very good performance on its own; regardless of which bootstrap sample you select, your model is probably going to use that predictor as its first split, and that's going to cause all your models to be very highly correlated right at that first split, for example. By instead forcing the trees to sample from different features, you decrease the correlation between your models. So it's all about de-correlating your models in this case.
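The split-level feature restriction can be sketched as below. Everything here (the helper names, the toy data, and the choice of roughly sqrt(d) out of d features, which is a common default rather than something stated in the lecture) is my own illustration:

```python
import math
import random

def best_split(data, feature_ids):
    """Among the allowed features only, pick the (feature, threshold)
    pair minimizing the misclassification count for labels in {0, 1}.
    data: list of (x_vector, y) pairs."""
    best = None
    for j in feature_ids:
        for t in sorted({x[j] for x, _ in data}):
            left = [y for x, y in data if x[j] <= t]
            right = [y for x, y in data if x[j] > t]
            if not left or not right:
                continue
            # each side predicts its majority class
            err = min(sum(left), len(left) - sum(left)) + \
                  min(sum(right), len(right) - sum(right))
            if best is None or err < best[0]:
                best = (err, j, t)
    return best

def random_forest_split(data, n_features, rng):
    """The random-forest twist: at each split, consider only a random
    subset of ~sqrt(d) features, further de-correlating the trees."""
    k = max(1, int(math.sqrt(n_features)))
    return best_split(data, rng.sample(range(n_features), k))

# toy data: the label is exactly feature 0, the "one strong predictor"
data = [((i % 2, i % 3, i % 5, i % 7), i % 2) for i in range(30)]
print(best_split(data, [0]))                           # (0, 0, 0): perfect
print(random_forest_split(data, 4, random.Random(0)))  # may not see feature 0
```

With all features visible, every tree would split on feature 0 first; restricting the candidate set per split is what forces the trees apart.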
[01:14:51] Okay, and that pretty much closes out our discussion of bagging. Are there any questions regarding bagging? Okay.

[01:15:05] Now that I've covered bagging, let's get a little bit into boosting, and I'll make this quick. Basically, whereas with bagging we saw in the intuition that we were decreasing variance, boosting is actually more of the opposite: you're decreasing the bias of your models. It's also more additive in how it does things: you'll recall that for bagging you were taking the average of a number of variables, whereas in boosting you train one model and add its prediction into your ensemble, then train a new model and add that one in as well. That's a little bit hand-wavy right now, so let me make it clear through an example. [01:16:15] Say you have a data set again, with features x1 and x2, and you have some data points; let's just call them pluses and minuses. Say you have some pluses here, and then maybe a couple of minuses and pluses over here. And say you're training size-one decision trees, "decision stumps" as we call them: you only get to ask one question at a time. The reason behind this, really quickly, is that by restricting your trees to be only depth one you're increasing their amount of bias and decreasing their amount of variance, which makes them a better fit for boosting-style methods. Say you come up with a decision boundary, say this one here: on this side you predict positive, and on this side you predict negative. It's a reasonable line that you could draw here, but it's not perfect,
right? You've made some mistakes. In fact, you can identify those mistakes; if we draw them in red, you've gotten these guys wrong. What boosting does is basically increase the weights of the mistakes you've made, so that the next decision stump you train is trained on this modified, re-weighted set, which I'll draw over here. So now I'll draw the positives you got wrong much bigger: you've got big positives here, some small negatives and small positives, and some big negatives here. And now your model, to try to get those right, might pick a decision boundary like this. This is also basically recursive, in that at each step you're going to re-weight each of the examples based on how many of your previous models got it wrong or right in the past.

[01:18:20] And what you also do is weight each one of these classifiers: for classifier G_m you can determine a weight alpha_m that's proportional to how well it did, so a better classifier gets more weight and a bad classifier gets less. I think the exact equation used in AdaBoost, for example, is the log odds, alpha_m = log((1 - err_m) / err_m), where err_m is the error of your m-th model. [01:19:15] Then your total classifier, let's call it G(x) again, is just the sum over m of alpha_m * G_m(x), where each G_m is trained on a re-weighted version of the data set. I've glossed over a lot of the details here in the interest of time, but the specifics of the algorithm will be in the lecture notes. This algorithm is actually known as AdaBoost.
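Here is a compact, self-contained AdaBoost sketch on 1-D data with decision stumps. The function names are my own, labels are taken in {-1, +1}, and I use the common textbook weight alpha_m = (1/2) * log((1 - err_m)/err_m), which differs from the plain log odds quoted above only by a constant factor:

```python
import math

def stump_train(data, w):
    """Weighted decision stump on 1-D data with labels in {-1, +1}:
    try every threshold and orientation, keep the lowest weighted error."""
    best = None
    for t in sorted({x for x, _ in data}):
        for sign in (1, -1):
            err = sum(wi for (x, y), wi in zip(data, w)
                      if (sign if x > t else -sign) != y)
            if best is None or err < best[0]:
                best = (err, t, sign)
    err, t, sign = best
    return err, (lambda x: sign if x > t else -sign)

def adaboost(data, rounds):
    """Each round: fit a stump to the weighted data, give it weight
    alpha = 0.5*log((1-err)/err), then up-weight the examples it got
    wrong; the final classifier is the sign of the weighted sum."""
    n = len(data)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, g = stump_train(data, w)
        err = min(max(err, 1e-10), 1 - 1e-10)   # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, g))
        # boost the weights of the examples this stump got wrong
        w = [wi * math.exp(-alpha * y * g(x)) for (x, y), wi in zip(data, w)]
        z = sum(w)
        w = [wi / z for wi in w]
    return lambda x: 1 if sum(a * g(x) for a, g in ensemble) > 0 else -1

# a pattern no single stump can get right: -1 only in the middle band
data = [(x, -1 if 3 <= x <= 5 else 1) for x in range(10)]
G = adaboost(data, rounds=5)
print(sum(G(x) == y for x, y in data), "of", len(data))  # 10 of 10
```

No single stump classifies this toy set correctly (the best one gets 7 of 10), but a few boosted stumps together drive the training error to zero, which is the bias-reduction behavior described above.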
you can [01:20:14] through similar techniques you can derive algorithms such as XG boost or [01:20:16] derive algorithms such as XG boost or gradient boosting machines that also [01:20:19] gradient boosting machines that also allow you to basically re-weight the [01:20:21] allow you to basically re-weight the examples you're getting right or wrong [01:20:22] examples you're getting right or wrong in this sort of dynamic fashion and [01:20:24] in this sort of dynamic fashion and slowly adding them in is additive [01:20:26] slowly adding them in is additive fashion to your composite model and that [01:20:29] fashion to your composite model and that about finishes it for today thanks for [01:20:32] about finishes it for today thanks for coming great rest of your week ================================================================================ LECTURE 011 ================================================================================ Lecture 11 - Introduction to Neural Networks | Stanford CS229: Machine Learning (Autumn 2018) Source: https://www.youtube.com/watch?v=MfIjxPh6Pys --- Transcript [00:00:03] hello everyone welcome to CS 2 to 9 [00:00:08] hello everyone welcome to CS 2 to 9 today we're going to talk about deep [00:00:11] today we're going to talk about deep learning and neural networks we're going [00:00:15] learning and neural networks we're going to have two lectures on that one today [00:00:17] to have two lectures on that one today and a little bit more of it on Monday [00:00:21] and a little bit more of it on Monday don't hesitate to ask questions during [00:00:24] don't hesitate to ask questions during the lecture so stop me if you don't [00:00:26] the lecture so stop me if you don't understand something and we'll try to [00:00:27] understand something and we'll try to build the intuition around your own [00:00:29] build the intuition around your own Network together we will actually start [00:00:31] Network together we will actually start with an 
We will actually start with an algorithm that you guys have seen previously, called logistic regression. Everybody remembers logistic regression? Okay, remember it's a classification algorithm. We're going to explain how logistic regression can be interpreted as a specific case of a neural network, and then we will move on to neural networks. Sounds good? [00:00:53] So, a quick intro to deep learning. Deep learning is a set of techniques that is, let's say, a subset of machine learning, and it's one of the growing sets of techniques being used in industry, specifically for problems in computer vision, natural language processing, and speech recognition. So you guys have a lot of different tools and plugins on your smartphones that use this type of algorithm. [00:01:29] The reason it came to work very well is primarily the new computational methods.
So one thing we're going to see today is that deep learning is really, really computationally expensive, and people had to find techniques to parallelize the code and use GPUs, graphics processing units, specifically, in order to be able to carry out these computations. [00:01:56] The second part is the data: the data available has been growing, after the internet bubble, with the digitalization of the world. So now people have access to large amounts of data, and this type of algorithm has the specificity of being able to learn when there's a lot of data. These models are very flexible, and the more data you give them, the more they will be able to understand the salient features of the data. [00:02:24] And finally, algorithms: people have come up with new techniques to use the data, use the computational power, and build models. So we're going to touch a little bit on all of that.
But let's go with logistic regression first. Can you guys see in the back? Yeah? Okay. So, you remember what logistic regression is. We're going to fix a goal for us that is a classification goal: let's try to find cats in images. [00:03:12] Find cats in images, meaning binary classification: if there is a cat in the image, we want to output a number that is close to one, presence of the cat, and if there is no cat in the image, we output zero. Let's say for now we're constrained to the fact that there is at most one cat, no more. [00:03:44] If you had to draw the logistic regression model, this is what you would do. You would take a cat, so this is an image of the cat (I'm very bad at drawing, sorry). In computer science you know that images can be represented as 3D matrices. So if I tell you that this is a color image of size 64 by 64, how many numbers do I need to represent those pixels?
[00:04:20] Yeah, I heard it: 64 by 64 by 3. The 3 is for the RGB channels, red, green, blue: every pixel in an image can be represented by three numbers, one for the red filter, one for the green filter, and one for the blue filter. So actually this image is of size 64 times 64 times 3. Does that make sense? [00:04:46] So the first thing we will do, in order to use logistic regression to find whether there is a cat in this image, is flatten it into a vector. I'm going to take all the numbers in this matrix and flatten them into a vector; it's just an image-to-vector operation, nothing more. And now I can use my logistic regression, because I have a vector input. [00:05:09] So I'm going to take all of these and push them into an operation, let me call it the logistic operation, which has one part that is wx + b, where x is going to be the image, and a second part that is going to be the sigmoid.
Everybody's familiar with the sigmoid function: the function that takes a number between minus infinity and plus infinity and maps it between 0 and 1. It's very convenient for classification problems. And this we're going to call ŷ (y hat), which is the sigmoid of wx + b. What you've seen in class previously, I think, is theta transpose x, but here we will just separate the notation into w and b. [00:06:04] So can someone tell me, what's the shape of w? [00:06:24] Yeah, 64 by 64 by 3, yeah. So you know that this guy here is a column vector of size 64 by 64 by 3: the shape of x is going to be (64 × 64 × 3) by 1, and that, I think, is twelve thousand two hundred eighty-eight (12,288). And indeed, because we want ŷ to be 1 by 1, this w has to be 1 by 12,288. Does that make sense? So we have a row vector as our parameter; we're just changing the notation of the logistic regression that you guys have seen.
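The shapes being described here can be checked in a few lines (a minimal NumPy sketch with a made-up random image, not code from the lecture):

```python
import numpy as np

def sigmoid(z):
    """Map any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Made-up 64x64 color image: a 3-D array of 64 * 64 * 3 = 12,288 numbers.
rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))

# Image-to-vector operation: flatten into a (12288, 1) column vector.
x = image.reshape(-1, 1)

# Parameters: w is a (1, 12288) row vector, b a single bias.
w = rng.standard_normal((1, x.shape[0])) * 0.01
b = 0.0

# Logistic regression: y_hat = sigmoid(wx + b), a (1, 1) number in (0, 1).
y_hat = sigmoid(w @ x + b)
```

The `(1, 12288) @ (12288, 1)` product is what makes ŷ come out 1 by 1.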
So once we have this model, we need to train it. As you know, the process of training is that first we will initialize our parameters. These are what we call parameters, and we will use the specific vocabulary of weights and biases; I believe you guys have heard this vocabulary before, weights and biases. So we're going to find the right w and the right b in order to be able to use this model properly. [00:07:54] Once we've initialized them, we will optimize them, that is, find the optimal w and b, and after we've found the optimal w and b, we will use them to predict. [00:08:20] Does this training process make sense? And I think the important part is to understand what "find the optimal w and b" means. It means defining your loss function, which is the objective. And in machine learning you often have this specific problem where you have a function that you know you want to find, the network function,
but you don't know the values of its parameters. In order to find them, you're going to use a proxy, which is your loss function: if you manage to minimize the loss function, you will find the right parameters. [00:08:55] So you define a loss function, which is the logistic loss: L(ŷ, y) = −[y log ŷ + (1 − y) log(1 − ŷ)]. You guys have seen this one. You remember where it comes from? It comes from maximum likelihood estimation, starting from a probabilistic model. [00:09:23] And so the idea is: how can I minimize this function? Minimize, because I've put a minus sign here. I want to find the w and b that minimize this function, and I'm going to use a gradient descent algorithm, which means I'm going to iteratively compute the derivative of the loss with respect to my parameters, and at every step I will update them to make this loss function go down a little at each iteration.
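For reference, the derivative the class is asked to recall works out as follows (a standard derivation consistent with the logistic loss above; σ denotes the sigmoid):

```latex
\mathcal{L}(\hat{y}, y) = -\bigl[\,y \log \hat{y} + (1-y)\log(1-\hat{y})\,\bigr],
\qquad \hat{y} = \sigma(z),\quad z = wx + b,\quad \sigma'(z) = \sigma(z)\bigl(1-\sigma(z)\bigr).

\frac{\partial \mathcal{L}}{\partial z}
  = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z}
  = \frac{\hat{y} - y}{\hat{y}\,(1-\hat{y})} \cdot \hat{y}\,(1-\hat{y})
  = \hat{y} - y,

\frac{\partial \mathcal{L}}{\partial w} = (\hat{y} - y)\, x^{\top},
\qquad
\frac{\partial \mathcal{L}}{\partial b} = \hat{y} - y.
```

So each gradient-descent step is w ← w − α(ŷ − y)xᵀ and b ← b − α(ŷ − y), for some learning rate α.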
So in terms of implementation, this is a for loop: you will loop over a certain number of iterations, and at each one you will compute the derivative of your loss with respect to your parameters. [00:10:07] Everybody remembers how to compute this? You take the derivative, use the fact that the sigmoid function has a derivative that is sigmoid times (1 minus sigmoid), and you compute the result. We're going to do some derivatives later today; this is just to set up the problem. [00:10:29] So one of the things I want to touch on here first: how many parameters does this model, this logistic regression, have, if you had to count them? [00:10:47] Twelve thousand two hundred eighty-nine, yeah, correct: 12,288 weights and one bias. Does that make sense? So actually it's funny, because you can quickly count the parameters by just counting the number of edges on the drawing, plus one.
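The training loop described here can be sketched in NumPy (a toy illustration on made-up synthetic data, not the course's code; the dataset, learning rate, and iteration count are arbitrary choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up synthetic data standing in for flattened images:
# m examples of dimension n, one example per column.
rng = np.random.default_rng(1)
n, m = 20, 200
X = rng.standard_normal((n, m))
true_w = rng.standard_normal((1, n))   # hypothetical "true" weights
y = (true_w @ X > 0).astype(float)     # (1, m) labels in {0, 1}

w = np.zeros((1, n))                   # initialize the parameters
b = 0.0
alpha = 0.5                            # learning rate

for _ in range(500):                   # the for loop over iterations
    y_hat = sigmoid(w @ X + b)         # forward pass, shape (1, m)
    # Logistic-loss gradient: sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
    # collapses the chain rule to (y_hat - y).
    dz = y_hat - y
    dw = dz @ X.T / m                  # averaged over the m examples
    db = dz.mean()
    w -= alpha * dw                    # gradient descent update
    b -= alpha * db

train_accuracy = ((sigmoid(w @ X + b) > 0.5) == y).mean()
```

On this separable toy data the loop drives the training accuracy close to 1; on real images you would of course evaluate on held-out data instead.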
Every circle has a bias; every edge has a weight, because ultimately you can rewrite this operation like that, right? Every weight corresponds to an edge. So that's another way to count; we're going to use it a little further on. [00:11:21] So we're starting with not too many parameters, actually, and one thing we notice is that the number of parameters of our model depends on the size of the input. At some point we probably won't want that, so we're going to change it later. [00:11:35] So, two equations that I want you to remember. The first one is: neuron = linear + activation. This is the vocabulary we will use in neural networks: we define a neuron as an operation that has two parts, one linear part and one activation part. And it's exactly that here; this is actually a neuron.
We have a linear part, wx + b, and then we take the output of this linear part and put it into an activation that, in this case, is the sigmoid function; it can be other functions. Okay, so this is the first equation, not too hard. [00:12:17] The second equation that I want to set now is: model = architecture + parameters. What does that mean? It means that here we're trying to train a logistic regression; in order to be able to use it, we need an architecture, which is the following one-neuron neural network, and the parameters w and b. [00:12:48] So basically, when people in industry say "we've shipped a model", what they're saying is that they found the right parameters for the right architecture. They have two files, and these two files are predicting a bunch of things: one parameter file and one architecture file. [00:13:05] The architecture will be modified a lot today; we will add neurons all over.
The parameters will always be called w and b, but they will become bigger and bigger, because we have more data that we want to be able to understand. You can guess that it's going to be hard to understand what a cat is with only that many parameters; we want more parameters. Any questions so far? [00:13:32] So this was just to set up the problem with logistic regression. Let's set a new goal, following the first goal we set. The second goal would be: find cat, lion, iguana. [00:13:58] A little different than before: the only thing we've changed is that we now want to detect three types of animals. If there's a cat in the image, I want to know there is a cat; if there is an iguana in the image, I want to know there is an iguana; if there's a lion in the image, I want to know it as well. So how would you modify the network we previously had in order to take this into account? Yeah, good idea: put two more circles, so two more neurons, and do the same thing. [00:14:32] So we have our picture here with the cat.
So the cat image, of size 64 by 64 by 3: we flatten it into x₁ through xₙ, let's say, where n represents 64 × 64 × 3. And what I will do is use 3 neurons that all compute the same kind of thing; they're all connected to all these inputs. I connect all my inputs x₁ to xₙ to each of these neurons, and I will use a specific set of notations here: [00:15:43] ŷ₁ = a₁^[1] = sigmoid(w₁^[1]x + b₁^[1]), ŷ₂ = a₂^[1] = sigmoid(w₂^[1]x + b₂^[1]), and similarly ŷ₃ = a₃^[1] = sigmoid(w₃^[1]x + b₃^[1]). [00:16:11] So I'm introducing a few notations here, and we will get used to them, don't worry; just write this down and we're going to go over it. The square brackets here represent what we will later call a layer. If you look at this network, it looks like there is one layer here, one layer in which neurons don't communicate with each other.
We could add to it, and we will later on: more neurons, in other layers. We will then denote with square brackets the index of the layer; the subscript index on the a is the number identifying the neuron inside the layer. So here we have one layer: we have a₁, a₂, and a₃, with square bracket [1] to identify the layer. Does that make sense? And then we have our ŷ which, instead of being a single number as before, is now a vector of size 3. [00:17:08] So how many parameters does this network have? [00:17:28] Okay, how did you come up with that? Yeah, correct: we have three times what we had before, because we added two more neurons and they all have their own set of parameters; this edge is a separate edge from that one, so we have to replicate the parameters for each of them. So w₁^[1] would be the equivalent of what we had for the cat alone, but we have to add two more parameter vectors and biases.
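In matrix form, stacking the three neurons' weight row-vectors gives one matrix; a shape-checking sketch with made-up random values (not code from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n = 64 * 64 * 3                  # 12,288 inputs from the flattened image
rng = np.random.default_rng(2)
x = rng.random((n, 1))           # flattened image as a column vector

# Layer of 3 neurons: stack the three (1, n) weight row-vectors into a
# (3, n) matrix, and the three biases into a (3, 1) vector.
W1 = rng.standard_normal((3, n)) * 0.01
b1 = np.zeros((3, 1))

# y_hat is now a vector of size 3 (cat, lion, iguana), each entry an
# independent sigmoid: the neurons don't communicate with each other.
y_hat = sigmoid(W1 @ x + b1)     # shape (3, 1)

# Parameter count: one weight per edge plus one bias per neuron,
# i.e. three times the single-neuron count of 12,289.
num_params = W1.size + b1.size   # 3 * 12288 + 3 = 36867
```

The edges-plus-biases counting rule from the board is exactly `W1.size + b1.size` here.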
[00:17:58] So, another question: when you had to train this logistic regression, what dataset did you need? [00:18:13] Can someone try to describe the dataset? Yeah, correct: we need images, and labels with them, where an image is labeled one for cat or zero for no cat. So it's a binary classification with images and labels. [00:18:32] Now, what do you think the dataset to train this new network should be? Yes, that's a good idea. So, just to repeat: a label for an image that has a cat would probably be a vector with a one and two zeros, where the first entry represents the presence of a cat, the second the presence of a lion, and the third the presence of an iguana. [00:19:16] So let's assume I use this scheme to label my dataset. I train this network using the same techniques as before: initialize all my weights and biases with some starting value, optimize a loss function using gradient descent, and then use ŷ to predict.
[00:19:37] What do you think this neuron is going to be responsible for, if you had to describe the responsibilities of this neuron? Yes, this one, yeah: lion. And this one: iguana. [00:20:02] That's a good question, we're going to talk about that now: can an image contain different animals or not? So, going back to what you said: because we decided to label our dataset like that, after training, this neuron is not really going to be there to detect cats. If we had changed the labeling scheme, and said that the second entry corresponds to the presence of the cat, then after training you would find that this neuron is responsible for detecting the cat. So the network is going to evolve depending on the way you label your dataset. [00:20:37] Now, do you think that this network can still be robust to different animals in the same picture?
So this cat now has a friend that is a lion. Okay, I have no idea how to draw a lion, but let's say there is a lion here, and because there is a lion, I will add a one here. Do you think this network is robust to this type of labeling? [00:21:13] Hmm, "it should be, the neurons aren't talking to each other": that's a good answer, actually. [00:21:31] Another answer, and that's a good intuition: what the network sees is just (1, 1, 0) and an image. It doesn't see that the first entry corresponds to the cat and the second to the lion. So this is a property of neural networks: it's the fact that you don't need to tell them everything; if you have enough data, they're going to figure it out. [00:21:52] So, because you will also have cats with iguanas, cats alone, lions with iguanas, lions alone, ultimately this neuron will understand what it's looking for.
And it will understand that this entry corresponds to the lion; it just needs a lot of data. So yes, it's going to be robust, and for the reason you mentioned: because the three neurons aren't communicating together, we can totally train them independently from each other. And in fact, the sigmoid here doesn't depend on the sigmoid there, and doesn't depend on the same weights, which means we can have (1, 1, 1) as an output. [00:22:31] Yes, question? You could think about it as three logistic regressions; we wouldn't call that a neural network yet, it's not ready yet, but it's like three logistic regressions side by side. [00:22:51] Now, following up on that, yeah, go for it. The question is: w and b are related to what? Oh yeah, so usually you would have theta transpose x, which is the sum of θᵢxᵢ, correct?
of theta_i x_i, plus theta_0 times 1. [00:23:21] If I split it like that, theta_0 corresponds to b, and the theta_i's correspond to the w_i's. Makes sense? One more question. [00:23:45] Good question; that's the next thing we're going to see. The question is a follow-up on this: are there cases where we have a constraint that there is only one possible outcome? It means there is no "cat and lion"; there's either a cat or a lion. There is no "iguana and lion"; there's either an iguana or a lion. [00:24:05] Think about health care: there are many models that are made to detect whether a skin disease is present based on microscopic cell images. Usually there is no overlap between diseases; it means you want to classify a specific disease among a large number of diseases. [00:24:27] So this model would still work, but it would not be optimal, because it takes longer to train.
Maybe one disease is super, super rare, and one of the neurons is never going to be trained. [00:24:36] Let's say you're working in a zoo where there is only one iguana and there are thousands of lions and thousands of cats: this neuron will almost never train, you know; it would be super hard to train this one. [00:24:48] So you want to start with another model, where you put in the constraint that, okay, there is only one disease that we want to predict, and let the model learn with all the neurons learning together, by creating interaction between them. [00:25:01] Have you guys heard of softmax? Yes, some of you, I see that. Okay, so let's look at softmax a little bit together. [00:25:11] So we set a new goal now, which is that we add a constraint: a unique animal on an image, so at most one animal on an image. [00:25:35] So I'm going to modify the network a little bit. We have our cat, and there is no lion on the image. We flatten it, and now I'm going to use the same scheme with the
three neurons a1, a2, a3, but as an output, what I'm going to use is an exponential, the softmax function. [00:26:13] So let me be more precise; let me actually introduce another notation to make it easier. As you know, the neuron is a linear part plus an activation, so we're going to introduce a notation for the linear part: I'm going to introduce z_1^[1] to represent the linear part of the first neuron, and z_2^[1] to represent the linear part of the second neuron. [00:26:43] So now one neuron has two parts: one which computes z, and one which computes a equals sigmoid of z. Now I'm going to remove all the activations and replace them with these, and I'm going to use this specific formula. [00:27:24] So this, if you recall, is exactly the softmax formula. Okay, so now the network we have... can you guys see it? It's too small? Too small, okay. [00:27:59] I'm going to just write this formula bigger, and then you can figure out the others, I
guess: a_3^[1] equals the exponential of z_3^[1], divided by the sum over k from 1 to 3 of the exponential of z_k^[1]. [00:28:20] So here is the formula for the third one; if you are doing it for the first one, you just change this 3 into a 1, and for the second one, into a 2. [00:28:28] So why is this formula interesting, and why is it not robust to the earlier labeling scheme anymore? It's because the sum of the outputs of this network has to sum up to 1. You can try it: if you sum the three outputs, you get the same thing in the numerator and in the denominator, and you get 1. That makes sense? [00:28:49] So instead of getting a probabilistic output for each of y-hat-1, y-hat-2, y-hat-3, we get a probability distribution over all the classes. It means we cannot get 0.7, 0.6, 0.1, telling us roughly that there is probably a cat and a lion but no iguana; we have to sum these to one.
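The formula just described can be sketched in a few lines. This is a generic illustration, not the lecture's own code; the function name and the example values are mine, and it checks the property he points out: the three outputs always sum to 1.

```python
import numpy as np

def softmax(z):
    # a_k = exp(z_k) / sum_j exp(z_j); subtracting the max is a standard
    # numerical-stability trick that leaves the result unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, -1.0, 0.5])  # hypothetical linear parts z_1, z_2, z_3
a = softmax(z)                  # a probability distribution over cat/lion/iguana
```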
So it means that if there is no cat and no lion, there is very likely an iguana: the three probabilities are dependent on each other. [00:29:24] And for this one, we have to label the following way: 1 0 0 for a cat, 0 1 0 for a lion, or 0 0 1 for an iguana. So this is called a softmax multi-class network. [00:30:04] You assume there is at least one of the three classes; otherwise, you have to add a fourth output that will represent the absence of an animal. But this way, you assume there is always one of these three animals in every picture. [00:30:23] And how many parameters does the network have? The same as the second one: we still have three neurons, and although I didn't write it, this z_1 is equal to w_1 x plus b_1, z_2 the same, z_3 the same, so there are 3n + 3 parameters. [00:30:46] So one question that we didn't talk about is how we train these parameters, these 3n + 3 parameters. How do we train them? Do you think this scheme will
work or not? What's wrong with this scheme? What's wrong with the loss function, specifically? [00:31:15] There are only two outcomes: in this loss function, y hat is a probability, a number between 0 and 1, while y is either 0 or 1, so it cannot match this labeling. So we need to modify the loss function. [00:31:36] Let's call it loss 3N. What I'm going to do is just sum it up over the three neurons. [00:32:05] Does this make sense? So I'm just doing this loss three times, once for each of the neurons: we have exactly three times this, and we sum them together. And if you train with this loss function, you should be able to train the three neurons that you have. [00:32:24] And again, talking about scarcity of one of the classes: if there are not many iguanas, then the third term of this sum is not going to help this neuron train towards detecting an iguana; it's going to push it towards detecting "no iguana" instead.
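As a sketch, the summed loss described above (one binary logistic loss per neuron, added together) might look like this; the function name `loss_3n` and the example values are my own, not from the lecture:

```python
import numpy as np

def loss_3n(y, y_hat, eps=1e-12):
    # Sum of three independent binary cross-entropy terms, one per neuron;
    # y holds the three 0/1 labels, y_hat the three sigmoid outputs.
    y = np.asarray(y, dtype=float)
    y_hat = np.clip(np.asarray(y_hat, dtype=float), eps, 1 - eps)
    return float(-np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))

y_true = [1, 1, 0]                       # "cat and lion, no iguana" labeling
good = loss_3n(y_true, [0.9, 0.8, 0.1])  # predictions close to the labels
bad = loss_3n(y_true, [0.1, 0.2, 0.9])   # predictions far from the labels
```

Training decreases this sum, which trains all three neurons at once, exactly because each label contributes its own term.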
any question on the last function [00:32:47] Juana any question on the last function does this one make sense yeah yeah [00:33:02] does this one make sense yeah yeah usually that's what will happen is that [00:33:03] usually that's what will happen is that the output of this network once it's [00:33:06] the output of this network once it's trained is going to be a probability [00:33:07] trained is going to be a probability distribution you will pick the maximum [00:33:09] distribution you will pick the maximum of those and you will set it one and the [00:33:11] of those and you will set it one and the others to zero as your prediction one [00:33:17] others to zero as your prediction one more question yeah [00:33:28] if you use the two one if you use this [00:33:31] if you use the two one if you use this labeling skin-like one one zero for this [00:33:34] labeling skin-like one one zero for this network what do you think it will happen [00:33:40] it will probably not work and the reason [00:33:43] it will probably not work and the reason is this sum is equal to two there's some [00:33:47] is this sum is equal to two there's some of these entries while the sum of this [00:33:49] of these entries while the sum of this entry is equal to one so you will never [00:33:51] entry is equal to one so you will never be able to match the output to the input [00:33:54] be able to match the output to the input to the label it makes sense [00:33:56] to the label it makes sense so what the network is probably going to [00:33:58] so what the network is probably going to do is it's probably going to send this [00:34:00] do is it's probably going to send this one to one half this one to one half and [00:34:02] one to one half this one to one half and this one to zero probably which is not [00:34:04] this one to zero probably which is not what you want okay let's talk about the [00:34:09] what you want okay let's talk about the last function for this softmax [00:34:11] last function 
for this softmax regression. [00:34:22] Because, you know, what's interesting about this loss is, if I take the derivative of loss 3N with respect to w_2, do you think it's going to be harder than the earlier derivative, or no? [00:34:41] It's going to be exactly the same, because only one of these three terms depends on w_2; it means the derivatives of the two others are zero, so we're at exactly the same complexity during the derivation. [00:34:52] But this one: do you think, if you try to compute... let's say we define a loss function that corresponds roughly to that. If you try to compute the derivative of the loss with respect to w_2, it will become much more complex, because this number, the output here that directly impacts the loss function, not only depends on the parameters of w_2; it also depends on the parameters of w_1 and w_3. And same for this output: this output also depends on
the parameters of w_2, does it make sense, because of this denominator. [00:35:29] So the softmax regression needs a different loss function and a different derivative. The loss function we'll define is a very common one in deep learning; it's called the softmax cross-entropy loss. [00:35:50] I'm not going into the details of where it comes from, but you can get the intuition why. [00:36:14] So it, surprisingly, looks like the binary cross-entropy, the logistic loss function; the only difference is that we sum it up over all the classes. [00:36:30] Now, we will take the derivative of something that looks like that later, but I'd say you can try it at home on this one; it would be a good exercise. [00:36:46] So this cross-entropy loss is very likely to be used in classification problems that are multi-class. Okay, so this was the first part, on logistic regression types of networks, and I think we're ready now, with the notation that we introduced, to jump on to neural networks.
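A sketch of the softmax cross-entropy loss he names: sum over all classes of minus y_k log(y-hat_k), so with a one-hot label only the true class's term is nonzero. The function names and the example values here are mine:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(y_onehot, y_hat, eps=1e-12):
    # Like the binary cross-entropy, but summed over all the classes
    # (and without the (1 - y) term).
    return float(-np.sum(np.asarray(y_onehot) * np.log(np.clip(y_hat, eps, None))))

z = np.array([2.0, 0.1, -1.0])       # hypothetical linear parts z_1, z_2, z_3
y = [1, 0, 0]                        # one-hot label: "cat"
loss = cross_entropy(y, softmax(z))  # smaller when the "cat" output is larger
```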
Any question on this first part before we move on? [00:37:15] So one question I would have for you: let's say that instead of trying to predict whether there is a cat or no cat, we were trying to predict the age of the cat based on the image. What would you change in this network? Instead of predicting one or zero, you want to predict the age of the cat; what are the things you would change? [00:37:43] Yes. Okay, so I repeat: you basically make several output nodes, where each of them corresponds to one age of cat. So would you use this network or the third one? Would you use the three-neuron network or the softmax regression? The third one. Why? You have a unique age; you cannot have two ages, right? So we would use the softmax one, because we want a probability distribution over the ages. [00:38:23] Okay, that makes sense; that's a good approach. There is also another
[00:38:31] approach, which is using regression directly to predict an age. An age can be between 0 and plus infinity... not plus infinity, 0 and a certain number. [00:38:44] So let's say you want to do a regression; how would you modify your network? Change the sigmoid: the sigmoid puts the output between 0 and 1, and we don't want this to happen, so I'd say we will change the sigmoid into... what function would you change the sigmoid into? [00:39:09] Yes. So the second one you said was to get a positive, unbounded type of output. Okay, so let's go with linear; you mentioned linear. We could just use a linear function in place of the sigmoid, but then this becomes a linear regression; the whole network becomes a linear regression. [00:39:30] Another one that is very common in deep learning is called the ReLU function. It's a function that is almost linear, but for every input that is negative, it's equal to zero. Because we cannot have
a negative age, it makes sense to use this one. [00:39:46] Okay, so this is called a rectified linear unit, ReLU; it's a very common one in deep learning. Now, what else would you change? [00:39:56] We talked about linear regression; do you remember the loss function you were using in linear regression? What was it? It was probably one of these two: y hat minus y, just a comparison between the label y and the prediction y hat, or the L2 loss, y hat minus y in L2 norm. [00:40:20] So that's what we would use: we would modify our loss function to fit the regression type of problem. And the reason we would use this loss for a regression task, instead of the one we have for classification, is because, in optimization, the shape of this loss is much easier to optimize for a regression task than it is for a classification task, and vice versa. I'm not going to go into the details of that, but that's the intuition. [00:40:46] Okay, let's go have fun with neural networks.
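The two changes described for the age-regression version, a ReLU in place of the sigmoid at the output and an L2 loss, can be sketched as follows; all names and values here are mine, for illustration only:

```python
import numpy as np

def relu(z):
    # Rectified linear unit: identity for positive inputs, zero for negative ones,
    # so a predicted age can never be negative.
    return np.maximum(0.0, z)

def l2_loss(y_hat, y):
    # Squared-error loss for the regression version of the problem.
    return float((y_hat - y) ** 2)

age_pred = relu(-0.7)  # a hypothetical negative linear output, clamped to 0
```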
[00:41:10] So we stick to our first goal: given an image, tell us if there is a cat or no cat; this is one, this is zero. [00:41:28] But now we're going to make the network a little more complex; we're going to add some parameters. So I get my picture of the cat... the cat is moving, okay. [00:41:45] And what I'm going to do is put more neurons than before, maybe something like that. [00:42:35] So, using the same notation, you see that my square bracket here is 2, indicating that there is a layer here which is the second layer, while this one is the first layer and this one is the third layer. [00:42:56] Everybody's up to speed with the notations? Cool. [00:43:04] So now, notice that when you make a choice of architecture, you have to be careful of one thing: the output layer has to have the same number of neurons as the number of classes you want for a classification, and one for a regression. [00:43:27] So how many
parameters does this network have? Can someone quickly give me the thought process? How much here? [00:43:41] Yeah, like 3n plus 3, let's say. [00:43:59] Yeah, correct. So here you would have 3n weights plus three biases; here you would have two times three weights plus two biases, because you have three neurons connected to two neurons; and here you would have two times one weights plus one bias. This is the total number of parameters. [00:44:18] So you see that we didn't add too many parameters; most of the parameters are still in the input layer. [00:44:28] Let's define some vocabulary. The first word is "layer". A layer denotes neurons that are not connected to each other: these two neurons are not connected to each other, and these three neurons are not connected to each other. We call such a cluster of neurons a layer, and this network has three layers. [00:44:44] We would use "input layer" to denote the first layer, and "output layer" to denote the third layer, because it directly sees the output, and we would call the second layer a hidden layer.
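The count walked through above (3n + 3 for the first layer, then 2 times 3 plus 2, then 2 times 1 plus 1) is just "weights = fan-in times fan-out, plus one bias per neuron", applied layer by layer. A small sketch, with a made-up input size n:

```python
def count_params(layer_sizes):
    # layer_sizes = [n_inputs, layer_1, layer_2, ...]; each layer contributes
    # fan_in * fan_out weights plus fan_out biases.
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]))

n = 4                               # hypothetical flattened-input size
total = count_params([n, 3, 2, 1])  # the 3-2-1 network on the board
```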
[00:44:59] The reason we call it hidden is because the input and the output are hidden from this layer: the only thing this layer sees as input is what the previous layer gives it. So it's an abstraction of the inputs, but it's not the input. Does that make sense? And similarly, it doesn't see the output; it just gives what it understood to the last neuron, which will compare the output to the ground truth. [00:45:28] So now, why are neural networks interesting, and why do we call this a hidden layer? It's because, if you train this network on cat classification with a lot of images of cats, you would notice that the first layers are going to understand the fundamental concepts of the image, which are the edges: this neuron is going to be able to detect this type of edge, this neuron is probably going to detect some other type of edge, and this neuron maybe this type of edge. Then what's going to
happen is that these neurons are going to communicate what they found in the image to the next layer's neurons, [00:46:05] and this neuron is going to use the edges that these guys found to figure out, oh, there are ears, while this one is going to figure out, oh, there is a mouth, and so on, if you have several neurons. And they're going to communicate what they understood to the output neuron, which is going to reconstruct the face of the cat based on what it received, and be able to tell whether there is a cat or not. [00:46:29] So the reason it's called a hidden layer is because we don't really know what it's going to figure out, but with enough data, it should understand very complex information about the data. The deeper you go, the more complex the information the neurons are able to understand. [00:46:45] Let me give you another example, which is a house price prediction example.
[00:47:12] So let's assume that our inputs are: number of bedrooms, size of the house, zip code, and wealth of the neighborhood. [00:47:30] Let's say that what we will build is a network that has three neurons in the first layer and one neuron in the output layer. [00:47:42] So what's interesting is that, as a human, if you were to build this network and, like, hand-engineer it, you would say that, okay, zip code and wealth are able to tell us about the school quality in the neighborhood, the quality of the school that is next to the house; probably, as a human, you would say these are good features to predict that. [00:48:13] The zip code is going to tell us whether the neighborhood is walkable or not, probably. [00:48:26] The size and the number of bedrooms are going to tell us the size of the family that can fit in this house. And these three are probably better information than these
in order to finally predict the price. So that's a way to hand-engineer it, as a human, in order to give human knowledge to the network to figure out the price. In practice, what we do here is that we use a fully connected layer. [00:49:02] Fully connected — what does it mean? It means that we connect every input to the first layer, every output of the first layer to the input of the next layer, and so on; all the neurons from one layer to the next are connected with each other. What we're saying is that we will let the network figure these out: we will let the neurons of the first layer figure out what's interesting for the second layer to make the price prediction. So we will not tell these to the network; instead, we will fully connect the network and let it figure out what are the
interesting features — and oftentimes the network is going to be better than humans at finding the features that are representative. Sometimes you may hear neural networks referred to as black-box models. The reason is that we will not understand what a given edge corresponds to; it's hard to figure out that this neuron is detecting a weighted average of the input features. Does it make sense? Another term you might hear is end-to-end learning. The reason we talk about end-to-end learning is that we have an input and a ground truth, and we don't constrain the network in the middle — we let it learn whatever it has to learn. We call it end-to-end learning because we're just training based on the input and the output. [00:51:14] Let's delve more into the math of this network — the neural network that we have here, which has an input layer, a hidden layer, and an output layer. Let's try to
write down the equations that take the input and forward-propagate it through to the output. We first have z1, the linear part of the first layer, which is computed as w1 times x plus b1. Then this z1 is given to an activation — let's say it's sigmoid — so a1 is sigmoid of z1. z2 is then the linear part of the second layer, which takes the output of the previous layer, multiplies it by its weights, and adds its bias. The second activation takes the sigmoid of z2. Finally we have the third layer, which multiplies its weights with the output of the layer preceding it and adds its bias; and finally we have the third activation, which is simply the sigmoid. [00:52:39] So what is interesting to notice, between these equations and the equations that we wrote here, is that we put everything in matrices.
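These forward-propagation equations can be sketched in NumPy — a minimal illustration only, assuming the 4-input housing example with layer sizes 3, 2, 1, sigmoid activations everywhere, and randomly initialized weights:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)

n = 4                                  # bedrooms, size, zip code, wealth
x = rng.random((n, 1))                 # one example, a column vector

# Randomly initialized parameters for a 4 -> 3 -> 2 -> 1 network
W1, b1 = rng.standard_normal((3, n)), np.zeros((3, 1))
W2, b2 = rng.standard_normal((2, 3)), np.zeros((2, 1))
W3, b3 = rng.standard_normal((1, 2)), np.zeros((1, 1))

# Forward propagation: z = Wx + b, a = sigmoid(z), layer by layer
z1 = W1 @ x + b1; a1 = sigmoid(z1)     # shape (3, 1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)    # shape (2, 1)
z3 = W3 @ a2 + b3; a3 = sigmoid(z3)    # shape (1, 1) -- the prediction
```

Each matrix line stands in for all the neurons of one layer at once, which is exactly the point being made about putting everything in matrices.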
It means that for the three neurons I have here, I wrote a single equation, and likewise for the neurons in the second layer — one equation summarizes them — but the shapes of these things are going to be vectors. So let's go over the shapes; let's try to define them. z1 is going to be w1 times x: x is n by 1, and w1 has to be 3 by n because it connects three neurons to the input, so z1 has to be 3 by 1. It makes sense, because we have three neurons. Now let's go deeper. a1 is just the sigmoid of z1, so it doesn't change the shape; it keeps the 3 by 1. z2 — we know it has to be 2 by 1, because there are two neurons in the second layer, and it helps us figure out what w2 would be: we know a1 is 3 by 1, which means w2 has to be 2 by 3. And if you count the edges between the first and the second layer here, you will find 6 edges — 2 times 3. a2: same shape as z2. z3: 1 by 1. a3: 1 by 1. w3: it has to be 1 by 2
because a2 is 2 by 1. It's the same for the b's: each matches the number of neurons, so 3 by 1, 2 by 1, and finally 1 by 1. So I think it's usually very helpful, even when coding these types of equations, to know all the shapes that are involved. Are you guys totally OK with the shapes? Super easy to figure out? OK, cool. So now, what is interesting is that we will try to vectorize the code even more. Does someone remember the difference between stochastic gradient descent and gradient descent? What's the difference? [00:55:13] Exactly. So stochastic gradient descent is: update the weights and the biases after you see every example — so the direction of the gradient is quite noisy; it doesn't represent the entire batch very well. While gradient descent, or batch gradient descent, is: update after you've seen the whole batch of examples — the gradient is much more precise; it points in the direction you want to go.
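That distinction can be sketched on a toy problem — a hypothetical one-parameter least-squares fit, not the network above — where the only difference between the two loops is how many examples each update sees:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random(100)
y = 3 * X + rng.normal(0, 0.1, 100)    # noisy line, true slope = 3

def grad(w, xb, yb):
    # dL/dw for the squared loss L = mean((w*x - y)^2)
    return np.mean(2 * (w * xb - yb) * xb)

alpha = 0.1

# Batch gradient descent: one precise update per pass over ALL examples
w_batch = 0.0
for _ in range(200):
    w_batch -= alpha * grad(w_batch, X, y)

# Stochastic gradient descent: one noisy update per single example
w_sgd = 0.0
for _ in range(2):                     # two passes over the data
    for xi, yi in zip(X, y):
        w_sgd -= alpha * grad(w_sgd, xi, yi)
```

Both end up near the true slope of 3; the stochastic version just wanders there through noisier steps, one example at a time.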
So what we're trying to do now is to write down these equations if, instead of giving one single cat image, we had given a bunch of images that each either have a cat or not. So what happens for an input batch of examples? [00:56:22] Now our X is not a single column vector anymore; it's a matrix, with the first image corresponding to x(1), the second image corresponding to x(2), and so on, until the m-th image corresponding to x(m). And I'm introducing a new notation, the parenthesized superscript, corresponding to the index of the example. [00:56:55] So: square brackets for the layer, round brackets for the index of the example we're talking about. Just to give more context on what we're trying to do: we know that this is a bunch of operations — we just have a network with input, hidden, and output layers, but we could have a network with a thousand layers. The more layers we have, the more computation, and it quickly
goes up. So what we want to do is to be able to parallelize our code — our computation — as much as possible, by giving batches of inputs and parallelizing these equations. So let's see how these equations are modified when we give the network a batch of m inputs. [00:57:41] I will use capital letters to denote the equivalent of the lowercase letters, but for a batch of inputs. So Z1, as an example, would be w1 — let's use the same w1 — times X plus b1. Let's analyze what Z1 would look like. We know that for every input example of the batch we will get one z1, so it should look like this. [00:58:29] Then we have to figure out what the shapes in this equation have to be in order to end up with this. We know that z1 was 3 by 1; it means capital Z1 has to be 3 by m, because each of these column vectors is 3 by 1 and we have m of them — because for each input we forward-propagate
through the network, and we get these equations: for the first cat image we get these equations, for the second cat image we again get equations like that, and so on. So what is the shape of X? We have it above; we know that it's n by m. What is the shape of w1? It didn't change — w1 doesn't change. It's not because I give a thousand inputs to my network that there are going to be more parameters; the number of parameters stays the same even if I give more inputs. And so this has to be 3 by n in order to match. Now, the interesting thing is that there is an algebraic problem here. What is the algebraic problem? We said that the number of parameters doesn't change; it means that W has the same shape as it had before, and b should have the same shape as it had before, right? It should be 3 by 1. So what's the problem with this equation? Exactly: we're summing a 3 by
m matrix to a 3 by 1 vector. This is not possible — it doesn't work, it doesn't match. When you do summations or subtractions, you need the two terms to be the same shape, because you do an element-wise addition or an element-wise subtraction of them. So what's the trick that is used here? It's a technique called broadcasting. [01:00:41] Broadcasting is the fact that we don't want to change the number of parameters — it should stay the same — but we still want this operation to be writable in a parallel version. We still want to write this equation, because we want to parallelize our code, but we don't want to add more parameters; that doesn't make sense. So what we're going to do is create a vector b-tilde-1, which is going to be b1 repeated three times — sorry, repeated m times. [01:01:23] So we just keep the same number of parameters, but repeat them, in order to be able to write the code in parallel.
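The repeated-bias trick looks like this in NumPy (illustrative numbers): adding the 3-by-1 vector b gives the same result as adding the explicitly repeated b-tilde.

```python
import numpy as np

m = 5                                      # batch size (illustrative)
Z = np.arange(3 * m).reshape(3, m).astype(float)  # stands in for W1 @ X
b = np.array([[10.0], [20.0], [30.0]])     # the 3-by-1 parameter vector b1

# b-tilde: b repeated m times into a 3-by-m matrix
b_tilde = np.tile(b, (1, m))

# Broadcasting does the repetition implicitly: same result, without
# storing (or learning) any extra parameters
assert np.array_equal(Z + b, Z + b_tilde)
```

Only three bias values exist either way; the repetition is purely a bookkeeping device so the batched equation type-checks.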
This is called broadcasting, and what is convenient — for those of you who do the homeworks in MATLAB or Python... MATLAB? OK, so no MATLAB; Python? Python. So in Python there is a package that is often used to code these equations: it's NumPy — some people call it "numpee," not sure. So NumPy — basically, Numerical Python — will directly do the broadcasting. It means that if you sum this 3 by m matrix with a 3 by 1 parameter vector, it's going to automatically reproduce the parameter vector m times so that the equation works. That's called broadcasting — does it make sense? So, because we're using this technique, we're able to rewrite all these equations with capital letters. Do you want to do it together, or do you want to do it on your own? Who wants to do it on their own? OK, so do it on your own:
rewrite these with capital letters and figure out the shapes. I think you can do it at home — we're not going to do it here — but make sure you understand all the shapes. Yeah? [01:03:05] So the question is: how is this different from principal component analysis? This is a supervised learning algorithm that will be used to predict the price of a house. Principal component analysis doesn't predict anything: it takes an input matrix X, normalizes it, computes the covariance matrix, and then figures out what the principal components are by doing the eigenvalue decomposition. The outcome of PCA is that you know the most important features of your data set X are going to be these features. Here, we're not looking at the features; we're only looking at the output — that's what is important to us. [01:03:57] So the question is: can you explain why the first layer would see the edges — is there any intuition behind it? It's not always going to see the edges, but
it's often going to see edges. Because, in order to detect a human face — let's say you train an algorithm to find out whose face it is, so it has to understand faces very well — you need the network to be complex enough to understand very detailed features of the face. And usually what this neuron sees as input are pixels; it means every edge here is the multiplication of a weight by a pixel. So it sees pixels. It cannot understand the face as a whole, because it sees only pixels — very granular information for it. So it's going to check if pixels nearby have the same color, and understand that there is an edge there, OK? But it's too complicated to understand the whole face in the first layer. However, if it understands a little more than pixel-level information, it can give that to the next neuron. That neuron will receive more than pixel
information — it will receive something a little more complex, like edges — and then it will use this information to build on top of it and build up the features of the face. So what I'm trying to sum up is that these first neurons only see the pixels, so they're not able to build more than the edges; that's the maximum thing they can build. And it's a complex topic — interpretation of neural networks is a very highly researched topic, a big research topic — so nobody has figured out exactly how all the neurons evolve. Yeah, one more question and then we move on. [01:05:50] So the question is: how do you decide how many neurons per layer, how many layers — what's the architecture of the neural network? There are two things to take into consideration, I would say. First, nobody knows the right answer, so you have to test it. You guys talked about the training set,
validation set, and test set. So what we would do is try, say, 10 different architectures, train the network on each of them, look at the validation-set accuracy of all of them, and decide which one seems to be the best. That's how we figure out the right network size. On top of that, using experience is often valuable. So if you give me a problem, I always try to gauge how complex the problem is. Take cat classification: do you think it's easier or harder than day-and-night classification? Day-and-night classification is: I give you an image and ask you to predict if it was taken during the day or during the night; on the other hand, you want to know whether there is a cat in the image or not. Which one is easier, which one is harder? [01:06:54] Who thinks cat classification is harder? OK — I think cat classification seems harder. Why? Because there are many breeds of cats; cats can look like different
things — there are not many breeds of nights. One thing that might be challenging in day-and-night classification is if you also want it to work indoors: you know, maybe there is a tiny window there, and I'm able to tell that it's daytime, but for a network to understand that, you would need a lot more data than if you only wanted it to work outdoors. So these problems all have their own complexity, and based on their complexity, I think the network should be deeper: the more complex the problem usually is, the more data you need in order to figure out the output, and the deeper the network should be. That's an intuition, I think. OK, let's move on, guys, because I think we have about 12 more minutes. [01:07:57] OK, let's try to write the loss function for this problem. So now that we have our network, we have written these propagation equations, and I will call it the forward-propagation phase: going forward
means going from the input to the output. Later on, when we derive these equations, we will call that backward propagation, because we're starting from the loss and going backwards. So let's talk about the optimization problem: optimizing w1, w2, w3, b1, b2, and b3. We have a lot of stuff to optimize, right? We have to find the right values for these — and remember, model equals architecture plus parameters: we have our architecture, so if we have our parameters, we're done. So in order to do that, we have to define an objective function, sometimes called a loss, sometimes a cost function. Usually we would call it a loss if there is only one example in the batch, and a cost if there are multiple examples in the batch. So let's define the cost function. The cost function J depends on y-hat and y — OK, and y-hat is a3. [01:09:54] It depends
on y-hat and y, and we will set it to be the sum of the loss functions L(i); and I will normalize it — it's not mandatory, but normalize it — with one over m. So what this means is that we're going for batch gradient descent: we want to compute the loss function for the whole batch, parallelize our code, and then calculate the cost function, which will then be differentiated to give us the direction of the gradient — that is, the average direction of all the derivatives with respect to the whole input batch. And L(i) will be the loss function corresponding to one parameter — sorry, not parameter, one input: what's the error on this one specific input? And it will be the logistic loss. [01:11:10] You've already seen these equations, I believe.
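The cost just described — per-example logistic losses summed and normalized by 1/m — can be sketched as follows (the labels and outputs here are made-up illustrative numbers; in the lecture's notation, Y_hat would be the batch of a3 values):

```python
import numpy as np

def logistic_loss(y_hat, y):
    # Per-example logistic loss: -[y*log(y_hat) + (1-y)*log(1-y_hat)]
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cost(Y_hat, Y):
    # Cost J: the per-example losses summed, normalized by 1/m
    m = Y.shape[1]
    return np.sum(logistic_loss(Y_hat, Y)) / m

Y = np.array([[1.0, 0.0, 1.0]])          # ground-truth labels, batch of m = 3
Y_hat = np.array([[0.9, 0.2, 0.6]])      # network outputs for the batch
J = cost(Y_hat, Y)
```

Confident, correct predictions (y_hat near y) contribute almost nothing to J; confident, wrong ones blow it up — which is what makes it a sensible objective to drive toward zero.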
[01:11:38] Which one is the hardest? Who thinks J is the hardest? Who thinks it doesn't matter? It doesn't matter, because differentiation is a linear operation: you can just take the derivative inside the sum, and you'll see that if you know the derivative of L, you just have to take the sum over it. So instead of computing our derivatives on J, we will compute them on L, but it's totally equivalent; there's just one more step at the end. [01:12:15] Okay, so now we defined our loss function. Super. The next step is to optimize, so we have to compute a lot of derivatives, and that's called backward propagation. [01:12:51] So the question is, why is it called backward propagation? It's because what we want to do, ultimately, is this: for every l = 1 to 3,

W[l] := W[l] - alpha * dJ/dW[l]
b[l] := b[l] - alpha * dJ/db[l]

[01:13:29] We want to do that for every parameter in layers 1, 2, and 3, so it means we have to compute all these derivatives: the derivative of the cost with respect to W[1], W[2], W[3], b[1], b[2], b[3]. You've done it with logistic regression; we're going to do it with a neural network, and you're going to understand why it's called backward propagation. [01:13:53] Which derivative do you want to start with: the derivative with respect to W[1], W[2], or W[3]? (We'll do the biases later.) Do you think W[1] is a good idea? I don't want to do W[1]; I think we should do W[3]. And the reason is: if you look at this loss function, do you think the relation between W[3] and the loss function is easier to understand, or the relation between W[1] and the loss function? It's the relation between W[3] and this loss function.
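In code, the update rule above amounts to the following sketch (parameter shapes chosen to match a small 3-2-1 network; the gradients here are placeholders standing in for what backpropagation will produce):

```python
import numpy as np

def gradient_descent_step(params, grads, alpha):
    # W[l] := W[l] - alpha * dJ/dW[l] and b[l] := b[l] - alpha * dJ/db[l]
    # for every layer l = 1, 2, 3.
    for l in (1, 2, 3):
        params[f"W{l}"] -= alpha * grads[f"dW{l}"]
        params[f"b{l}"] -= alpha * grads[f"db{l}"]
    return params

params = {"W1": np.ones((3, 4)), "b1": np.zeros((3, 1)),
          "W2": np.ones((2, 3)), "b2": np.zeros((2, 1)),
          "W3": np.ones((1, 2)), "b3": np.zeros((1, 1))}
# Placeholder gradients; in reality these come out of backward propagation.
grads = {"d" + k: 0.5 * np.ones_like(v) for k, v in params.items()}
params = gradient_descent_step(params, grads, alpha=0.1)
```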
[01:14:35] Because W[3] happens much later in the network. If you want to understand how much we should move W[1] in order to make the loss move, it's much more complicated than answering how much W[3] should move to move the loss, because there are many more connections on the path if you want to compute it for W[1]. [01:14:53] So that's why we call it backward propagation: we will start with the top layer, the one that's closest to the loss function, and compute the derivative of J with respect to W[3]. [01:15:09] And once we've computed this derivative, which we are going to do next week, once we've computed this number, we will be able to compute the next one very easily. Why very easily? Because we can use the chain rule of calculus. So let's see how it works; I'm just going to give you a one-minute pitch on backprop, but we'll do it next week together. [01:15:44] If we had to compute this derivative, what I would do is separate it into several derivatives that are easier. I will separate it into the derivative of J with respect to something, times the derivative of that something with respect to W[3]. And the question is, what should this something be? I look at my equations: I know that J depends on y hat, I know that y hat depends on z[3] (y hat is the same thing as a[3]), and I also know that z[3] depends on W[3], and the derivative of z[3] with respect to W[3] is super easy; it's just a[2] transpose. So I can say this derivative is the same as:

dJ/dW[3] = dJ/da[3] * da[3]/dz[3] * dz[3]/dW[3]

[01:16:52] So you see: same derivative, calculated in a different way, and I know each of these factors is pretty easy to compute. That's why we call it backpropagation: because we use the chain rule to compute the derivative with respect to W[3]. [01:17:06] And then, when I want to do it for W[2], I'm going to insert the derivative with respect to z[3], times the derivative of z[3] with respect to a[2], times the derivative of a[2] with respect to z[2], times the derivative of z[2] with respect to W[2]:

dJ/dW[2] = dJ/dz[3] * dz[3]/da[2] * da[2]/dz[2] * dz[2]/dW[2]

Does this make sense, that this thing here is the same as before? It means that if I want to compute the derivative with respect to W[2], I don't need to compute the first part anymore; I already did it for W[3]. I just need to compute the remaining factors, which are easy ones. [01:17:53] And so on: if I want to compute the derivative of J with respect to W[1], I'm not going to decompose the whole thing again. I'm just going to take the derivative of J with respect to z[2].
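To see that the chain-rule factorization really equals the original derivative, here is a scalar toy version of the three-layer network (every weight is a single number and every activation is a sigmoid; this is an illustrative sketch, not the lecture's full matrix derivation). With the logistic loss and a sigmoid output, the first two factors collapse to dL/dz3 = a3 - y, so dL/dw3 = (a3 - y) * a2, which we can check against a numerical derivative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, w1, b1, w2, b2, w3, b3):
    # Scalar forward pass: z[l] = w[l]*a[l-1] + b[l], a[l] = sigmoid(z[l]).
    a1 = sigmoid(w1 * x + b1)
    a2 = sigmoid(w2 * a1 + b2)
    a3 = sigmoid(w3 * a2 + b3)
    return a1, a2, a3

def loss(a3, y):
    return -(y * np.log(a3) + (1 - y) * np.log(1 - a3))

x, y = 0.5, 1.0
w1, b1, w2, b2, w3, b3 = 0.3, 0.1, -0.4, 0.2, 0.7, -0.1

# Chain rule: dL/dw3 = (dL/da3) * (da3/dz3) * (dz3/dw3) = (a3 - y) * a2
a1, a2, a3 = forward(x, w1, b1, w2, b2, w3, b3)
dL_dw3 = (a3 - y) * a2

# Numerical derivative of the same quantity (central difference)
h = 1e-6
numeric = (loss(forward(x, w1, b1, w2, b2, w3 + h, b3)[2], y)
           - loss(forward(x, w1, b1, w2, b2, w3 - h, b3)[2], y)) / (2 * h)
```

The same caching idea applies to w2 and w1: once dL/dz3 is known, the deeper derivatives only add cheap local factors.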
[01:18:07] That derivative of J with respect to z[2] is equal to the whole previous expression, and then I'm going to multiply it by the derivative of z[2] with respect to a[1], times the derivative of a[1] with respect to z[1], times the derivative of z[1] with respect to W[1]:

dJ/dW[1] = dJ/dz[2] * dz[2]/da[1] * da[1]/dz[1] * dz[1]/dW[1]

And again, the first factor I know already; I computed it previously, just not for this one. So what's interesting about it is that I'm not going to redo the work I did: I'm just going to store the right values while backpropagating and continue differentiating. [01:18:43] One thing you need to notice, though, is that you need the forward propagation equations in order to remember which path to take in your chain rule. This derivative of J with respect to W[3]: I cannot reuse it as it is, because W[3] is not connected to the previous layer. If you look at the equations, a[2] doesn't depend on W[3]; it depends on z[3]... sorry, my bad,
no, sorry, what I wanted to say [01:19:14] is that z[2] is connected to W[2], but a[1] is not connected to W[2]. So you want to choose the path you're going through in the proper way, so that none of the factors in these derivatives is ill-defined: you cannot compute the derivative of W[2] with respect to a[1]. You cannot compute that; you don't know it. [01:19:50] Okay, so I think we're done for today. One thing I'd like you to do, if you have time, is to think about the things that can be tweaked in a neural network. When you build a neural network, you are not done: you have to tweak it. You have to tweak the activations, you have to tweak the loss function; there are many things you can tweak, and that's what we're going to see next week. Okay, thanks. ================================================================================ LECTURE 012 ================================================================================ Lecture 12 - Backprop & Improving Neural Networks | Stanford CS229: Machine Learning (Autumn 2018) Source: https://www.youtube.com/watch?v=zUazLXZZA2U --- Transcript [00:00:04] Hi everyone, welcome to the
second lecture on deep learning for CS229. A quick announcement before we start: there is a Piazza post, number 695, which is the mid-quarter survey for CS229, so fill it in when you have time. [00:00:23] Okay, so let's get back to deep learning. Last week together we saw what a neural network is. We started by defining logistic regression from a neural network perspective: we said that logistic regression can be viewed as a one-neuron neural network, where there is a linear part and an activation part, which was the sigmoid in that case. We've seen that the sigmoid is a common activation function for classification tasks, because it maps a number between minus infinity and plus infinity into the zero-one interval, which can be interpreted as a probability. [00:01:05] And then we introduced the neural network: we started to stack some neurons inside a layer, and then to stack layers on top of each other. And we said that the more we stack layers, the more parameters we have, and the more parameters we have, the more our network is able to capture the complexity of our data, because it becomes more flexible. [00:01:28] So we stopped at the point where we did a forward propagation: we had an example during training that was forward propagated through the network, we got the output, then we computed the cost function, which compares this output to the ground truth, and we were in the process of backpropagating the error, to tell our parameters how they should move in order to detect cats more properly. Does that make sense, all this part? [00:01:51] So today we're going to continue that. We're in the second part, neural networks: we're going to derive backpropagation with the chain rule, and after that we're going to talk about how to improve our neural networks.
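The recap above, a flattened input pushed through three stacked sigmoid layers (3 neurons, then 2, then 1), can be sketched like this (tiny input size and random initialization chosen purely for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params):
    # Layer l: z[l] = W[l] a[l-1] + b[l], a[l] = sigmoid(z[l]), with a[0] = x.
    a = x
    for l in (1, 2, 3):
        a = sigmoid(params[f"W{l}"] @ a + params[f"b{l}"])
    return a  # a[3] = y_hat, a number in (0, 1)

rng = np.random.default_rng(0)
n = 4  # flattened image size (tiny, for illustration)
params = {"W1": rng.normal(size=(3, n)), "b1": np.zeros((3, 1)),
          "W2": rng.normal(size=(2, 3)), "b2": np.zeros((2, 1)),
          "W3": rng.normal(size=(1, 2)), "b3": np.zeros((1, 1))}
y_hat = forward(rng.normal(size=(n, 1)), params)
```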
[00:02:03] Because in practice, it's not because you designed your neural network that it's going to work: there are a lot of hacks and tricks that you need to know in order to make a neural network work. Okay, let's go. [00:02:19] So the first thing we talked about: in order to define our optimization problem and find the right parameters, we need to define a cost function, and usually we said we use the letter J to denote the cost function. Here, when I talk about the cost function, I'm talking about a batch of examples: it means I'm forward propagating m examples at a time. Do you remember why we do that? What's the reason we use a batch instead of a single example? Vectorization: we want to use what our GPU can do and parallelize the computation. [00:02:55] So that's what we do: we have m examples that get forward propagated through the network, and each of them has a loss function associated with it; the average of the loss functions over the batch gives us the cost function. [00:03:13] And we had defined this loss function together, L(i). Just as a reminder, we're still in this network where we had a cat: remember, x1 through xn, the cat was flattened, the RGB matrix into one vector, and then there was a neural network with three neurons, then two neurons, then one neuron, fully connected. [00:03:56] So now we're here: we take m images of cats or non-cats, forward propagate everything through the network, compute a loss function for each of them, average them, and get the cost function. And our loss function was the binary cross-entropy, also called the logistic loss. It was the following:

L(i) = -[ y(i) log y_hat(i) + (1 - y(i)) log(1 - y_hat(i)) ]

[00:04:30] So let me circle this one; it's an important one.
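The circled loss in code, with the leading minus sign made explicit (a sketch; y_hat must lie strictly between 0 and 1 so the logarithms are defined):

```python
import numpy as np

def logistic_loss(y_hat, y):
    # Binary cross-entropy: L = -[ y*log(y_hat) + (1 - y)*log(1 - y_hat) ]
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# A confident correct prediction is barely penalized;
# a confident wrong one is penalized heavily.
low = logistic_loss(0.99, 1)   # true cat, predicted cat
high = logistic_loss(0.01, 1)  # true cat, predicted non-cat
```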
[00:04:36] And what we said is that this network has many parameters: the first layer has W[1], b[1], the second layer has W[2], b[2], and the third layer has W[3], b[3], where the square brackets denote the layer. And we have to train all these parameters. [00:05:02] One thing we noticed is that, because we want to make good use of the chain rule, we're going to start by computing the derivatives of these guys, W[3] and b[3], then come back and do W[2] and b[2], and then back again W[1] and b[1], in order to use our gradient descent update formulas, where:

W[l] := W[l] - alpha * dJ/dW[l]

for any layer l between 1 and 3, and the same for b. [00:05:42] Okay, so let's try to do it. This is the first number we want to compute, dJ/dW[3]. And remember, the reason we want to compute the derivative of the cost with respect to W[3] first is that the relationship between W[3] and the cost is easier than the relationship between W[1] and the cost, because W[1] has many more connections going through the network before ending up in the cost computation. [00:06:13] One thing we should notice before starting this calculation is that the derivative is linear: if I take the derivative of J, I can just take the derivative of L and it's the same thing, I just need to add the summation afterwards, because differentiation is a linear operation. Does that make sense to everyone? So instead of computing the derivative of J, I'm going to compute the derivative of L, and then I will add the summation; it just makes our notation easier. [00:06:45] So I'm taking the derivative of the loss of one example propagated through the network, with respect to W[3]. Let's do the calculation together. I have minus y(i) times the derivative with respect to W[3] of... what? Remember that y hat was equal to sigma(W[3] x + b), or rather sigma(W[3] a[2] + b[3]), because a[2], the output of the second layer, is the input to the third layer. So I write it down here: log sigma(W[3] a[2] + b[3]); sorry, I had forgotten the logarithm. [00:08:00] Okay, so we have this term, and then we have the second term, which is plus (1 - y(i)) times the derivative with respect to W[3] of log(1 - sigma(W[3] a[2] + b[3])). Altogether:

dL/dW[3] = -[ y(i) * d/dW[3] log sigma(W[3] a[2] + b[3]) + (1 - y(i)) * d/dW[3] log(1 - sigma(W[3] a[2] + b[3])) ]

[00:08:31] Just a reminder, the reason we have this is that we wrote the forward propagation in the previous class: we had z[3], which took a[2] as input and computed the linear part, and the sigmoid is the activation function used in the last neuron. [00:08:50] So let's try to compute this derivative. The derivative of the log: log'(x) = 1/x. So I take 1 over sigma(W[3] a[2] + b[3]), and I know this thing can be written a[3], so I will just write 1/a[3] instead of writing the sigmoid again, times the derivative of a[3] with respect to W[3]. Remember: if we take the derivative of log sigma(...) with respect to W, what we have is 1 over the sigmoid, times the derivative of the sigmoid with respect to W[3]. Does that make sense? That's what we're using here. [00:10:03] And the derivative of the sigmoid is actually pretty easy to compute:

sigma'(x) = sigma(x) * (1 - sigma(x))

So taking the derivative is going to give me a[3] times (1 - a[3]). There is still one step, because there is a composition of three functions here.
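The identity sigma'(x) = sigma(x)(1 - sigma(x)) used in this step is easy to verify numerically (a quick sketch):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 101)
analytic = sigmoid(x) * (1 - sigmoid(x))               # sigma'(x) = sigma(x)(1 - sigma(x))
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)  # central difference
```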
[00:10:33] There's a logarithm, there's a sigmoid, and there's also a linear function, W a[2] + b. So I also need to take the derivative of the linear part with respect to W[3], because to differentiate sigma(W[3] a[2] + b[3]) with respect to W[3], I need to go inside and take the derivative of what's inside. So this will give me a[3] times (1 - a[3]) times the derivative of the linear part with respect to W[3], and the derivative of the linear part with respect to W[3] is equal to a[2] transpose. [00:11:42] One thing you may want to check, when I'm trying to compute this derivative: why is there a transpose that comes out? How do you come up with that? You look at the shapes. [00:12:12] What's the shape of W[3]? Someone remembers? One by two. Why one by two? Because it's connecting two neurons to one neuron, so it has to be 1 by 2. And to convince yourself, you can write out your forward propagation, do the shape analysis, and find out that it's a 1 by 2 matrix. [00:12:53] How about this thing, z[3]: what's the shape of that? A scalar, so it's 1 by 1. How do you know? Because z[3] is the linear part of the last neuron, and a[3], we know, is y hat, a scalar between zero and one; so z[3] has to be a scalar as well, because taking the sigmoid should not change the shape. [00:13:16] So now the question is: what's the shape of this entire derivative? The shape of this entire thing should be the shape of W[3], because you're taking the derivative of a scalar with respect to something higher-dimensional,
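The shape analysis being developed here can be checked mechanically (toy numbers; the shapes follow the lecture's two-neurons-into-one-neuron last layer):

```python
import numpy as np

W3 = np.array([[0.5, -0.3]])   # 1x2: connects two neurons to one neuron
a2 = np.array([[0.2], [0.7]])  # 2x1: output of the previous layer
b3 = np.array([[0.1]])         # 1x1

z3 = W3 @ a2 + b3  # linear part of the last neuron: a 1x1 scalar
dz3_dW3 = a2.T     # d z3 / d W3 = a2 transpose, shape 1x2, same as W3
```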
[00:13:37] dimensional matrix or vector here called a row vector then it means that the [00:13:40] a row vector then it means that the shape of this has to be the same shape [00:13:41] shape of this has to be the same shape of w3 so 1 by 2 and you know that when [00:13:45] of w3 so 1 by 2 and you know that when you take this simple derivative in in [00:13:48] you take this simple derivative in in real like in with scalars not with high [00:13:51] real like in with scalars not with high dimensional you know that this is an [00:13:53] dimensional you know that this is an easy derivative it just should it should [00:13:55] easy derivative it just should it should give you a 2 right but in higher [00:13:58] give you a 2 right but in higher dimensions sometimes you have transposed [00:14:00] dimensions sometimes you have transposed that come up and [00:14:01] that come up and you know that the answer is a to [00:14:03] you know that the answer is a to transpose is because you know that a 2 [00:14:05] transpose is because you know that a 2 is a 2 by 1 matrix so this is not [00:14:09] is a 2 by 1 matrix so this is not possible it's not possible to get a 2 [00:14:13] possible it's not possible to get a 2 because otherwise it wouldn't match the [00:14:14] because otherwise it wouldn't match the derivative that you're calculating so it [00:14:16] derivative that you're calculating so it has to be a 2 transpose so either you [00:14:18] has to be a 2 transpose so either you you learn the formula by heart or you [00:14:21] you learn the formula by heart or you you learn how to analyze shapes ok any [00:14:25] you learn how to analyze shapes ok any questions on that so that's why it's a 2 [00:14:31] questions on that so that's why it's a 2 transpose now minus y I so I'm I'm on [00:14:42] transpose now minus y I so I'm I'm on this one now the second term of the of [00:14:45] this one now the second term of the of the derivative and I take the derivative [00:14:47] the derivative 
[00:14:47] And I take the derivative of this: I get 1 over (1 - a3) — a3 denotes the sigmoid, so I'm just copying this back, using the fact that the derivative of the logarithm is 1/x. Then I multiply this by the derivative of (1 - a3) with respect to W3. I know there's a minus sign that needs to come out, so I write a -1 down here; I also have the derivative of the sigmoid with respect to what's inside it, which is a3 times (1 - a3). And what's the last term? The last term is simply the one we just talked about: the derivative of what's inside the sigmoid with respect to W3, so it's a2 transpose again.

[00:15:43] So now I will just simplify. I know this scalar cancels with this one, and this one cancels with that one. I'm going to copy back all the results: y^(i) times (1 - a3) times a2 transpose, plus (1 - y^(i)) times a3 times a2 transpose with a minus — I'm taking the minus and putting it in front. And quickly looking at that, I see that some of the terms will cancel out: the -y^(i) a3 a2-transpose piece cancels with the +y^(i) a3 a2-transpose piece — the terms multiplying this number cancel with the terms multiplying that number. Continuing, this gives me y^(i) times a2 transpose from this part, minus a3 times a2 transpose. I can factor this, because I have the same term a2 transpose in both, and it finally gives (y^(i) - a3) times a2 transpose. So it doesn't look that bad, actually. When we take the derivative of something kind of ugly, we expect something ugly to come out, but this doesn't seem too bad.

[00:17:45] Any questions on that? I'll let you write it down quickly, and then we're going to move to the rest. So once I get this result, I can just write down the cost derivative with respect to W3: I just need to take the summation of this thing, (y^(i) - a3) times a2 transpose, and I have a minus sign coming up front. So that's my derivative. So we're done with that, and we can just take this formula, plug it back into our gradient descent update rule, and update W3.

[00:18:35] Now, you can do the same thing as we just did with b3 — it's going to be a similar difficulty. We're going to do it with W2 now; think about how that backpropagates to W2. So now it's W2's turn: we want to compute the derivative of L, the loss, with respect to W of the second layer. The question is how I'm going to get this one without too much work. I'm not going to start over here; as we said last time, I'm going to use the chain rule of calculus, and try to decompose this derivative into several derivatives.
[00:19:22] I know that y-hat is the first thing that is connected to the loss function — the output neuron is directly connected to the loss function — so I'm going to take the derivative of the loss function with respect to y-hat, also called a3. That's the easiest one I can calculate. I also know that a3, which is the output activation of the last neuron, is connected to the linear part of the last neuron, which is z3, so I can take the derivative of a3 with respect to z3. Do you remember what this is going to be? The derivative of a3 with respect to z3 is the derivative of the sigmoid. I know that a3 equals sigmoid of z3, so this derivative is very simple: it's just a3 times (1 - a3).

[00:20:16] So I'm going to continue. I know that z3 is equal to what? It's equal to W3 a2 plus b3. Which path do I need to take in order to backpropagate? I don't want to take the derivative with respect to W3, because I would get stuck; I don't want to take the derivative with respect to b3, because I would get stuck. I will take the derivative with respect to a2, because a2 will be connected to z2, z2 will be connected to a1, and I can backpropagate along this path. So I'm going to take the derivative of z3 with respect to a2 to have my error backpropagate, and so on: I know that a2 is equal to sigma of z2, so I'm just going to do the same there, and I know that this derivative is going to be easy as well. And finally, I also know that z2 is connected to W2, so I'm going to take the derivative of z2 with respect to W2.

[00:21:16] What I want you to get is the thought process of this chain rule. Why don't we take the derivative with respect to W3 or b3? Because we would get stuck. We want the error to backpropagate, and in order for the error to backpropagate, we have to go through variables that are connected to each other. Does it make sense?

[00:21:39] So now the question is: how can we use the derivative we already have in order to compute the derivative with respect to W2? Can someone tell me how we can use the results from the earlier calculation in order not to do it again? [Student: "You cache it."] So there's another discussion about caching, which is correct — in order to get this result very quickly, we will use a cache — but what I want here is for you to tell me whether that result appears somewhere here. [Student: "The first three terms."] This one, this one, and this one — is it the first two terms or the first three terms? It's the first two terms here, but good intuition. So that result is actually the first two terms here; we just calculated it.
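The path just traced, written out as one chain-rule product (board notation, with a3 standing for y-hat; the annotated factors are the ones the lecture has already evaluated):

```latex
\frac{\partial L}{\partial W_2}
  = \frac{\partial L}{\partial a_3}
    \cdot \underbrace{\frac{\partial a_3}{\partial z_3}}_{a_3(1-a_3)}
    \cdot \frac{\partial z_3}{\partial a_2}
    \cdot \underbrace{\frac{\partial a_2}{\partial z_2}}_{a_2(1-a_2)}
    \cdot \frac{\partial z_2}{\partial W_2}
```

Routing through a2 rather than W3 or b3 is what keeps the error flowing backward: each factor links two variables that are directly connected in the network.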
[00:22:44] Well, how do we know that? It's not easy to see. One thing we know, based on what we've written very big on this board, is that the derivative of z3 — because this is it, right — the derivative of z3 with respect to W3 is a2 transpose. So I could write here that this thing is the derivative of z3 with respect to W3, correct? And I know that, because I wanted to compute the derivative of the loss with respect to W3, I could have written the derivative of the loss with respect to W3 as the derivative of the loss with respect to z3, times the derivative of z3 with respect to W3, correct? And I know that this second factor is a2 transpose, so it means that this remaining thing is the derivative of the loss with respect to z3. Does it make sense? So I got my decomposition of the derivative: if we had wanted to use the chain rule from here on, we could have just separated it into two terms and taken the derivative from there.
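The factor singled out here is dL/dz3, and matching it against (y - a3) a2-transpose says it should equal a3 - y. A quick numerical sanity check of that identity for the sigmoid-plus-cross-entropy pair, with made-up toy numbers:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(z3, y):
    # per-example cross-entropy, written in terms of the pre-activation z3
    a3 = sigmoid(z3)
    return -(y * np.log(a3) + (1 - y) * np.log(1 - a3))

z3, y = 0.7, 1.0                 # toy values
analytic = sigmoid(z3) - y       # the claimed dL/dz3 = a3 - y

eps = 1e-6                       # central finite difference on the loss
numeric = (loss(z3 + eps, y) - loss(z3 - eps, y)) / (2 * eps)
assert abs(analytic - numeric) < 1e-8
```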
[00:23:55] Okay. So I know the result of this thing: I know that it's basically (a3 - y) times a2 transpose — I just flipped it because of the minus sign.

[00:24:19] Okay, now tell me, what is this term? Let's go back — yeah, the sigmoid. I'm just going to write it: a2 times (1 - a2), if that makes sense — sigma times (1 minus sigma). And what is this term? Oh, sorry, my bad, that's not the right one — this one. This one is: a2 is sigmoid of z2, so this result comes from that term. What about this term? Is it W3 or — I heard "transpose" — how do we know if it's W3 or W3 transpose? So let's look at the shape of this. What's z3? It's 1 by 1, a scalar; it's the linear part of the last neuron. What's the shape of that? It's 2 by 1 — we have two neurons in the layer. And W3, we said, was a 1 by 2 matrix, so we have to transpose it: the result of that is W3 transpose.

[00:25:51] And how about the last term? Same as here, one layer before — someone said a1 transpose. Yep. [A student points out a stray transpose on the board.] Oh — yeah, you're correct, thank you, that's what you meant. This one was from dz3/dW3; we didn't end up using that, because we would get stuck, so there's no transpose needed here. Thanks.

[00:26:52] Any other questions or remarks? So that's cool — let's write down our derivative cleanly on the board. We have the derivative of our loss function with respect to W2, which seems to be equal to (a3 - y) from the first term; the second term seems to be equal to W3 transpose; then we have a term which is a2 times (1 - a2); and finally we have another term, which is a1 transpose. [00:28:01] So — are we done or not?
The thing is, there are two ways to compute derivatives: either you go very rigorously and do what we did here for W2, or you try to do a chain-rule analysis and fit the terms together. The problem is that this result is not completely correct — there is a shape problem. It means that when we took our derivatives, we should have flipped some of the terms. We won't have time to go into the details in this lecture, because we have other things to see, but there is a section note on the website, I think, which details the other method — the more rigorous one — like that, for all the derivatives. What we're going to see is how you can use the chain rule plus shape analysis to come up with the result very quickly.

[00:28:50] Okay, so let's analyze the shapes of all of that. We know that the first term is a scalar, so 1 by 1. We know that the second term is the transpose of a 1 by 2, so it's 2 by 1. And we know that this thing here, a2 times (1 - a2), is 2 by 1 — it's an element-wise product. And this one is a1 transpose — a 3 by 1, transposed, so it's 1 by 3. So there seems to be a problem here: there is no match between these two operations, for example. So the question is, how can we put everything together? If we do it very rigorously, we know how to put it together; if you're used to doing the chain rule, you can quickly do it — after some experience you will be able to fit all of these together. The important thing to know is that there is an element-wise product here, which is this one: every time you take the derivative of the sigmoid, it's going to end up being an element-wise product, and that's the case whatever activation function you're using.

[00:30:09] So the right result is this one. Here I have my element-wise product of a 2 by 1 with a 2 by 1, so it gives me a 2 by 1 column vector, and then I need something that is 1 by 1 and something that is 1 by 3. How do I know what I need to have? I know that the shape of this thing, the derivative with respect to W2, needs to be 2 by 3 — W2 is connecting three neurons to two neurons, so W2 has to be 2 by 3. In order to end up with that, I know that (a3 - y) has to come here, and a1 transpose comes again at the end, and here I get my correct answer.

[00:31:10] Don't worry if it's the first time you're doing the chain rule and it's going quickly — don't worry, read the lecture notes with the rigorous way of taking the derivative; it will make more sense. But I feel that, in practice, we usually don't compute these chain rules by hand anymore, because programming frameworks do it for us. It's important, though, to know at least how the chain rule decomposes, and also how to compute these derivatives, if you read research papers. Any questions on that?
Going back to the cache that was mentioned — why is the cache so important? That was your question as well, right? [Answering a question:] Yeah, it has to be — when you take the derivative of the sigmoid, you take the derivative with respect to every entry of the matrix, which gives you an element-wise product.

[00:32:09] So, going back to the cache: one thing is, it seems that during backpropagation there are a lot of terms that appear that were computed during forward propagation — a1 transpose, a2, a3 — all of these we have from the forward propagation. So if we don't cache anything, we have to recompute them. It means I'm going backwards, but then I realize, oh, I actually need a2, so I have to go forward again to get a2; I go backwards, I need a1, so I need to forward-propagate my x again to get a1. I don't want to do that. So in order to avoid it, when I do my forward propagation, I keep in memory almost all the values that I'm getting.
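A sketch of a forward pass that stores such a cache. Sizes follow the lecture's 3-2-1 layer structure; the input size of 2 is made up for the sketch, and the dictionary layout is just one way to organize it:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params, n_layers=3):
    """Forward pass that keeps everything backprop will need."""
    cache = {"a0": x}
    for l in range(1, n_layers + 1):
        z = params["W%d" % l] @ cache["a%d" % (l - 1)] + params["b%d" % l]
        cache["z%d" % l] = z              # linear parts z1..z3
        cache["a%d" % l] = sigmoid(z)     # activations a1..a3
    return cache["a%d" % n_layers], cache

rng = np.random.default_rng(0)
n_in = 2  # input dimension (made up for this sketch)
params = {
    "W1": rng.standard_normal((3, n_in)), "b1": rng.standard_normal((3, 1)),
    "W2": rng.standard_normal((2, 3)),    "b2": rng.standard_normal((2, 1)),
    "W3": rng.standard_normal((1, 2)),    "b3": rng.standard_normal((1, 1)),
}
x = rng.standard_normal((n_in, 1))
y_hat, cache = forward(x, params)
# backprop now reads a1, a2, a3 and z1..z3 (plus the W's from params)
# from the cache instead of re-running the forward pass each time
assert y_hat.shape == (1, 1) and cache["a2"].shape == (2, 1)
```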
including the [00:32:50] values that I'm getting including the W's because as you see to compute the [00:32:52] W's because as you see to compute the derivative of loss with respect W to we [00:32:54] derivative of loss with respect W to we need W 3 but also the activation or [00:32:59] need W 3 but also the activation or linear variables so I'm going to save [00:33:02] linear variables so I'm going to save them in my in my network during the for [00:33:06] them in my in my network during the for propagation in order to use it during [00:33:07] propagation in order to use it during the backward propagation that make sense [00:33:10] the backward propagation that make sense and again it's all for computation [00:33:14] and again it's all for computation efficiency it has some memory cost [00:33:22] okay so that was the backpropagation [00:33:25] okay so that was the backpropagation and now I can use my formula of the cost [00:33:30] and now I can use my formula of the cost with respect to the last function and I [00:33:37] with respect to the last function and I know that this is going to be my update [00:33:43] this is going to be used in order to [00:33:45] this is going to be used in order to update w2 and I will do the same for w1 [00:33:48] update w2 and I will do the same for w1 then you guys can do it at home if you [00:33:51] then you guys can do it at home if you want to make sure you understood take [00:33:52] want to make sure you understood take the derivative with respect to w1 okay [00:34:02] the derivative with respect to w1 okay so let's move on to the next part which [00:34:05] so let's move on to the next part which is improving your neural network so in [00:34:16] is improving your neural network so in practice when you when you do this [00:34:18] practice when you when you do this process of training for propagation [00:34:20] process of training for propagation backward propagation updates you don't [00:34:23] backward propagation updates you 
[00:34:26] In order to get a good network, you need to improve it: you need to use a bunch of techniques that will make your network work in practice. The first trick is to use different activation functions.

[00:34:45] Together we've seen one activation function, which was the sigmoid, and we remember the graph of the sigmoid: it's taking a number between minus infinity and plus infinity and casting it between zero and one. We know that the formula is sigmoid(z) = 1 / (1 + e^(-z)), and we also know that the derivative of the sigmoid is sigmoid(z) times (1 - sigmoid(z)).

[00:35:19] Another very common activation function is ReLU — we talked quickly about it last time: ReLU(z) is equal to 0 if z is less than zero, and z if z is positive. The graph of ReLU looks something like this.

[00:35:51] And finally, another one we were using commonly as well is tanh, the hyperbolic tangent: tanh(z) equals (e^z - e^(-z)) / (e^z + e^(-z)). The derivative of tanh is 1 - tanh²(z), and the graph looks kind of like the sigmoid, but it goes between minus one and plus one.

[00:36:40] So, one question, now that I've given you three activation functions: can you guess why we would use one instead of the other, and which one has more benefits? When I talk about activation functions, I'm talking about the functions that you will put in these neurons after the linear parts. What do you think is the main advantage of the sigmoid? [Student answer.] Yeah — you use it for classification; it gives you a probability. What's the main disadvantage of the sigmoid? [Student: "It's easy."] That should be an advantage — that should be a benefit. [Another answer.] Yeah, correct: at high activations — if you're at high z's or low z's — your gradient is very close to zero.
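The three activations and their derivatives from the board, as numpy one-liners — a sketch; evaluating the derivatives at a few points previews the saturation point just raised:

```python
import numpy as np

def sigmoid(z):   return 1.0 / (1.0 + np.exp(-z))
def d_sigmoid(z): s = sigmoid(z); return s * (1.0 - s)

def d_tanh(z):    return 1.0 - np.tanh(z) ** 2   # tanh itself is np.tanh

def relu(z):      return np.maximum(0.0, z)
def d_relu(z):    return (z > 0).astype(float)   # indicator 1{z > 0}

z = np.array([-10.0, 0.0, 10.0])
print(d_sigmoid(z))  # tiny at large |z|: the sigmoid saturates
print(d_tanh(z))     # saturates the same way, just over (-1, 1)
print(d_relu(z))     # exactly 1 for any positive z, 0 otherwise
```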
[00:37:51] So look here: based on this graph we know that if z is very big, our gradient is going to be very small; the slope of this graph is very, very small, it's almost flat. Same for z's that are very low in the negatives, right? What's the problem with having low gradients? When I'm back-propagating, if the z I calculated was big, the gradient is going to be very small, and it will be super hard to update my parameters that are early in the network, because the gradient is just going to vanish. Does that make sense? So sigmoid is one of these activations which works very well in the linear regime but has trouble in the saturating regimes, because the network doesn't update the parameters properly; it goes very, very slowly. We're going to talk about that a little more. How about tanh? Very similar, right: high z's and low z's
[00:38:51] lead to saturation of a tanh activation. ReLU, on the other hand, doesn't have this problem. If z is very big in the positives there is no saturation; the gradient just passes through, and the gradient is one when we're here, right, the slope is equal to one. So it's actually just routing the gradient to some entry; it's not multiplying it by anything when you back-propagate. So you know these terms here, the a3 * (1 - a3) or a2 * (1 - a2) factors: if we use ReLU activations, we replace these with the derivative of ReLU, and the derivative of ReLU can be written as the indicator function of z being positive. You've seen indicator functions; this is equal to 1 if z is positive, 0 otherwise. Okay, so we will see why we use ReLU mostly. Yeah. You remember the house prediction
[00:40:05] example: in that case, if you only predict the price of a house based on some features, you would use ReLU, because you know that the output should be a positive number between 0 and plus infinity; it doesn't make sense to use tanh or sigmoid. Yeah, it doesn't really matter, I think. If I want my output to be between 0 and 1, I would use sigmoid; if I want my output to be between minus 1 and 1, I would use tanh. So you know, there are some tasks where the output is kind of a reward or a negative reward that you want to get, like in reinforcement learning; you would use tanh as an output activation, because minus 1 looks like a negative reward, plus 1 looks like a positive reward, and you want to decide what the reward should be. Good question: why do we consider these functions? We can actually consider any function apart from the identity function, so let's see why. Thanks for the
transition. [00:41:13] So why do we need activation functions? Let's assume that we have a network which is the same as before, so our network is three neurons casting into two neurons casting into one neuron, and we're trying to use activations all equal to identity functions, which means z is mapped to z. Let's try to derive the forward propagation: y_hat = a3 = z3 = W3 * a2 + b3. I know that a2 is equal to z2, because there is no activation, and z2 is equal to W2 * a1 + b2, so I can substitute here: W3 * (W2 * a1 + b2) + b3. I can continue: I know that a1 is equal to z1, and I know that z1 is W1 * x + b1. [00:43:25] In the end you get y_hat = W * x + b, where W equals W3 times W2 times W1, and b equals W3 times W2 times b1, plus W3 times b2, plus b3. So what's the insight here? It's that we need activation functions. The reason is, if you don't use activation functions, no matter how deep your network is, it's going to be equivalent to a linear
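To check the algebra, here is a small NumPy sketch of the 3-2-1 network above with identity activations (the specific weight values are made up), showing that the layer-by-layer computation matches the single collapsed linear map:

```python
import numpy as np

# Identity activations: y_hat = W3(W2(W1 x + b1) + b2) + b3 collapses to
# W x + b, with W = W3 W2 W1 and b = W3 W2 b1 + W3 b2 + b3.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=(3, 1))  # 2 inputs -> 3 neurons
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=(2, 1))  # 3 -> 2 neurons
W3, b3 = rng.normal(size=(1, 2)), rng.normal(size=(1, 1))  # 2 -> 1 neuron
x = rng.normal(size=(2, 1))

deep = W3 @ (W2 @ (W1 @ x + b1) + b2) + b3   # layer by layer
W = W3 @ W2 @ W1                             # collapsed weight
b = W3 @ W2 @ b1 + W3 @ b2 + b3              # collapsed bias
flat = W @ x + b
print(np.allclose(deep, flat))  # True: the deep network is just one linear map
```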
regression. [00:44:07] So the complexity of the network comes from the activation function. The reason we can understand it: if we're trying to detect cats, what we're trying to do is train a network that will mimic the formula for detecting cats. We don't know this formula, so we want to mimic it using a lot of parameters. If we just have a linear regression, we cannot mimic this, because we're going to look pixel by pixel and assign a weight to a certain pixel; if I give a new example, it's not gonna work anymore. Yeah, so I think that goes back to your question as well. So this is why we need activation functions. And then the question was: can we use different activation functions, and how do we put them inside a layer or inside neurons? There are more activation functions; I think in CS230 we go over a
few more, but not today. These have been designed with experience, so these are the ones that work better and let our networks train; there are plenty of other activation functions that have been tested. Usually you would use the same activation function inside every layer. It's for training; it doesn't have any special reason, I think. But when you have a network like that, you would call this layer a ReLU layer, meaning it's a fully connected layer with ReLU activation; this one a sigmoid layer, meaning it's a fully connected layer with the sigmoid activation; and the last one is sigmoid. I think people have tried a lot of things, putting different activations in different neurons in a layer, in different layers, and the consensus was using one activation in a layer, and also using one of
these three activations. Yeah, so if someone comes up with a better activation that is obviously helping train our models on different data sets, people would adopt it, but right now these are the ones that work better. [00:46:24] You know, last time we talked about hyperparameters a little bit; these are all hyperparameters, so in practice you're not going to choose these randomly. You're going to try a bunch of them and choose the ones that seem to help your model train. There are a lot of experimental results in deep learning, and we don't really understand fully why certain activations work better than others. Okay, let's move on. [00:47:14] Okay, let's go over initialization techniques. Let me use this board. So another trick that you can use in order to help your network train is initialization methods and normalization methods. [00:48:07] Earlier we talked about the fact that if z is too big, or z is too low in the negative
numbers, [00:48:13] it will lead to saturation of the network, so in order to avoid that you can use normalization of the input. Assume that you have a network where the data is two-dimensional: x1, x2 is your two-dimensional input. You can assume that (x1, x2) is distributed like this thing; if I plot x1 against x2 for a lot of data, I will get that type of graph. The problem is, when I do my W * x + b to compute my z1, if the x's are very big it will lead to very big z's, which will lead to saturated activations. In order to avoid that, one method is to compute the mean of this data using mu = (1/m) * sum over i of x^(i), where m is the number of examples you have in the training set. That just gives you the mean for x1 and the mean for x2. Then you would compute the operation x := x - mu, and you will get that type of plot if you re-plot the transformed data, let's say x1 tilde, x2 tilde. So here it's a little
better, but it's still not good. In order to solve the problem fully, you're going to compute sigma squared, which is basically the standard deviation squared, so the variance of the data, and then you will divide by sigma. [00:50:17] So you would do that, and you would make the transformation x := x / sigma, and it will give you a graph that is centered and standardized. So you usually prefer to work with centered data. Yeah, sorry? Oh yeah, yeah, sorry. Great. So if we subtract the mean of x1 and x2, it should look like this but be centered, okay, and then if you standardize it, it looks like something like that. So why is it better? Because if you look at your loss function now: before, the loss function would look like something like this, [00:51:33] and after normalizing the input, it may look like something like this. So what's the
difference between these two loss functions; why is this one easier to train? It's because if you have a starting point that is here, let's say, your gradient descent algorithm is going to go towards approximately the steepest slope. So you're going to go there, and then this one is going to go there, and then you're going to go there, and then you're going to go there, like that, and so on, until you end up at the right point. But the steepest slope in this loss contour is always pointing towards the middle, so if you start somewhere, you will directly go towards the minimum of your loss function. So that's why it's usually helpful to normalize. This is one method, and in practice the way you initialize your weights is very important. Yeah? [00:52:38] Yes, exactly. So here I used a very simple case, but you would divide element-wise by the sigma here, okay; like every entry of your matrix, you would divide it by the sigma.
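A toy numeric version of the contour argument (the quadratic losses, step sizes, and tolerance below are invented for the demo): gradient descent on an elongated bowl needs far more steps than on a round one, because the step size is limited by the steep direction while the shallow direction crawls.

```python
import numpy as np

def gd_steps(curvatures, lr, tol=1e-3, max_iter=10000):
    # Gradient descent on f(w) = 0.5 * sum(c_i * w_i^2); the gradient is c * w.
    w = np.array([1.0, 1.0])
    for k in range(max_iter):
        if np.linalg.norm(w) < tol:
            return k
        w = w - lr * curvatures * w
    return max_iter

# Elongated contours (like unnormalized inputs): small step forced by the steep axis.
steps_elongated = gd_steps(np.array([1.0, 100.0]), lr=0.01)
# Round contours (like normalized inputs): one well-sized step heads straight in.
steps_round = gd_steps(np.array([1.0, 1.0]), lr=1.0)
print(steps_elongated, steps_round)
```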
[00:53:01] One other thing that is important to notice: this sigma and mu are computed over the training set. You have a training set; you compute the mean of the training set and the standard deviation of the training set, and this sigma and mu have to be used on the test set as well. It means, now that you want to test your algorithm on the test set, you should not compute the mean of the test set and the standard deviation of the test set and normalize your test input through the network. Instead, you should use the mu and the sigma that were computed on the train set, because your network is used to seeing this type of transformation as an input. You want the distribution of the input at the first layer to be always the same, no matter if it's the train or the test set. [00:53:48] Yeah, likely this leads to fewer iterations. Okay, we have a lot to see, so I will
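A short sketch of this normalization recipe (the data here is made up; the key point is that mu and sigma come from the training set only):

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = rng.normal(loc=5.0, scale=3.0, size=(1000, 2))  # made-up train data
x_test = rng.normal(loc=5.0, scale=3.0, size=(200, 2))    # made-up test data

mu = x_train.mean(axis=0)     # per-feature mean, computed on the train set only
sigma = x_train.std(axis=0)   # per-feature standard deviation, train set only

x_train_norm = (x_train - mu) / sigma   # element-wise, as described
x_test_norm = (x_test - mu) / sigma     # reuse the SAME train statistics
```

The test set is deliberately transformed with the train-set mu and sigma, so the first layer sees the same input distribution at train and test time.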
skip a few questions. [00:54:03] So let's delve a little more into vanishing and exploding gradients. In order to get an intuition of why we have this vanishing or exploding gradient problem, we can consider a network which is very, very deep and has a two-dimensional input, and so on. Let's say we have ten layers in total, 10 layers plus an output layer. Assume all the activations are identity functions, and assume that the biases are equal to 0. If you compute y_hat, the output of the network, with respect to the input, you know that y_hat would be equal to W^[L], where capital L denotes the last layer, times a^[L-1], plus b^[L]; but b^[L] is 0, so we can remove it: W^[L] * a^[L-1]. You know that a^[L-1] is W^[L-1] times a^[L-2], because the activation is an identity function, and so on. You can go back
and you will get that y_hat = W^[L] * W^[L-1] * ... * W^[1] * x. You get something like that, right? So now let's consider two cases. Consider the case where the W matrices are a little bigger than the identity, a little larger than the identity matrix in terms of values; let's say every W^[l], and these are two-by-two matrices, right, is the matrix [[1.5, 0], [0, 1.5]]. What's the consequence? The consequence is that this whole product here is going to be equal to [[1.5^L, 0], [0, 1.5^L]], and it will make the value of y_hat explode, just because this number is a tiny little bit more than one. Same phenomenon if we had 0.5 instead of 1.5 here: the multiplicative value of all these matrices will be 0.5^L
here, [00:57:11] and y_hat will always be very close to zero. So you see, the issue with vanishing and exploding gradients is that all the terms get multiplied by each other, and if you end up with numbers that are smaller than 1, you will get a totally vanished gradient when you go back; if you have values that are a little bigger than 1, you will get an exploding gradient. We did it as a forward propagation equation, but we could have done exactly the same analysis with the derivatives, assuming the derivatives of the weight matrices are a little lower than the identity or a little higher than the identity. So we want to avoid that, and one way, which is not perfect, to avoid this is to initialize your weights properly: initialize them into the right range of values. So you agree that we would prefer the weights to be around 1, as close as possible to 1.
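The 1.5-versus-0.5 argument in numbers, as a minimal sketch with L = 10 as in the example:

```python
import numpy as np

L = 10
W_big = 1.5 * np.eye(2)     # a little larger than the identity
W_small = 0.5 * np.eye(2)   # a little smaller than the identity
x = np.ones((2, 1))

y_big = np.linalg.matrix_power(W_big, L) @ x      # entries are 1.5**10, about 57.7
y_small = np.linalg.matrix_power(W_small, L) @ x  # entries are 0.5**10, about 0.001
print(y_big.ravel(), y_small.ravel())
```

Even at only ten layers the output is already off by orders of magnitude in both directions; at greater depth the effect compounds exponentially.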
If they're very close to 1, we can probably avoid the vanishing and exploding gradient problem. [00:58:17] So let's look at the initialization problem. The first thing to look at is the example of one neuron. If you consider this neuron here, which has a bunch of inputs and outputs an activation a, you know that the equation inside the neuron is a equals some function of z, let's say sigmoid of z, and you know that z is equal to w1 * x1 + w2 * x2 + ... + wn * xn, so it's a dot product between the w's and the x's. The interesting thing to notice is that we have n terms here. In order for z to not explode, we would like all of these terms to be small. If the w's are too big, then this sum will explode with the size of the input of the layer. So instead, if we have a large n, meaning the input is very large, what we want is very small wi's: the larger n,
the smaller wi has to be. So based on this intuition, it seems that it would be a good idea to initialize the wi's with something that is close to 1/n. We have n terms; the more terms we have, the more likely z is going to be big. But if our initialization says the more terms you have, the smaller the value of the weights, we should be able to keep z in a certain range that is appropriate to avoid vanishing and exploding gradients. So this seems to be a possible initialization scheme. [01:00:22] In practice, I'm going to write a few initialization schemes that we're not going to prove; if you're interested in seeing more proofs of that, you can take CS230, where we prove these initialization schemes. I'll take down the board. [01:00:48] So there are a few initializations that are commonly used, and again, this is very practical; people have been testing a lot of initializations, but they ended up using those. One is
to initialize the weights, and I'm writing the code for those of you who know NumPy, I'm not going to compile it here: W = np.random.randn(shape) * np.sqrt(1 / n^[l-1]), with whatever shape you're using, element-wise times the square root of one over n^[l-1]. So what does that mean? It means I will look at the number of inputs; I'm writing n^[l-1] here, and the l-1 means I'm looking at how many inputs are coming into my layer, assuming we're at layer l. I'm going to initialize the weights of this layer proportionally to the number of inputs that are coming in, so the intuition is very similar to what we described there. This initialization has been shown to work very well for sigmoid activations. What's interesting is, if you use ReLU, it's been observed that putting a two here instead of a one makes the network train better, and again, it's very practical.
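The board's line of NumPy, written out (the layer sizes are made up for the demo; n_prev plays the role of n^[l-1]):

```python
import numpy as np

np.random.seed(0)
n_prev, n_curr = 500, 100   # made-up layer sizes; n_prev is n^[l-1]

# sqrt(1/n) scaling, the variant that works well with sigmoid activations
W_sigmoid_layer = np.random.randn(n_curr, n_prev) * np.sqrt(1.0 / n_prev)

# sqrt(2/n) scaling, the "put a two instead of a one" variant for ReLU
W_relu_layer = np.random.randn(n_curr, n_prev) * np.sqrt(2.0 / n_prev)

# More incoming units means smaller initial weights, keeping z = w . x in range.
print(W_sigmoid_layer.std(), W_relu_layer.std())
```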
And again, it's very practical; this is one of the areas where we need more theory, but a lot of observations have been made so far. If you guys want to take that as a project and see why it happens, it would be interesting. [01:02:42] Okay, and finally there is a more common one, called Xavier initialization, which proposes to initialize the weights using the square root of 1 over n^[l-1], for tanh. And there is another one, which I believe is the Glorot initialization, that recommends initializing the weights of a layer using the following formula. Quickly, the intuition behind this last one, which is very often used: we're doing the same thing, but also for the backpropagated gradients. The weights are going to multiply the backpropagated gradient, so we also need to look at how many inputs we have during backpropagation: n^[l] is the number of inputs you have during backward propagation, and n^[l-1] is the number of inputs you have during forward propagation, so we take an average of those. [01:04:16] And the reason we have a random function here is that if you don't initialize your weights randomly, you end up with a problem called the symmetry problem, where every neuron learns roughly the same thing. To avoid that, you make the neurons start at different places and let them evolve as independently from each other as possible. [01:04:37] So now we have two choices: either we go over regularization or over optimization. How much have you talked about regularization so far? L1, L2, early stopping, all that? Everybody remembers it a little bit, so let's go over optimization, I guess, and then we'll do some regularization depending on the time we have.
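For reference, the schemes above can be sketched in NumPy. The √(1/n^[l-1]) form and the ReLU variant (a 2 instead of a 1) are as described in the lecture; the exact formula for the last scheme was written on the board and is not in the transcript, so the standard Glorot form √(2/(n^[l-1]+n^[l])), which matches the "average the forward and backward fan" description, is an assumption here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_prev, n_curr = 300, 100        # n^[l-1] inputs into layer l, n^[l] units in layer l
shape = (n_curr, n_prev)

# sqrt(1 / n^[l-1]): shown to work well with sigmoid activations
w_sigmoid = rng.standard_normal(shape) * np.sqrt(1.0 / n_prev)

# He-style variant: a 2 instead of a 1, observed to train better with ReLU
w_relu = rng.standard_normal(shape) * np.sqrt(2.0 / n_prev)

# fan-averaging form (assumed Glorot formula, not in the transcript):
# accounts for the number of inputs in both the forward and the backward pass
w_avg = rng.standard_normal(shape) * np.sqrt(2.0 / (n_prev + n_curr))

print(w_sigmoid.std(), w_relu.std(), w_avg.std())
```

The random draw itself is what breaks the symmetry problem described above; the multiplier only sets the scale.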
[01:05:11] So I believe so far you've seen gradient descent and stochastic gradient descent as two possible optimization algorithms. In practice, there is a trade-off between these two, called mini-batch gradient descent. What is the trade-off? Batch gradient descent is cool because you can use vectorization: you can take a batch of inputs and forward propagate them all at once using vectorized code. Stochastic gradient descent's advantage is that the updates are very quick. Imagine you have a data set with 1 million images and you want to do batch gradient descent; you know how long it's going to take to do one update? Very long. We don't want that, because maybe we don't need to go over the full data set to get a good update; maybe an update based on a thousand examples already gives us the right direction for the gradient. It won't be as good as one based on a million examples, which would be a very good approximation, but that's why most people use mini-batch gradient descent: you get a trade-off between stochasticity and vectorization. [01:06:15] In terms of notation, I'm going to call X the matrix (x^(1), x^(2), ..., x^(m)), and capital Y the same matrix with the y's. We have m training examples, and I'm going to split them into batches. I'll call the first batch X^{1}, and so on up to X^{T}. X^{1} might contain x^(1) through x^(1000), assuming a batch of a thousand examples; X^{2} then contains x^(1001) through x^(2000), and so on. That's the notation for a batch when I use curly brackets; same for Y. [01:07:33] So in terms of the algorithm, how does mini-batch gradient descent work? We're going to iterate.
For t from 1 to however many iterations you want to do: select a batch (X^{t}, Y^{t}), forward propagate the batch, and backpropagate the batch. [01:08:23] By forward propagation, I mean you send the whole batch through the network, compute the loss function for every example of the batch, sum them together, and compute the cost function over the entire batch, which is the average of the loss functions. Assuming the batch is of size 1,000, the cost is the average of the 1,000 per-example losses. [01:09:00] And after the backpropagation, of course, you update W^[l] and b^[l] for all l, for all the layers; that's the usual update equation. [01:09:30] In terms of graphs, what you're likely to see is that for batch gradient descent, your cost function J decreases smoothly when you plot it against the number of iterations. On the other hand, with mini-batch gradient descent you're most likely to see something that is also decreasing as a trend, but jagged, because the gradient is approximated and doesn't necessarily go straight toward the lowest point of the loss function. The smaller the batch, the more stochasticity, so the more noise you'll see on your cost-function graph. [01:10:29] And if we plot the loss function again, as a top view of the contours, assuming we're in two dimensions, stochastic gradient descent and batch gradient descent trace different paths: there seem to be fewer iterations with the red algorithm, but each of its iterations is much heavier to compute, while each of the green iterations is very, very quick. This is the trade-off.
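The loop described above can be put into code. This is a minimal sketch of my own on a toy least-squares model (the lecture's actual network and cost are not in the transcript); it splits m examples into batches of 1,000, forward propagates each batch, averages the gradient over the batch, and updates after every batch.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 5000, 10
X = rng.standard_normal((m, d))
true_w = rng.standard_normal(d)
y = X @ true_w + 0.1 * rng.standard_normal(m)   # noisy linear data (assumed toy setup)

w = np.zeros(d)
alpha, batch_size = 0.1, 1000

for t in range(50):                  # "for t = 1 to however many iterations you want"
    idx = rng.permutation(m)         # shuffle, then walk through the batches
    for start in range(0, m, batch_size):
        b = idx[start:start + batch_size]       # select a batch X^{t}, Y^{t}
        pred = X[b] @ w                         # forward propagate the batch
        grad = X[b].T @ (pred - y[b]) / len(b)  # gradient of the batch-averaged cost
        w -= alpha * grad                       # update after each batch

cost = np.mean((X @ w - y) ** 2) / 2
print(cost)
```

Each update uses only 1,000 examples, so it is cheap and vectorized, yet the cost still decreases, noisily, toward the same minimum a full-batch method would reach.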
Now there is another algorithm I want to go over, called the momentum algorithm, sometimes called gradient descent plus momentum. [01:11:32] So what's the intuition behind momentum? Let's look at this loss contour plot; I'm drawing an extreme case just to illustrate the intuition. Assume you have a loss that is very extended in one direction: this direction is very elongated and the other one is smaller, and you're starting at a point like this one. The gradient descent algorithm by itself is going to follow a path that is orthogonal to the current contour of the loss: it goes there, and then there, and then there, and so on, oscillating. What you would like is to move faster along the horizontal axis and slower along the vertical one: on this axis you'd like smaller updates, and on this axis larger updates. If that happened, we would probably end up at the minimum much quicker than we currently do. [01:12:58] To do that, we're going to use a technique called momentum, which looks at the past gradients, that is, the past updates. Assume we're somewhere here. Gradient descent doesn't look at its past at all: it computes the forward propagation, computes the backprop, looks at the direction, and goes in that direction. What momentum says is: look at the past updates you did, and take them into account to find the right way to go. If you take an average of the past updates, you average an update going up with the update after it going down; the average on the vertical axis is going to be small, because one went up and one went down, but on the horizontal axis both went in the same direction, so the update doesn't change much on that axis. So you're most likely to follow a much more direct path if you use momentum. [01:14:01] Does the intuition make sense? That's why we want to use momentum. For those of you who do physics, you can think of it like physical momentum, inertia: if you launch a rocket and want to turn it quickly, it's not going to move, because the rocket has a certain mass and a certain momentum; you cannot change its direction very noisily. [01:14:39] So let's look at the implementation of momentum gradient descent; I believe we're almost done, right? Yeah, okay, so let's look at it quickly. Gradient descent was W := W − α ∂L/∂W. What we're going to do is use another variable, called the velocity, which is a running average of the previous velocity and the current weight update. Instead of the update being the derivative directly, we update the velocity: the velocity is a variable that tracks the direction we should take, considering the current update and also the past updates, weighted by a factor β. The interesting point is that in terms of implementation it's one more line of code, and in terms of memory it's just one additional variable, and yet it has a big impact on the optimization. [01:15:54] There are many more optimization algorithms that we're not going to see together today; in CS 230 we teach RMSprop and Adam.
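The velocity update described above, as a small sketch (my own illustration; β = 0.9 and the elongated quadratic loss are assumed, chosen to match the contour intuition): one extra variable, one extra line of code.

```python
import numpy as np

def grad(w):
    # gradient of an elongated quadratic loss 0.5 * (25*w0^2 + w1^2):
    # steep in one direction, shallow in the other, like the contour plot
    return np.array([25.0 * w[0], 1.0 * w[1]])

w = np.array([1.0, 1.0])
v = np.zeros(2)          # the velocity: the one additional variable
alpha, beta = 0.02, 0.9

for _ in range(200):
    v = beta * v + (1 - beta) * grad(w)  # running average of past updates
    w = w - alpha * v                    # step along the smoothed direction

print(w)  # close to the minimum at the origin
```

The averaging damps the oscillation along the steep axis (successive gradients cancel there) while the consistent component along the shallow axis survives, which is exactly the behavior sketched on the board.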
Those are probably the optimizers used the most in deep learning, and the reason is that if you come up with an optimization algorithm, you still have to prove that it works very well on a wide variety of applications before researchers adopt it for their research. Adam brings momentum into the optimization algorithm. Okay, thanks guys, and that's all for deep learning in CS229 so far.

================================================================================
LECTURE 013
================================================================================
Lecture 13 - Debugging ML Models and Error Analysis | Stanford CS229: Machine Learning (Autumn 2018)
Source: https://www.youtube.com/watch?v=ORrStCArmP4
---
Transcript

[00:00:04] Okay, happy Halloween. What I want to do today is share with you advice for applying machine learning; we've probably alluded to this before. I think over the last several weeks you've learned a lot about the mechanics of how to build different learning algorithms, everything from logistic regression to SVMs, random forests, and neural networks, and what I want to do today is share with you some principles for helping you become efficient at how you apply all of these things to solve whatever application problem you might want to work on. [00:00:44] A lot of today's material is not that mathematical, but it's also some of the hardest material in this class to understand. It turns out that when you give advice on how to apply a learning algorithm, such as "don't waste lots of time collecting data unless you have confidence it's useful to actually spend all that time," people easily agree: of course you shouldn't waste time collecting lots of data unless you have some confidence it's actually a good use of your time. That's a very easy thing to agree with. But the hard thing is when you go home today and you're actually
you're actually working on your class project right to [00:01:24] working on your class project right to apply the principles we talked about [00:01:26] apply the principles we talked about today when you're actually on the ground [00:01:27] today when you're actually on the ground talking to your teammates saying alright [00:01:29] talking to your teammates saying alright do we collect more data for our class [00:01:31] do we collect more data for our class project now or not to make the right [00:01:33] project now or not to make the right judgment call for that to map the [00:01:35] judgment call for that to map the concepts you learn today so when you're [00:01:37] concepts you learn today so when you're actually in the hot seat you know making [00:01:39] actually in the hot seat you know making a decision do we go and spend another [00:01:41] a decision do we go and spend another two days scraping data off the internet [00:01:43] two days scraping data off the internet or do goons tune this out tune this [00:01:45] or do goons tune this out tune this parameters algorithm to actually make [00:01:46] parameters algorithm to actually make those decisions it's actually it it [00:01:49] those decisions it's actually it it often takes a lot of careful thinking to [00:01:52] often takes a lot of careful thinking to make the mapping from the principles we [00:01:54] make the mapping from the principles we talked about today and they prepare all [00:01:55] talked about today and they prepare all of you go yep that makes sense but they [00:01:57] of you go yep that makes sense but they actually do that when you're in the hot [00:01:59] actually do that when you're in the hot seat making the decisions that that's [00:02:01] seat making the decisions that that's something that [00:02:02] something that we often take take some careful thought [00:02:04] we often take take some careful thought I guess and I think you know for a long [00:02:09] I guess and I think you know 
for a long time [00:02:10] time positive machine learning have been an [00:02:12] positive machine learning have been an art right we're you know we'll go [00:02:14] art right we're you know we'll go through these people that have been [00:02:15] through these people that have been doing it for 30 years and we say hey my [00:02:17] doing it for 30 years and we say hey my learning algorithm doesn't work you know [00:02:20] learning algorithm doesn't work you know what do we do now and then they would [00:02:22] what do we do now and then they would have some judgment or you go people to [00:02:24] have some judgment or you go people to ask me and for some reason because we've [00:02:26] ask me and for some reason because we've done it for a long time we'll say oh [00:02:28] done it for a long time we'll say oh yeah get more dates or tune that [00:02:29] yeah get more dates or tune that parameter or try a new network of big [00:02:31] parameter or try a new network of big hidden units and for some reason that [00:02:33] hidden units and for some reason that work and what I hope to do today is turn [00:02:36] work and what I hope to do today is turn that black magic that hot that that arts [00:02:39] that black magic that hot that that arts into much of a science so that you can [00:02:40] into much of a science so that you can much more systematic make these [00:02:42] much more systematic make these decisions yourself rather than talk to [00:02:44] decisions yourself rather than talk to someone there's done this for 30 years [00:02:47] someone there's done this for 30 years then that for some reason is able to [00:02:49] then that for some reason is able to give you the good recommendations even [00:02:51] give you the good recommendations even if you know but turn it from more of a [00:02:55] if you know but turn it from more of a black art into a more of a systematic [00:02:57] black art into a more of a systematic engineering discipline um and and just a 
[00:03:00] And just one note: some of what I'll say today is not the best approach for developing novel machine learning research. If your main goal is to write research papers, some of what I'll say will apply and some will not; I'll come back to that later. Most of today is focused on how you build stuff that works, how you build applications that work. [00:03:23] So the three key ideas you'll see today: first is diagnostics for debugging learning algorithms. One thing you might not know, or, if you've worked on a class project, maybe you know this already, is that when you implement a learning algorithm for the first time, it almost never works, right, not the first time. I still remember a weekend about a year ago when I implemented softmax regression on my laptop, and it worked the first time. Even to this day I remember that feeling of surprise: I knew there had to be a bug, I went in to try to find the bug, and there wasn't one. It is so rare that a learning algorithm works the first time that I still remember it a year later. [00:04:10] So a lot of the workflow of developing learning algorithms actually feels like a debugging workflow, and my hope is that you become systematic at that. The two key ideas here are error analysis and ablative analysis: how to analyze the errors of your learning algorithm to understand what's not working, which is error analysis, and how to understand what is working, which is ablative analysis. And then finally, some philosophies on how to get started on a machine learning project, such as your class project. [00:04:41] Okay, so let's start by discussing debugging learning algorithms. What happens all the time is: you have an idea for a machine learning application, you implement something, and then it doesn't work as well as you hoped, and the key question is what you do next. Whenever I work on a machine learning algorithm, that's actually most of my workflow: we usually have something implemented that's just not working that well, and your ability to decide what to do next has a huge impact on your efficiency. [00:05:13] When I was an undergrad at Carnegie Mellon University, I had a friend who would debug their code like this: they'd write a piece of code, and, as always when we write code initially, there were a bunch of syntax errors, and their debugging strategy was to delete every single line of code that generated a syntax error, because that was a good way to get rid of them. That wasn't a good strategy. In machine learning as well, there are good and less good debugging strategies.
[00:05:42] well they're good and less good debugging strategies right um so let's [00:05:45] debugging strategies right um so let's not so motivating example uh let's say [00:05:48] not so motivating example uh let's say building an anti spam classifier and [00:05:51] building an anti spam classifier and let's say you've carefully chosen a [00:05:54] let's say you've carefully chosen a small set of hundred words to use as [00:05:55] small set of hundred words to use as features so instead of all using you [00:05:57] features so instead of all using you know ten thousand or fifty thousand [00:05:58] know ten thousand or fifty thousand words you've chosen a hundred words that [00:06:01] words you've chosen a hundred words that you think could be most relevant to [00:06:04] you think could be most relevant to anti-spam and let's say you start off [00:06:08] anti-spam and let's say you start off implementing logistic recognization I [00:06:11] implementing logistic recognization I think when talked about this is also you [00:06:13] think when talked about this is also you know there's a frequencies in Beijing in [00:06:14] know there's a frequencies in Beijing in school but you can think of [00:06:15] school but you can think of basing logistic regression where you [00:06:18] basing logistic regression where you have the maximum likelihood term on the [00:06:21] have the maximum likelihood term on the left and then that second term is the [00:06:23] left and then that second term is the regularization term right so that's so [00:06:26] regularization term right so that's so that's Bayesian logistic regression if [00:06:29] that's Bayesian logistic regression if you are Bayesian or which is regression [00:06:31] you are Bayesian or which is regression with regularization if you're you know [00:06:34] with regularization if you're you know using frequency statistics and let's say [00:06:37] using frequency statistics and let's say that they're just regression with 
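The objective just described, a maximum-likelihood term plus a regularization term, can be sketched numerically. This is a minimal illustration, not code from the lecture; all the names (`sigmoid`, `reg_log_likelihood`, `lam`) are my own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reg_log_likelihood(theta, X, y, lam):
    """L2-regularized (Bayesian) logistic regression objective, as a sketch:
    sum_i log p(y_i | x_i; theta)  -  lam * ||theta||^2,
    with labels y in {0, 1}. Maximizing it trades off fitting the data
    against keeping the parameters small."""
    p = sigmoid(X @ theta)
    log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return log_lik - lam * np.dot(theta, theta)
```

Note that at `theta = 0` the classifier predicts 0.5 for everything, so the objective is `m * log(1/2)` regardless of `lam`, since the penalty vanishes at zero.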
[00:06:39] And let's say that logistic regression with regularization, or Bayesian logistic regression, gets twenty percent test error, which is unacceptably high: you're making one in five mistakes on your spam filter. So what do you do next? Um, now, when you implement an algorithm like this, what many teams will do is try improving the algorithm in different ways. So what many teams will do is say, oh yeah, I remember, you know, we like big data, more data always helps, so let's get some more data and hope that solves the problem. So some teams will say: let's get more training examples. And it's actually true, you know, more data pretty much never hurts, it almost always helps, but the key question is how much. Or you could try using a smaller set of features: maybe your features are only somewhat relevant, so let's get rid of some features. Or you could try having a larger set of features: if the current features seem too small or insufficient, add more features. Or you might want other designs of features: you know, instead of just using features from the email body, you could use features from the email header. The email header has not just the from, to, and subject, but also routing information about the set of servers on the internet that the email took to get to you. Or you could try running gradient descent for more iterations; that, you know, never hurts. Or instead of gradient descent, let's switch to Newton's method. Or let's try a different value for lambda. Or let's say, you know, forget about Bayesian logistic regression or logistic regression with regularization, let's use a totally different algorithm, an SVM or something, right.

[00:08:27] So what happens in a lot of teams is that someone will pick one of these ideas kind of at random. It depends on, you know, what they happened to read the night before, or their experience on the last project. And sometimes your project leader will pick one of these and just say, let's try that, and you spend a few days or a few weeks trying that, and it may or may not be the best thing. So I think that in my teams' machine learning workflow: first, if you actually sit down, you and a few others, and brainstorm a list of the things you could try, you're actually already ahead of a lot of teams, because a lot of teams will kind of just go by gut feeling, right, or the most opinionated person will pick one of these things at random and do that. But if you brainstorm a list of things and then try to evaluate the different options, you're already ahead of many teams.
[00:09:27] And, you know, I think that unless you analyze these different options, it's hard to know which of these is actually the best option, right. So, um, the most common diagnostic I end up using in developing learning algorithms is a bias versus variance diagnostic. And I think we've talked about bias and variance already: if a classifier is highly biased, then it tends to underfit the data. So high bias is... well, you guys remember this, right? If you have a data set like this, a highly biased classifier may be much too simple, a high variance classifier may be much too complex, and something in between, you know, trades off bias and variance appropriately, right. So that's bias and variance. And so it turns out that one of the most common diagnostics, used in pretty much every single machine learning project, is a bias versus variance diagnostic, to understand how much of your learning algorithm's problem comes from bias and how much of it comes from variance.

[00:10:42] And, you know, I've had former PhD students that learned about bias and variance while doing their PhD, and then sometimes, even a couple of years after they graduated from Stanford and worked, you know, on more problems, they'd actually tell me that their understanding of bias and variance has continued to deepen, right, for many years. So it's one of those concepts that, if you can systematically apply it, makes you much more efficient, and this is really maybe the single most useful tool: understanding bias and variance for debugging learning algorithms. And so what I'm going to describe is a workflow where you would run some diagnostics to figure out what the problem is, and then try to fix whatever the problem is.

[00:11:28] And so, just to summarize this example: you know, your test error turns out to be high, and you suspect the problem is either high variance or high bias. And so it turns out that there's a diagnostic that lets you look at your algorithm's performance and try to figure out how much of the problem is variance and how much of the problem is bias. Oh, and I'm going to say test error, but while you're developing, you should really be doing this with a dev set, or development set, rather than a test set, right. But let me explain this diagnostic in greater detail.

[00:12:05] So it turns out that if you have a classifier with very high variance, then, looking at the performance on the test set, or actually, the better practice is to use the holdout cross-validation set, that is, the development set, you'll see that your classifier has a much lower error on the training set than on the development set. But in contrast, if you have high bias, then the training error and the dev set error will both be high.

[00:12:41] So let me illustrate this with a picture. This is a learning curve, and what that means is: on the horizontal axis, you're going to vary the number of training examples, right. And when I talked about bias and variance, I had a plot where the horizontal axis was the degree of the polynomial: first order, second order, third order, fourth order polynomial. In this plot, the horizontal axis is different: it's the number of training examples. And so it turns out that when you train a learning algorithm, you know, the more data you have, usually the better your development set error, the better your test set error; the test error usually goes down when you increase the number of training examples.
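The rule of thumb for reading bias versus variance off the two errors (training error much lower than dev error: variance; both high: bias) can be written down as a tiny helper. This is my own sketch, not something from the lecture, and the `gap_tol` threshold of 0.02 is an arbitrary illustrative choice:

```python
def diagnose(train_err, dev_err, target_err, gap_tol=0.02):
    """Rough bias/variance readout from the errors being compared.
    High bias: not even the training error reaches the target.
    High variance: a large gap between training and dev error."""
    problems = []
    if train_err > target_err:
        problems.append("high bias")       # not fitting even the training set
    if dev_err - train_err > gap_tol:
        problems.append("high variance")   # big train/dev gap
    return problems if problems else ["looks ok"]
```

For the spam example, `diagnose(0.02, 0.20, 0.05)` would report a variance problem, while `diagnose(0.15, 0.16, 0.05)` would report a bias problem.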
[00:13:24] The other thing, um... let's say that you're hoping to achieve a certain level of desired performance. You know, for business reasons, you'd like your spam classifier to achieve a certain desired level of performance, and sometimes the desired level of performance is to do about as well as a human can; that's a common business objective, depending on your application, but sometimes it could be different, right. So your product manager, you know, tells you, or if you're leading the product, you decide, that you need to hit a certain target level of performance in order to have a very useful spam filter.

[00:14:02] So the other plot to add to this, which will help you analyze bias versus variance, is a plot of the training error. Now, one property of the training error is that it increases as the training set size increases. Because if you have only one example, right, let's say you're building a spam classifier and you have only one training example, then any algorithm, you know, can fit one training example perfectly. And so if your training set size is very small, the training set error is usually zero, right, because if you have five training examples, probably you can fit all five examples perfectly, and it's only if you have a bigger training set that it becomes harder for the learning algorithm to fit your training data that well. Well, in the linear regression case: if you have one example, yeah, you can fit a straight line to the data; if you have two examples, you can fit pretty much any model to the data and have zero training error. It's only if you have a very large training set that a classifier like logistic regression or linear regression may have a harder time fitting all of your training examples.
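The shape of these two curves is easy to reproduce on a toy problem. Below is a sketch, entirely illustrative and not from the lecture: a straight-line fit to noisy linear data, with training and dev error tabulated as the training set grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(m):
    # toy 1-D problem, y = 2x + noise, as a stand-in for a real task
    x = rng.uniform(-1.0, 1.0, m)
    return x, 2.0 * x + 0.3 * rng.standard_normal(m)

x_dev, y_dev = make_data(2000)  # fixed holdout / development set

def learning_curve(sizes, degree=1):
    """For each training set size m, fit a polynomial of the given degree
    and record (m, training error, dev error) as mean squared errors."""
    rows = []
    for m in sizes:
        x_tr, y_tr = make_data(m)
        coef = np.polyfit(x_tr, y_tr, degree)
        tr = np.mean((np.polyval(coef, x_tr) - y_tr) ** 2)
        dv = np.mean((np.polyval(coef, x_dev) - y_dev) ** 2)
        rows.append((m, tr, dv))
    return rows
```

With two examples, the line passes through both points and the training error is essentially zero; as m grows, training error climbs toward the noise level while dev error falls toward it, which is exactly the pair of curves drawn on the board.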
[00:15:08] that's why training error or average training error averaged over your [00:15:09] training error averaged over your training set generally increases as you [00:15:13] training set generally increases as you increase the training set size so um now [00:15:18] increase the training set size so um now there are two characteristics of this [00:15:21] there are two characteristics of this plot that suggest that if you plot the [00:15:24] plot that suggest that if you plot the learning curves if you see this dis [00:15:27] learning curves if you see this dis pattern to suggest that theorem has a [00:15:30] pattern to suggest that theorem has a large bias problem right and the two [00:15:33] large bias problem right and the two properties written in the bottom one the [00:15:35] properties written in the bottom one the weaker signal the one that's harder to [00:15:37] weaker signal the one that's harder to rely on is that the development set [00:15:40] rely on is that the development set error or the test set error is still [00:15:41] error or the test set error is still decreasing as you increase the training [00:15:43] decreasing as you increase the training set size so the green curve is still you [00:15:45] set size so the green curve is still you know still looks like it's going down [00:15:46] know still looks like it's going down and so this suggests that if you [00:15:49] and so this suggests that if you increase the training set size and [00:15:50] increase the training set size and extrapolate further to the right that [00:15:52] extrapolate further to the right that the curve would keep on going down this [00:15:55] the curve would keep on going down this turns out to be a weaker signal because [00:15:57] turns out to be a weaker signal because sometimes we look at the curve like that [00:15:59] sometimes we look at the curve like that is actually quite hard to tell you have [00:16:02] is actually quite hard to tell you have to extrapolate to the 
right so if you [00:16:05] to extrapolate to the right so if you double the training set size how much [00:16:07] double the training set size how much further would the green curve go down [00:16:08] further would the green curve go down then she kind of hard to tell so I find [00:16:10] then she kind of hard to tell so I find this a useful signal but sometimes it's [00:16:12] this a useful signal but sometimes it's been hard to judge you know exactly [00:16:14] been hard to judge you know exactly where the curve will go of you [00:16:16] where the curve will go of you extrapolate to the right um the stronger [00:16:19] extrapolate to the right um the stronger signal is actually the second one the [00:16:21] signal is actually the second one the fact that there's a huge gap between [00:16:22] fact that there's a huge gap between your training error and your test set [00:16:24] your training error and your test set error or your training or your jeff's [00:16:26] error or your training or your jeff's that there would better thing to look at [00:16:27] that there would better thing to look at is actually a stronger signal that um [00:16:30] is actually a stronger signal that um this particular learning algorithm has [00:16:32] this particular learning algorithm has has high variance right because as you [00:16:37] has high variance right because as you increase the training set size you find [00:16:40] increase the training set size you find that the gap between training test error [00:16:44] that the gap between training test error usually closes usually reduces and so [00:16:47] usually closes usually reduces and so there's no a lot of room for making your [00:16:52] there's no a lot of room for making your test set error become closer to your [00:16:54] test set error become closer to your training [00:16:55] training and so that's if you see a learning [00:16:57] and so that's if you see a learning curve like there's a strong side that um [00:16:59] curve like 
[00:16:57] And so if you see a learning curve like this, that's a strong sign that you have a variance problem. Okay, now let's look at what the learning curve will look like if you have a bias problem. So this is a typical learning curve for high bias: that's your dev set error, your development set error from holdout cross-validation, or, say, your test error; you're hoping to hit a level of performance like that; and your training error looks like that. And so one sign that you have a high bias problem is that the algorithm is not doing that well even on the training set, right. Even on the training set, you know, you're not achieving your desired level of performance, and it's like... imagine looking at it from the outside: this algorithm has seen these examples, and even on the examples it has seen, it's not doing as well as you were hoping. So clearly the algorithm is not fitting the data well enough, and this is a sign that you have a high bias problem: maybe the features you're learning from, or the model, are too simple.

[00:18:06] And the other signal is that there's a very small gap between the training and test error, right. And you can imagine, when you see a plot like this, no matter how much more data you get, go ahead and extrapolate to the right as far as you want: no matter how much more data you get, no matter how far you extrapolate to the right of this plot, the blue curve, the training error, is never going to come back down to hit the desired level of performance. And because the test set error is, you know, generally higher than your training set error, no matter how much more data you have, no matter how far you extrapolate to the right, the error is never going to come down to your desired level of performance.
[00:18:50] So if you get a training error and a test error curve that look like this, you kind of know that, while getting more training data may help, right, the green curve could come down a little bit if you get more training data, the act of getting more training data by itself will never get you to where you want to go.

[00:19:14] Okay, so let's work through this example. For each of the first four bullets here, each of the first four ideas fixes either a high variance or a high bias problem, right. So let's go through them, and let me ask, for the first one: do you think it helps you fix high bias or high variance? [Student response] Cool, all right, high variance, right. And will anyone say why? [Student response] Yes, right. I guess if you're fitting a very high order polynomial that wiggles like this, then having more data will make it, if anything, at least oscillate less crazily, even if you fit a high order polynomial. And if you look at the high variance curve... wow, it's not advancing, my slides are stuck for some reason.

[00:20:44] Right, so this is the high variance plot, and if you have a learning algorithm with high variance, then hopefully, you know, if you extrapolate to the right, there is some hope that the green curve will keep on coming down. So getting more training data, if you have high variance, which is if you're in this situation, looks like it could help, so this is worth trying. Can't guarantee it will work, but it's worth trying.

[00:21:21] Oh, I see, yes, sorry. [Student question] This is good. So let's see: the curves will look like this assuming that your training data is i.i.d., right; the training, dev, and test sets are all drawn from the same distribution. There is learning theory that suggests that in most cases the learning curve should decay as one over the square root of m; that's the rate at which it should decay, until it asymptotes to some Bayes error. That's what the learning theory says, if that makes sense. And sometimes a learning algorithm's error doesn't go to zero, right, because sometimes the data is just ambiguous. I guess, yeah, my PhD students and I, we do a lot of work on healthcare, and sometimes you look at an x-ray and it's just blurry, and you try to make a diagnosis, right, and you just can't. Or I have students working on predicting patient mortality, what's the chance of someone dying in the next year or so, and sometimes, looking at a patient's medical record, you just can't tell, right, whether they will pass away in the next year or so. Or you look at an x-ray and you just can't tell whether there is a tumor or not, because it's just blurry. So a learning algorithm's error doesn't always decay to zero.
[00:22:37] Says that as m increases, the error should decay roughly at a rate of 1 over the square root of m, down to that baseline error, which is called the Bayes error, which is the best that you could possibly hope anything could do, given how blurry the images are, given how ambiguous the data is. Right, all right. [00:22:55] Okay, so: trying a smaller set of features, that fixes a high-variance problem, right. And one concrete example would be: if you have this data set and you're fitting, you know, a tenth-order polynomial, and the curve oscillates all over the place, that's high variance. You could say, well, maybe I don't need a tenth-order polynomial. [00:23:28] Maybe you say: maybe I don't need my features to be all of these things, up to the tenth power. Maybe, if this is too high variance, get rid of a lot of the features.
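A quick sketch of that fix (my own toy numbers, not the lecture's): data generated from a noisy quadratic, fit with too few, the right number, and far too many polynomial features. The degree-1 fit is high bias; the degree-10 fit drives training error down but test error up, which is exactly the high-variance pattern that shrinking the feature set repairs.

```python
import numpy as np

# Bias/variance via feature-set size (illustrative numbers, not from lecture).
rng = np.random.default_rng(1)

def make(n):
    x = rng.uniform(-1, 1, n)
    y = 1.0 - 2.0 * x + 3.0 * x**2 + rng.normal(0, 0.3, n)  # truth: quadratic
    return x, y

x_tr, y_tr = make(20)        # small training set
x_te, y_te = make(2000)      # held-out set

def fit_mse(degree):
    coef = np.polyfit(x_tr, y_tr, degree)       # least-squares polynomial fit
    tr = float(np.mean((np.polyval(coef, x_tr) - y_tr) ** 2))
    te = float(np.mean((np.polyval(coef, x_te) - y_te) ** 2))
    return tr, te

train1, test1 = fit_mse(1)     # too few features: high bias (bad on both sets)
train2, test2 = fit_mse(2)     # right feature set
train10, test10 = fit_mse(10)  # too many features: high variance (train << test)
print((train1, test1), (train2, test2), (train10, test10))
```

Going the other way, from degree 1 to degree 2, is the "add more features to fix high bias" move discussed next.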
Just use a much smaller number of features, right; so that fixes high variance. And then, conversely, if you use a larger set of features, that fixes high bias, right? [00:23:52] Cool. Right, so if you're fitting a straight line to the data and it's not doing that well, then maybe actually add a quadratic term, just add more features, right; so that fixes high bias. And adding email header features: yes, generally I would try this if I were trying to reduce bias, right. [00:24:15] And so, in the workflow of how you develop a learning algorithm, here's what I would recommend. One of the things about building learning algorithms is that, for a new application problem, it's difficult to know in advance whether you're going to run into a high-bias or a high-variance problem. It is actually very difficult to know in advance what's going to go wrong with your learning algorithm. And so the advice I tend to give is: if
you work on a new application, implement a quick-and-dirty learning algorithm; have a quick-and-dirty implementation of something so you can run your learning algorithm, you know, sort of logistic regression. Start with something simple, then run this bias-variance type of analysis to see what went wrong, and then use that to decide what to do next, whether you go to more complex algorithms or whatever you try next. [00:25:21] The one exception to this is if you're working on a domain in which you have a lot of experience, right? So, for example, you know, I've done a lot of work on speech recognition, so because I've done that work I kind of have a sense of how much data is needed; for a new application, then, I might just build something more complicated from the get-go. Or if you're working on, say, face recognition, and
because you've read all the research papers, you have a sense of how much data is needed, then maybe it's worth trying something more complicated from the start, because you're building on a body of knowledge. But if you're working on a brand-new application, one that maybe, you know, no one in the published academic literature has worked on, or where you don't totally trust the published results to be representative of your problem, then I would usually recommend that you build a quick-and-dirty implementation, look at the bias and variance of the algorithm, and then use that to better decide what to try next. [00:26:17] So I think bias and variance is really, like, the single most powerful tool I know of, you know, for analyzing the performance of learning algorithms; I do this pretty much in every single machine learning application. There's one other pattern that I see quite often, which addresses the
second set of diagnostics, which is: is your optimization algorithm working? [00:26:47] So let me explain this with a motivating example. It turns out that when you implement a learning algorithm, you often have a few guesses for what's wrong, and if you can systematically test whether a hypothesis is right before you spend a lot of work trying to fix it, then you can be much more efficient. So let me explain that with a concrete example, so you understand those words I just said; maybe they're a little bit abstract. [00:27:15] Which is: let's say that, you know, you tune your logistic regression algorithm for a while, and let's say logistic regression gets two percent error on spam email and two percent error on non-spam, right? And it's okay to have two percent error on spam email, maybe, right; you know, so you have to read a little bit of spam email, it's like,
you know, that's okay. But two percent error on non-spam is just not really acceptable, because you're losing one in fifty important emails. [00:27:44] And let's say that, you know, your teammate also trains an SVM, and they find that an SVM using a linear kernel gets ten percent error on spam but 0.01 percent error on non-spam, right? Maybe not great, but for this purpose of illustration let's say this is acceptable. [00:28:05] But it turns out logistic regression is more computationally efficient, and it may be easier to update, right: as you get more examples, you can run a few more iterations of gradient descent. And let's say you want to ship a logistic regression implementation rather than an SVM implementation. So what do you do? [00:28:27] It turns out that one common question you have when training your learning algorithm is, you often wonder: is your optimization algorithm converging?
Right, so, you know, is gradient ascent converging? And so one thing you might do is draw a plot of the training optimization objective, J of theta, or the log likelihood, or whatever you're maximizing, versus the number of iterations. [00:28:54] And often the plot will look like that, right: the curve is kind of going up, but not that fast. And if you train it twice as long, or even ten times as long, will that help? Right, and again, training the algorithm for more iterations, you know, pretty much never hurts; if you regularize the algorithm properly, training the algorithm longer almost always helps, right, pretty much never hurts. [00:29:22] But is the right thing to do to go and burn another 48 hours of, you know, CPU or GPU cycles just to train this thing longer, in the hope it works better? Maybe, maybe not.
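One way to make that judgment less of a guess (a minimal sketch of my own, not CS229 starter code): record J(theta) at every iteration of gradient ascent, and ask whether the objective's recent gain still clears some threshold before burning more compute.

```python
import numpy as np

# Monitor the training objective J(theta) (here: logistic log-likelihood)
# across gradient-ascent iterations on small synthetic data.
rng = np.random.default_rng(2)
X = np.c_[np.ones(100), rng.normal(size=(100, 2))]   # intercept + 2 features
true_theta = np.array([0.5, 2.0, -1.0])
y = (rng.random(100) < 1.0 / (1.0 + np.exp(-X @ true_theta))).astype(float)

def log_likelihood(theta):
    z = X @ theta
    return float(np.sum(y * z - np.log1p(np.exp(z))))

theta = np.zeros(3)
history = []
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    theta += 0.01 * X.T @ (y - p)                    # batch gradient-ascent step
    history.append(log_likelihood(theta))

def still_improving(hist, window=50, tol=1e-4):
    # Has J gained more than tol * |J| over the last `window` iterations?
    return (hist[-1] - hist[-window]) > tol * abs(hist[-1])

print("train longer?", still_improving(history))
```

The window size and tolerance here are arbitrary placeholders; the point is only that "should I train longer?" becomes a measurement instead of a hunch.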
So is there a systematic way to tell, a better way to tell, whether you should invest a lot more time in running the optimization algorithm? Sometimes it's just hard to tell, right. [00:29:50] Now, the other question that you sometimes wonder about: a lot of the iteration of developing learning algorithms is looking at what the learning algorithm is doing and asking yourself, what are my guesses for what could be wrong? And maybe one of your guesses is: well, maybe I'm optimizing the wrong cost function, right? So here's what I mean. [00:30:11] What you care about is this weighted accuracy criterion, you know, sort of a sum over your dev set or test set, with weights on different examples, of whether it gets each one right, where the weights are higher for non-spam than spam, because you really want to make sure you label non-spam email correctly, right? So maybe that's the weighted accuracy criterion you care about.
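That criterion can be written as a(theta) = sum_i w_i * 1{prediction_i == y_i} / sum_i w_i. A small sketch, where the specific weights (10x on non-spam) are my own illustrative choice, not a number from the lecture:

```python
import numpy as np

# Weighted accuracy a(theta): heavier weights on non-spam examples, because
# mislabeling a real email costs much more than letting one spam through.
def weighted_accuracy(y_true, y_pred, w):
    y_true, y_pred, w = map(np.asarray, (y_true, y_pred, w))
    return float(np.sum(w * (y_true == y_pred)) / np.sum(w))

y_true = np.array([1, 1, 0, 0, 0])       # 1 = spam, 0 = non-spam
y_pred = np.array([1, 0, 0, 0, 1])       # one missed spam, one mislabeled ham
w = np.where(y_true == 0, 10.0, 1.0)     # non-spam errors weighted 10x

print(weighted_accuracy(y_true, y_pred, w))   # 21/32 = 0.65625
```

Note how the single mislabeled non-spam email dominates the score: plain accuracy here would be 3/5, but the weighted criterion punishes the lost real email ten times harder.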
But for logistic regression, you're maximizing this cost function, right: the log likelihood minus this regularization term. So you're optimizing J of theta, when what you actually care about is a of theta. So maybe you're optimizing the wrong cost function. [00:30:57] And one way to change the cost function would be to fiddle with the parameter lambda, right; that's one way to change the definition of J of theta. Another way to change J of theta is to just totally change the cost function you're maximizing, like change it to the SVM objective, right; and then part of that also means choosing the appropriate value for C, okay. [00:31:19] And so there's a second diagnostic which I end up using, which should help you tell: is the problem your optimization algorithm, in other words, is gradient ascent not converging, or is the problem that you're just optimizing the wrong function?
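In symbols, the objective being maximized is roughly J(theta) = sum_i log p(y_i | x_i; theta) - lambda * ||theta||^2, as opposed to the weighted accuracy a(theta) you actually care about. A minimal sketch of J for logistic regression (the lambda value is an arbitrary placeholder of mine):

```python
import numpy as np

# Regularized log-likelihood J(theta) for logistic regression:
#   J = sum_i [ y_i * z_i - log(1 + e^{z_i}) ] - lambda * ||theta||^2,
# where z_i = theta^T x_i. The lambda value is an arbitrary choice.
def J(theta, X, y, lam=0.1):
    z = X @ theta
    log_lik = np.sum(y * z - np.log1p(np.exp(z)))
    return float(log_lik - lam * np.dot(theta, theta))

X = np.array([[1.0, 0.0], [1.0, 2.0], [1.0, -1.0]])   # tiny toy design matrix
y = np.array([0.0, 1.0, 0.0])
print(J(np.zeros(2), X, y))   # at theta = 0 this is 3 * log(1/2) ~= -2.079
```

Fiddling with `lam` changes the definition of J, which is the first of the two ways of changing the cost function mentioned above.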
Right. [00:31:41] And we'll see two examples of this; here's the first example, okay. And here's the diagnostic that can help you figure that out. So just to summarize this scenario, this running example: we have the SVM outperforming logistic regression, but you want to deploy logistic regression. [00:32:01] Let theta-SVM be the parameters learned by an SVM; and instead of writing the SVM parameters as w and b, I'm just going to write the linear SVM (it's a linear kernel) using the logistic regression parameterization, right, so you have a linear set of parameters. And let theta-BLR be the parameters learned by Bayesian logistic regression, which is just, yeah, regularized logistic regression, basically just logistic regression. [00:32:26] So you care about weighted accuracy, and the SVM outperforms Bayesian logistic regression, okay? So this is a one-slide summary of where we are in this
example. [00:32:41] So how can you tell if the problem is your optimization algorithm, meaning that you need to run gradient descent longer to actually maximize J of theta? And, sorry, right: this J of theta is what BLR tries to maximize. So there are two possible hypotheses we want to distinguish between. [00:33:02] One is that the learning algorithm is not actually finding the value of theta that maximizes J of theta, that for some reason gradient ascent is not converging; that would be a problem with the optimization algorithm. For the problem to be with the optimization algorithm means that if only we could have an algorithm that maximizes J of theta, we would do great, but for some reason gradient descent isn't doing well. That's one hypothesis. [00:33:32] The second hypothesis is that J of theta is
just the wrong function to be optimizing; it's just a bad choice of cost function. J of theta is too different from a of theta, so that maximizing J of theta doesn't give you, you know, a classifier that does well on a of theta, which is what you actually care about, okay? [00:33:51] And this is key to the problem setup, so I want to make sure people understand this: raise your hand if this makes sense. Most people, okay, cool, good. Any questions about this problem setup? [00:34:09] Oh, thank you: why not maximize a of theta directly? Because a of theta is non-differentiable; it has, you know, this indicator function in it, so we actually can't. It turns out maximizing a of theta explicitly is NP-hard; we just don't have great algorithms for trying to do that. [00:34:28] Okay, so it turns out there's a diagnostic you can use to distinguish between these two different problems, and here's the diagnostic, which
check the cost function that logistic [00:34:45] is check the cost function that logistic regression is trying to maximize so J [00:34:48] regression is trying to maximize so J and compute that cost function on the [00:34:52] and compute that cost function on the parameters found by the SVM and compute [00:34:56] parameters found by the SVM and compute that cost function on the parameters [00:34:58] that cost function on the parameters found by based on logistic regression [00:35:00] found by based on logistic regression and just see which which value is higher [00:35:02] and just see which which value is higher okay so there are two cases either this [00:35:09] okay so there are two cases either this is greater or this is less than equal to [00:35:12] is greater or this is less than equal to right there just two possible cases so [00:35:15] right there just two possible cases so what I'm going to do is go over case one [00:35:17] what I'm going to do is go over case one and case two corresponding to this [00:35:19] and case two corresponding to this greater than or it's less than equal [00:35:21] greater than or it's less than equal then and let's let's see what that [00:35:23] then and let's let's see what that implies so on the next slide I'm going [00:35:25] implies so on the next slide I'm going to copy over this equation right that's [00:35:28] to copy over this equation right that's that's just a fact that the SVM does [00:35:30] that's just a fact that the SVM does better then based on logistic regression [00:35:32] better then based on logistic regression on our problem so on the next I'm going [00:35:34] on our problem so on the next I'm going to copy over this first equation and [00:35:36] to copy over this first equation and then we're going to consider [00:35:38] then we're going to consider these two cases separately so great - [00:35:40] these two cases separately so great - that would be case one and less than [00:35:42] that would be case one and 
less than or equal to will be case two, okay? So let me copy over these two equations on the next slide. [00:35:46] Right, so that's the first equation that I just copied over here, and this is the greater-than case, case one, okay? So let's see how to interpret this. In case one, J of theta-SVM is greater than J of theta-BLR, meaning that whatever the SVM was doing, it found a value for theta, which we've written as theta-SVM, and theta-SVM has a higher value on the cost function J than theta-BLR. [00:36:31] But Bayesian logistic regression was trying to maximize J of theta, right? I mean, Bayesian logistic regression is just using gradient descent to try to maximize J of theta. And so, under case one, this shows that whatever the SVM was doing, whatever your buddy implementing the SVM did, they managed to find a value for theta that actually achieves a higher value of J of theta than your implementation of Bayesian
logistic regression. [00:37:00] So this means that theta-BLR fails to maximize the cost function J, and the problem is with the optimization algorithm, okay? So that's case one. [00:37:13] Case two: again, I'm just copying over the first equation, right, because this is just part of our problem setup; but in case two the second line now has a less-than-or-equal sign, okay? So let's see how to interpret this. If we look at the second equation, right, the less-than-or-equal-to sign, it looks like,
excuse me, it looks like Bayesian logistic regression did a better job than the SVM of maximizing J of theta, right? So, you know, you told Bayesian logistic regression to maximize J of theta, and by golly, it found it: it found the value of theta that achieves a higher value of J of theta than whatever your buddy did using an SVM implementation. So it actually did a good job of trying to find the value of theta that drives up J of theta as much as possible. [00:38:17] But if you look at these two equations in combination, what we have is that the SVM does worse on the cost function J, but it does better on the thing you actually care about, a of theta. So what these two equations in combination tell you is that having the best, the highest, value for J of theta does not correspond to having the best possible value for a of theta. [00:38:49] So that tells you that maximizing J of theta doesn't mean you're doing a good job on a of theta.
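The whole diagnostic can be sketched in a few lines (the parameter vectors below are hypothetical stand-ins for theta-SVM and theta-BLR, and J is the regularized log-likelihood from earlier): given that the SVM already wins on a(theta), comparing J at the two solutions picks between the two hypotheses.

```python
import numpy as np

# Given: a(theta_SVM) > a(theta_BLR). The diagnostic compares J at both.
def J(theta, X, y, lam=0.1):
    z = X @ theta
    return float(np.sum(y * z - np.log1p(np.exp(z))) - lam * np.dot(theta, theta))

def diagnose(j_svm, j_blr):
    if j_svm > j_blr:
        # Case 1: the SVM found a higher J than BLR's own optimizer did,
        # so gradient ascent is failing -> fix the optimization ALGORITHM.
        return "optimization algorithm"
    # Case 2: BLR maximizes J better yet still loses on a(theta), so J is
    # the wrong thing to maximize -> fix the optimization OBJECTIVE.
    return "optimization objective"

# Hypothetical numbers only, to exercise the comparison:
rng = np.random.default_rng(3)
X = np.c_[np.ones(50), rng.normal(size=(50, 2))]
y = (rng.random(50) < 0.5).astype(float)
theta_svm = np.array([0.1, 0.8, -0.5])   # stand-in for the SVM's parameters
theta_blr = np.array([0.0, 0.3, -0.2])   # stand-in for BLR's parameters
print(diagnose(J(theta_svm, X, y), J(theta_blr, X, y)))
```

The cheapness is the point: two evaluations of J settle which of the two expensive fixes is worth attempting.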
And therefore maybe J of theta is not such a good thing to be maximizing, because maximizing it doesn't actually give you the result you really care about. [00:39:02] So under case two, you can be convinced that J of theta is just not the best function to be maximizing, because getting a high value of J of theta doesn't get you a high value of what you actually care about; and so the problem is with the objective function of the maximization problem, and maybe we should just find a different function to maximize, okay? So, um, any questions about this? [00:39:54] Yeah, let me come back to that; yeah, it's a complicated answer. All right, actually, let's do this first. So, all right: for these four bullets, does it fix the optimization algorithm, or does it fix the optimization objective? First one: does it fix the optimization algorithm, or
Cool. Second one... oh, I don't know what's wrong with this thing, it's so strange. Okay, right: does it fix the optimization algorithm, or fix the optimization objective? [00:40:33] The algorithm, right. So Newton's method still looks at the same cost function J of theta, but in some cases it just optimizes it much more efficiently. [00:40:42] Um, this is a funny one. Usually you fiddle with lambda to trade off bias and variance, right? This is one way to change the optimization objective, although usually you change lambda just to trade off bias and variance rather than for this, right? And then trying to use an SVM would be one way to totally change the optimization objective. Okay. [00:41:07] So, to answer the question from just now: when you find you have the wrong optimization objective, there isn't always an obvious thing to do. Sometimes you have to brainstorm a few ideas; there isn't always one obvious thing to try.
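To make the Newton's-method bullet concrete, here is a toy sketch (my own, not from the lecture): gradient ascent and Newton's method below maximize the same one-parameter logistic-regression log-likelihood J(theta); the data, step size, and iteration counts are all invented. Newton's method fixes the optimization algorithm, not the objective, in the sense that it reaches the same maximizer in far fewer steps.

```python
import numpy as np

# Toy one-parameter logistic regression. Both methods maximize the SAME
# objective J(theta); Newton's method just gets there in fewer iterations.
# Data, learning rate, and iteration counts are invented for illustration.
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])  # deliberately not separable

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad(theta):
    # dJ/dtheta of the log-likelihood sum(y*log(h) + (1-y)*log(1-h))
    return float(np.sum((y - sigmoid(theta * x)) * x))

def hess(theta):
    # d2J/dtheta2; always negative here, so J is concave
    p = sigmoid(theta * x)
    return float(-np.sum(p * (1.0 - p) * x * x))

theta_ga = 0.0                 # gradient ascent: many small steps
for _ in range(1000):
    theta_ga += 0.1 * grad(theta_ga)

theta_nt = 0.0                 # Newton's method: a handful of steps
for _ in range(10):
    theta_nt -= grad(theta_nt) / hess(theta_nt)

print(theta_ga, theta_nt)      # both land on the same maximizer
```

Changing lambda in a regularized objective, or swapping in an SVM's hinge loss, would instead change what J is; here only the route to the optimum changes.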
But at least it tells you that that category of things, trying out different optimization objectives, is worth exploring, right? All right. [00:41:33] So let's go through a more complex example that'll, you know, incorporate some of these. (What's wrong... I despair of my laptop and wonder why life is so strange. This is what I can do.) [00:41:55] All right, let's go through a more complex example that will illustrate some of these concepts we've been going through, and just let you see another example of these things. Oh, and I find that, um, one thing I've learned as a teacher, you know: one of the ways for you to become good at this, right, is to go, you know, work in a good AI group for five years, right? Because when you work in a good AI group for several years, then you have seen, you know, ten projects, and that lets you gain that experience.
But it turns out it takes, I don't know, depending on what AI group you work in... if you work on a different project every year, then in five years I guess you've worked on five projects or something, I actually don't know, or maybe ten projects or something. [00:42:42] But one of the reasons, um, the way I try to explain this, to actually give you specific scenarios like this, is that, you know, my PhD students and I actually spent many years working with the Stanford autonomous helicopter, and I'm trying to distill the key lessons down for you, so that you don't need to work on a project for years to gain this experience, but to give you some approximation to this knowledge in maybe twenty minutes. [00:43:07] The twenty minutes won't give you the depth of three years of experience, but I can at least summarize the key lessons.
That way you can learn from experience that others took years to develop. Um, all right. So, uh, [00:43:21] this helicopter sits in my office, but if you go to my office and, you know, grab this helicopter, and we ask you to write a piece of code to make it fly by itself, to use a learning algorithm to make it fly by itself, how do you go about doing so? So it turns out a good way to make a helicopter fly by itself is to do the following. [00:43:45] Um, step one is build a computer simulator for the helicopter. So, you know, an actual simulator, right, like a video-game simulator of a helicopter. Um, the advantage of using, you know, say, a video-game simulator of a helicopter is that you can crash it a lot in simulation, you know, which is cheap, whereas crashing a helicopter in real life is slightly dangerous and also more expensive. [00:44:10] So, step one: build a simulator of the helicopter.
Step two: choose the cost function, and for today I'm just using a relatively simple cost function, which is squared error. So you want the helicopter to fly at the position x desired, and your helicopter instead, you know, wanders off to some other place x, so let's use the squared error to penalize it, right? [00:44:34] When we talk about reinforcement learning towards the end of this quarter, we'll go through this same example again using reinforcement learning terminology, so you understand it at a slightly deeper level; we'll go over this exact same example after you've learned about reinforcement learning, but today we'll just go over a slightly simplified, very simplified, version. [00:44:52] And so, step three, run a reinforcement learning algorithm, and what the reinforcement learning algorithm does is try to minimize that cost function J of theta, and so you learn some set of parameters, theta subscript RL, for controlling the helicopter.
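As a rough illustration of that squared-error cost (a hypothetical sketch; the function name, the 2-D toy trajectory, and the hover target below are all made up, not the course's actual code), J here is just the mean squared distance between where the helicopter is and where you wanted it:

```python
import numpy as np

# Hedged sketch of a squared-error trajectory cost: penalize the actual
# positions x_t for wandering away from the desired position x_desired.
def j_squared_error(trajectory, x_desired):
    """Mean squared distance from x_desired over a (T, d) trajectory."""
    diffs = trajectory - x_desired          # x_t - x_desired at each step
    return float(np.mean(np.sum(diffs ** 2, axis=1)))

# Tiny made-up example: a 3-step trajectory in 2-D drifting off a hover point
x_desired = np.array([0.0, 10.0])
traj = np.array([[0.0, 10.0],
                 [0.5, 10.0],
                 [1.0, 9.0]])
print(j_squared_error(traj, x_desired))     # -> 0.75
```

A reinforcement learning algorithm would then search for the parameters theta_RL whose controller makes this number small.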
Right, and when we talk about reinforcement learning, you know, you'll see all this redone with proper reinforcement learning notation, where J is a reward function, theta RL is a control policy, and so on, but don't worry about that for now. [00:45:24] Um, so let's say you do this, and the resulting controller, right, the way it flies the helicopter, gives much worse performance than your human pilot. You know, the helicopter wobbles all over the place and doesn't quite stay where you were hoping it would. So what do you do next, right? [00:45:46] Well, here are some options, corresponding to the three steps above. You could work on improving your simulator. It turns out, even today... you know, we've had helicopters for, what, I don't know, I think we've had commercial helicopters since around 1950 or so, so the technology has been around for many decades now.
But the airflow around a helicopter is very complicated, and even today there are actually some details of how the air flows around the helicopter that, you know, the aerodynamics textbooks, written by the aero-astro people, explicitly say our current answers cannot fully explain. So a helicopter is incredibly complicated, and there's almost unlimited headroom for building better and more accurate simulators of a helicopter. So maybe you want to do that. [00:46:32] Or maybe you think the cost function is messed up. You know, maybe squared error isn't the best metric, right? And it turns out, you know, the way a helicopter works, a helicopter has a tail rotor that blows wind to one side, right? Because the main rotor spins in one direction; if it only had a main rotor, then the body would spin in the opposite direction, a kind of equal-and-opposite reaction. [00:46:56] But anyway, right: the main rotor spins in one direction, and if it only had the main rotor, the rotor on top, and it just spun that, then the body of the helicopter would spin in the opposite direction.
So that's why you need a tail rotor, to blow air off to one side so the helicopter doesn't spin in the opposite direction. But because of that, it turns out a helicopter staying in place is actually tilted slightly to one side: because the tail rotor blows air in one direction, it's pushing you off to one side, so you have to tilt the helicopter in the opposite direction, so that the main rotor also blows air slightly to one side, the tail rotor blows air to the other side, and you actually stay in place, right? [00:47:31] So a helicopter hovering is actually asymmetric; left and right are not the same. So because of this complication, maybe squared error isn't the best error measure, because, you know, your optimal orientation is actually not zero, right? So maybe you should modify the cost function.
so so maybe you should multiply the [00:47:51] so so so maybe you should multiply the cost function or maybe you want to [00:47:54] cost function or maybe you want to modify the reinforcement learning [00:47:56] modify the reinforcement learning algorithm because you secretly suspect [00:47:58] algorithm because you secretly suspect that your algorithm is not doing a great [00:48:01] that your algorithm is not doing a great job of minimizing that cost function [00:48:04] job of minimizing that cost function great that is not actually finding the [00:48:07] great that is not actually finding the value of theta that absolutely minimizes [00:48:09] value of theta that absolutely minimizes J of theta so it turns out that each one [00:48:15] J of theta so it turns out that each one of these topics can easily be a PhD [00:48:18] of these topics can easily be a PhD thesis and you could definitely work for [00:48:20] thesis and you could definitely work for six years on anyone [00:48:21] six years on anyone these topics and the problem is you know [00:48:26] these topics and the problem is you know so actually I actually know someone that [00:48:29] so actually I actually know someone that wrote a PhD thesis on write improving [00:48:32] wrote a PhD thesis on write improving helicopter simulator right but the [00:48:35] helicopter simulator right but the problem is maybe a helicopter simulator [00:48:37] problem is maybe a helicopter simulator is good enough and you can spend six [00:48:39] is good enough and you can spend six years improving your helicopter [00:48:42] years improving your helicopter simulator but will that actually get you [00:48:44] simulator but will that actually get you there is and you can write and you can [00:48:45] there is and you can write and you can write a PhD season together PhD doing [00:48:47] write a PhD season together PhD doing that maybe but if you go is not just a [00:48:49] that maybe but if you go is not just a very PhD thesis and 
Um, so what I'd like to do is describe to you a set of diagnostics that allows you to use this sort of logical, step-by-step reasoning to debug which of these three things is what you should actually be spending time on, right? [00:49:17] So is it possible for us to come up with a debugging process to reason logically, so as to select one of these things to work on with conviction, and then be relatively confident that it is a useful thing to work on? All right, so here's how we're going to do it. [00:49:35] So just to summarize the scenario, right: the controller given by theta RL flies poorly. Right, so this is how I would reason through a learning algorithm. So suppose, suppose all of these things were true, again corresponding to the three steps on the previous slide.
Suppose the helicopter simulator is accurate; suppose, you know, the learning algorithm correctly minimizes the cost function; and suppose J of theta is a good cost function. Right, if all of these things were true, then the learned parameters should fly well on the actual helicopter, right? [00:50:20] But it doesn't fly well on the helicopter, so one of these three things is false, and our job is to figure out, to identify, at least one of these three statements, one, two, or three, that is false, because that lets you sink your teeth into something to work on, right? [00:50:46] And I think, to make an analogy to more conventional software debugging: you have a big, complicated program, and for some reason your program crashes, you know, the code goes down or whatever. If you can isolate this big, complicated program down to the one component that crashes, then you can focus your attention on that component.
That component, you know, crashes for some reason, and you try to find the bug there, right? And so instead of trying to look over a huge codebase, if you can do a binary search, or try to isolate the problem to a smaller part of your codebase, then you can focus your debugging efforts on that part of the codebase, try to figure out why it crashes, and then fix that first. [00:51:25] And after you fix that, it might still crash; then there might be a second problem to work on. But at least you know that trying to fix the first bug seems like a worthwhile thing to do, right? [00:51:37] So what we're going to do is come up with, sort of by design, a set of diagnostics to isolate the problem to one of these three components. Okay, so the first step is: let's look at how well the algorithm flies in simulation, right? So what I said just now was, you ran the algorithm, and it resulted in a set of parameters that doesn't do well on your actual helicopter.
So the first thing I would do is just check how well this thing even does in simulation, right? And there are two possible cases. [00:52:12] If it flies well in simulation but doesn't do well in real life, that means something's wrong with the simulator, right? And it means it's actually worth working on the simulator, because, you know, if it's already working well in the simulator, I mean, what else could you expect the learning algorithm to do? Right, you know, you told the reinforcement learning algorithm to go and fly well in the simulator, because it's just training in simulation; it's already doing well in the simulator, so there's not much to improve on there, or at least it's hard to improve on that. [00:52:44] But if you found that the learned controller flies well in just the simulator and not in real life, then that means the simulator isn't matching real life well.
one simulator but not in real life [00:52:51] just one simulator but not in real life then this means that the simulator isn't [00:52:55] then this means that the simulator isn't matching real life well and so dish that [00:52:58] matching real life well and so dish that does strong evidence there's strong [00:53:00] does strong evidence there's strong grounds for you to spend some time to [00:53:02] grounds for you to spend some time to improve your simulator yeah yeah right [00:53:12] improve your simulator yeah yeah right is that it just repeats another camera [00:53:14] is that it just repeats another camera is it is ever the case that it flies [00:53:16] is it is ever the case that it flies values away to about one roll life I [00:53:18] values away to about one roll life I wish that happen [00:53:23] very rarely I I think if that happens I [00:53:27] very rarely I I think if that happens I would I would still work on improving [00:53:28] would I would still work on improving the simulator so there's actually once [00:53:32] the simulator so there's actually once an era where that happens it turns out [00:53:33] an era where that happens it turns out that when we train this helicopter in [00:53:39] that when we train this helicopter in the simulator or really any robot [00:53:40] the simulator or really any robot simulator we often add a lava noise to [00:53:42] simulator we often add a lava noise to the simulator because one lessons of [00:53:44] the simulator because one lessons of learn is that if your simulator is noisy [00:53:46] learn is that if your simulator is noisy customizers are always wrong right I [00:53:48] customizers are always wrong right I mean any digital simulation is only an [00:53:50] mean any digital simulation is only an approximation in real world so we tend [00:53:51] approximation in real world so we tend to have a lot of noise so all of our [00:53:53] to have a lot of noise so all of our simulators because we think that the 
And so we tend to throw a lot of noise into our simulators. And so one case where that does happen is when we find we threw too much noise at it in simulation, and then that might be a sign we should dial back the noise a bit. Yeah. All right, cool. [00:54:26] Oh, so yeah, so this first diagnostic tells you whether you should work on improving the simulation; I think if there's a big mismatch between simulation performance and real-world performance, that's a good sign that, you know, you should improve the simulation. [00:54:42] Second, um, this is actually very similar to the diagnostic we used on the spam example, you know, the one based on logistic regression versus the SVM. So what we're going to do is measure this equation.
And this is, again, very similar to our previous equation, which is: take the cost function, similar to the previous example, take the cost function J that reinforcement learning is trying to minimize, right, J of theta was a squared error, right? So take the cost function that reinforcement learning was trying to minimize, and see if the human achieves a better squared error than the reinforcement learning algorithm. [00:55:30] And just to be clear, you know, this human flies better; so let's measure the human's performance on this squared-error cost function and see which one does better. [00:55:40] So there are two cases: that equation will be either less than, or greater than or equal to. So case one is J of theta human is less than J of theta RL; that would be this case, and that tells you that the problem is with the reinforcement learning algorithm.
Right, somehow the human achieves a lower squared error, and so the learning algorithm is not finding the best possible squared error; there is some other controller, as evidenced by whatever the human is doing, that actually achieves a lower cost, right? [00:56:26] So in this case, we think the learning algorithm, the reinforcement learning algorithm, is not doing a good job minimizing J, and we should work on the reinforcement learning algorithm. [00:56:37] The other case would be if the sign of the inequality is the other way around, right? Now in this case, you can infer that the problem is in the cost function, because what happens here is the human is flying better than your reinforcement learning algorithm, but the human is achieving what looks like a worse cost than your reinforcement learning algorithm. So what this tells you is that
minimizing J of theta does [00:57:06] you is that minimizing J of theta does not correspond to flying well right your [00:57:09] not correspond to flying well right your learning algorithm achieves a better [00:57:10] learning algorithm achieves a better value for J of theta you know J of theta [00:57:13] value for J of theta you know J of theta are out is actually smaller than one of [00:57:15] are out is actually smaller than one of the human is doing so the reinforcement [00:57:17] the human is doing so the reinforcement learning algorithm as far as it knows [00:57:19] learning algorithm as far as it knows this doing a great job because it's [00:57:21] this doing a great job because it's finding a value of theta where J of [00:57:23] finding a value of theta where J of theta is really really small but in this [00:57:25] theta is really really small but in this last case you know that finding such a [00:57:31] last case you know that finding such a small value of J of theta doesn't [00:57:33] small value of J of theta doesn't correspond to flying well off because a [00:57:35] correspond to flying well off because a human doesn't achieve such a good value [00:57:37] human doesn't achieve such a good value in the cost function but the helicopter [00:57:38] in the cost function but the helicopter actually just looks better was flying in [00:57:40] actually just looks better was flying in a more satisfactory way and that tells [00:57:43] a more satisfactory way and that tells you that the squared error cost function [00:57:45] you that the squared error cost function is not the right cost function for what [00:57:49] is not the right cost function for what flying after it events right and so um [00:57:53] flying after it events right and so um through this set of Diagnostics you [00:57:58] through this set of Diagnostics you could decide which one of these three [00:58:00] could decide which one of these three things improving the simulator improving [00:58:03] 
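The comparison described here can be sketched in a few lines of Python. This is a hypothetical illustration, not code from the course: `j_squared_error` stands in for whatever cost J the reinforcement learning algorithm minimizes, and the trajectories are made-up one-dimensional stand-ins for real helicopter state sequences.

```python
import numpy as np

def j_squared_error(trajectory, target):
    # The cost J(theta) the RL algorithm is minimizing: mean squared
    # deviation of the flown trajectory from the desired trajectory.
    return float(np.mean((np.asarray(trajectory) - np.asarray(target)) ** 2))

def rl_diagnostic(j_human, j_rl):
    # Assumes we already know the human pilot visibly flies better.
    if j_human < j_rl:
        # Case 1: a better controller exists (the human's), so the RL
        # algorithm is failing to minimize J -- work on the RL algorithm.
        return "improve the RL algorithm"
    # Case 2: the RL controller achieves a lower (or equal) cost yet flies
    # worse, so minimizing J does not correspond to flying well -- work on
    # the cost function.
    return "improve the cost function"

# Toy example: a hypothetical hover target, with the human tracking it
# more tightly than the learned controller.
target = np.zeros(100)
human_traj = np.random.RandomState(0).normal(0.0, 0.1, 100)
rl_traj = np.random.RandomState(1).normal(0.0, 0.3, 100)
verdict = rl_diagnostic(j_squared_error(human_traj, target),
                        j_squared_error(rl_traj, target))
```

Here the human's tighter tracking yields a lower squared error, so the diagnostic points at the RL algorithm; flipping the two trajectories would point at the cost function instead.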
[00:58:11] And what actually happens in this particular project, and what often happens in machine learning applications, is you run this set of diagnostics, and this actually happened when we were working on this helicopter: we'd run the set of diagnostics, and one week we'd say, yep, the simulator is the problem, let's work on that, and we'd improve the simulator. After a couple weeks of work we'd run these diagnostics again and say, oh, it looks like the simulator is now good enough, and maybe there's a problem with the RL algorithm; then we'd work on that and improve it. And after a while we'd say, oh, that's also good enough, and the problem is in the cost function. And sometimes the location of the most acute problem shifts: after you've cleared out one set of problems, it might be the case that now the bottleneck is the simulator again. And so I often use this workflow to constantly drive prioritization for what to work on next.

[00:59:08] And to answer the question just now about how you find a new cost function: it turns out finding a new cost function is actually not that easy. One of my own former PhD students, Adam Coates, through this type of process, realized that finding a good cost function is actually really difficult, because if you want a helicopter to fly a maneuver, like fly at speed and make a banked turn, how do you mathematically define what an accurate banked turn means? It's really difficult to write down an equation that specifies what is a good way to fly like that, or what a good turn even is. So he wound up writing a research paper, one of the best application papers I've seen, on how to define a good cost function. It's actually pretty complicated. But the reason he did it, and it was a good use of his time, was that running diagnostics like these gave us confidence that this was actually a worthwhile problem, and that resulted in, you know, making real progress.

[01:00:07] Um, any questions about this? All right, cool. Let's not show this slide; that's fine, you guys saw some of these earlier.

[01:00:44] So in addition to these specific diagnostics of bias versus variance, and of the optimization algorithm versus the optimization objective... oh, sorry, when we cover RL I want to just go through that example one more time,
so you see everything we just saw again after you've learned about reinforcement learning later in this course. Okay.

[01:01:04] Now, in addition to these types of diagnostics for how to debug learning algorithms, there's one other set of tools you'll find very useful, which is error analysis tools. This is another way for you to figure out what's working and what's not working in the learning algorithm. So let's go through a motivating example. Let's say you're building, you know, a security system, so when someone walks in front of a door, you unlock the door or not based on whether or not that person is authorized to enter, right? And there are a lot of machine learning applications where it's not just one learning algorithm; instead you have a pipeline that strings together many different steps.

[01:01:55] So how do you build a face recognition algorithm to decide if someone approaching your front door is authorized, to unlock the door? Well, here's something you could do. You start with a camera image like this, and then you could do pre-processing to remove the background, so all that complicated colored background, let's get rid of that. And it turns out that when you have a camera against a static background, you can actually do this, up to a little bit of noise, relatively easily, because if you have a fixed camera that's just mounted, you know, on your doorframe, it always sees the same background. So you can just look at what pixels have changed, and keep only the pixels that have changed, right? Because this camera always sees that gray background and some brown bench in the back, just looking at what pixels have changed does the background removal. This is actually feasible: look at what pixels have changed, and keep the pixels that have changed relative to the background.

[01:02:55] And after getting rid of the background, you could run a face detection algorithm. Then, after detecting the face, it turns out, and I've actually worked on a bunch of face detection and face recognition systems, that for some of the leading face recognition systems, though it depends on the details, the appearance of the eyes is a very important cue for recognizing people. This is why, if you cover your eyes, it's much harder to recognize people; eyes are very distinctive.
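The pixel-differencing idea for a fixed camera can be sketched as below. This is a toy illustration under assumed conventions (grayscale frames as NumPy arrays, a stored reference image of the empty scene, an arbitrary change threshold), not the actual system from the story.

```python
import numpy as np

def remove_background(frame, reference, threshold=25):
    # Keep only the pixels that changed relative to the stored reference
    # image of the empty scene; zero out everything that looks unchanged.
    diff = np.abs(frame.astype(np.int32) - reference.astype(np.int32))
    return np.where(diff > threshold, frame, 0).astype(np.uint8)

# Toy 8x8 grayscale scene: a flat background, then a frame where a bright
# "person" block has appeared in front of it.
reference = np.full((8, 8), 100, dtype=np.uint8)
frame = reference.copy()
frame[2:6, 3:5] = 200            # the new foreground object
foreground = remove_background(frame, reference)
```

A waving tree violates the static-background assumption, which is exactly why real background-removal algorithms get so much more complicated, as the story below illustrates.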
[01:03:30] So people will segment out the eyes, segment out the nose and the mouth, and then feed these features into some other algorithm, say logistic regression, that then, you know, finally outputs a label saying whether this is a person you're authorized to open the door for. So many learning applications have a complicated pipeline like this, of different components that have to be strung together. And if you read the newspaper articles, or if you read research papers in machine learning, often the research papers will say, oh, we built a machine translation system where we trained on a gazillion sentences found on the internet, and it does great, and it's a pure end-to-end system: there's like one learning algorithm that sucks in an input, sucks in an English sentence, and spits out the French sentence or something. So that's like one learning algorithm.

[01:04:33] It turns out that for a lot of practical applications, if you don't have a gazillion examples, you end up designing much more complex machine learning pipelines like this, where it's not just one monolithic learning algorithm; instead there are many different smaller components. And I think that, you know, having a lot of data is great, right, I love having more data, but big data has also been a little bit overhyped, and there are a lot of things you can do with small datasets as well. In the teams I work with, we find that with a relatively small dataset you can often still get great results; my teams often get great results with 100 images, a hundred training examples or something. But when you have small data, it often takes more insightful design of machine learning pipelines like this.
[01:05:31] Now, when we have a machine learning pipeline like this, here's one thing you want to do. You build a pipeline like this and it doesn't work, right? There's this common workflow: you build something, it doesn't work, so you want to debug it. So in order to decide which part of the pipeline to work on, it's very useful if you can look at the error of your system and try to attribute the error to the different components, so you can decide which component to work on next, right?

[01:06:04] And here I'll tell you a true story about the pre-processing background removal step. Since you're getting rid of the background, it turns out there are a lot of details in how to do background removal. For example, the simple way to do it is to look at every pixel and just see which pixels have changed. But it turns out that if there's a tree in the background that waves a little bit, because the wind moves the tree and blows the leaves and branches around, then sometimes the background pixels do change a little bit. And so there are actually really complicated background removal algorithms that try to model, basically, the trees and the bushes moving around a little bit in the background, so that even though the pixels of the tree move around, that part of the background still gets removed. So for background removal there are simple versions, where you just look at each pixel and see how much it has changed, and there are incredibly complicated versions.

[01:06:57] So I actually know someone who was trying to work on a problem like this, and they decided to improve the background removal algorithm; this person actually, literally, wrote a PhD thesis on background removal. And I'm glad he got a PhD, but when I look at the problem he was actually trying to solve, I don't think it actually moved the needle, you know? So, you can still publish a paper, and it was technically innovative, I thought it was very good technical work, but if your goal is to build a better face recognition system, then I would carefully ask which components you should actually spend your time working on.

[01:07:49] So here's what you can do with error analysis. Say your overall system has eighty-five percent accuracy. What I would do is go into your dev set, your development set or hold-out cross-validation set, right, and for every one of the examples in the dev set, I would plug in the ground truth
for the background, meaning that rather than using some, you know, approximate heuristic algorithm for roughly cleaning out the background, which may or may not work that well, I would just use Photoshop, and for every example in the dev set I would give it the perfect background removal. So imagine that, instead of some noisy algorithm trying to remove the background, this step of the pipeline just had perfect performance, right? You can give it perfect performance on your dev set just by using Photoshop to tell it: this is the background, this is the foreground. And let's say that when you plug in this perfect background removal, the accuracy improves to eighty-five point one percent.

[01:08:57] Then you can keep going from left to right in this pipeline. Instead of using some learning algorithm to do face detection, just go in, and for the dev set, have the face detection algorithm cheat, right? Have it just memorize the right location of the face for each dev set example, so it gets a perfect result. So when I shade in these boxes, that means I'm giving that component a perfect result. So let's go in and, on the dev set, give it perfect face detection for every single example, then look at the final output and see how that changes the accuracy of the final output. And then do the same for these components: eye segmentation, nose segmentation, mouth segmentation. You do this one at a time, and then finally, for the logistic regression, if you give it the perfect output, your accuracy should be a hundred percent, right? So now what you can do is look at the sequence of steps and see which one gave you the biggest gain.

[01:10:06] And it looks like, in this example, when you gave it perfect face detection, the accuracy improved from eighty-five point one to ninety-one percent, so roughly a six percent improvement. That tells you that if only you could improve your face detection algorithm, maybe your overall system could get better by as much as six percent. So this gives you faith that maybe it's worth improving your face detection component. In contrast, this tells you that even if you had perfect background removal, it's only 0.1 percent better, so maybe don't spend too much time on that. And it looks like when you gave it perfect eye segmentation, it went up another four percent, so maybe that's another good project to prioritize, right? Um, and if you're in a team, one common structure would be to do this type of analysis.
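The bookkeeping behind this plug-in-ground-truth analysis is a simple difference table. Here is a sketch using the accuracies from the lecture where given (85%, 85.1%, 91%, and the roughly four-percent eye-segmentation gain), with an invented placeholder number for the nose/mouth stage:

```python
def component_gains(baseline, cumulative):
    # `cumulative` lists (component, accuracy) after plugging in ground
    # truth for that component and every component to its left in the
    # pipeline. The gain attributed to each component is the jump over
    # the previous accuracy in the sequence.
    gains, prev = {}, baseline
    for component, acc in cumulative:
        gains[component] = acc - prev
        prev = acc
    return gains

gains = component_gains(85.0, [
    ("background removal", 85.1),        # from the lecture
    ("face detection", 91.0),            # from the lecture
    ("eye segmentation", 95.0),          # lecture's "another four percent"
    ("nose/mouth segmentation", 97.0),   # placeholder
    ("logistic regression", 100.0),      # perfect final stage -> 100%
])
most_promising = max(gains, key=gains.get)
```

The component with the biggest single jump (here face detection, at 5.9 points) is the one most worth prioritizing.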
And then have some people work on face detection, some people work on eye segmentation; you can usually do a few things in parallel if you have a larger team. But at least this should give you a sense of the relative priority of the different things. Question?

[01:11:29] Yeah, right, so the question is: do you do this cumulatively, giving it perfect eye segmentation and then adding on top perfect nose segmentation, or do you give it perfect eye segmentation, then take that away and give it perfect nose segmentation instead? The way I presented it here, it's done cumulatively. And it turns out that once you give it perfect results in the later stages, maybe the earlier stages don't matter that much anymore, so that's one pattern. But it turns out you could do it either way: for the eyes, nose, and mouth, you could do it cumulatively or one at a time, and you'll probably get relatively similar results. No guarantee, you might get different results in terms of conclusions, but to the extent that you're worried that cumulative versus one-at-a-time might give you different results, I would just do it both ways. And I think this error analysis is not a hard mathematical rule, if that makes sense; it's not that you do this and then there's a formula that tells you, okay, work on face detection. I think this should be married with judgment about, you know, how hard you think it is to improve face detection versus eye segmentation. But this at least gives you a sense of prioritization, and it's worth doing it in multiple ways if you're concerned about a discrepancy between the cumulative and individual versions.
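The one-at-a-time variant just described swaps the cumulative table for per-component measurements against the same baseline. Again a hypothetical sketch with invented numbers:

```python
def individual_gains(baseline, solo):
    # `solo` lists (component, accuracy) when ONLY that component is given
    # ground truth and the rest of the pipeline is left untouched, so each
    # gain is measured in isolation against the same baseline.
    return {component: acc - baseline for component, acc in solo}

# Hypothetical per-component numbers; with interacting stages these need
# not match the cumulative analysis, which is why it can be worth running
# the analysis both ways and comparing conclusions.
gains = individual_gains(85.0, [
    ("background removal", 85.2),
    ("face detection", 90.5),
    ("eye segmentation", 88.0),
])
```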
versions. [01:12:57] Um, so when you have a complex machine learning pipeline, this type of error analysis helps you break down the error, to attribute the error to different components, which lets you focus your attention on what to work on. [01:13:15] Oh, all right, yeah: if you give it perfect face detection and then your error jumps up, what does that mean? Um, it's not impossible for that to happen; it would be quite rare. At a high level, what I would do is go in and try to figure out what's going on. Actually, I wouldn't ignore that. So this is another thing I see sometimes: a team discovers a weird phenomenon like that and they just ignore it and move on. I wouldn't do that. Whenever you find one of these weird things, I wouldn't gloss over it; I would go in and figure out what's going on. Does this make sense? It's like debugging software.
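The cumulative version of this error analysis can be sketched in code. This is a minimal illustration only: the function name, the stage names, and every accuracy number below are made up to mirror the face-recognition pipeline example, not measurements from the lecture.

```python
# Sketch of cumulative error analysis for a pipeline (illustrative only).
def cumulative_error_analysis(baseline_accuracy, accuracy_with_perfect):
    """Given overall accuracy after cumulatively plugging in ground truth
    for each stage, report each stage's marginal accuracy gain."""
    report = []
    prev = baseline_accuracy
    for stage, acc in accuracy_with_perfect:
        report.append((stage, acc - prev))  # gain from perfecting this stage
        prev = acc
    return report

# Hypothetical accuracies measured after making each stage
# (and all the stages before it) perfect:
measurements = [
    ("preprocessing",       85.1),
    ("face detection",      91.0),
    ("eye segmentation",    95.0),
    ("nose segmentation",   96.0),
    ("mouth segmentation",  97.0),
    ("final classifier",   100.0),
]
for stage, gain in cumulative_error_analysis(85.0, measurements):
    # A large marginal gain suggests that stage is worth working on.
    print(f"{stage:20s} +{gain:.1f}%")
```

As the lecture says, the big jumps in the table (here, the hypothetical face-detection stage) point at where to focus attention; the formula itself is just successive differences.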
If you're debugging a piece of software and whenever you move your mouse over some button, some random pixel color changes, you go, huh, that's weird. And then some people just ignore it and say, oh well, the user won't see this. [01:14:19] So what you're describing is quite rare but not impossible, and I don't have an easy recipe for how to figure out what's going on, but I would want to figure out what's going on. All right, so one last thing before we break. [01:14:35] So error analysis helps figure out the difference between where you are now, say 85% overall system accuracy, and 100%, right? So it tries to explain the difference between where you are and, you know, perfect performance. There's a different type of analysis, called ablative analysis, which figures out the difference between where you are and something much worse. So here's what I mean. [01:14:57] Um, so let's say that you built a good anti-spam
classifier by adding lots of clever features. So this is logistic regression, right, and, you know, spelling correction, because spammers try to misspell words to mess up the tokenizer, to make spammy words not look like spam. Sender host features: what machines did the email come from? Email header features. You could have a parser from NLP to parse the text, use a JavaScript parser to understand it, or even go and fetch the web pages that the email refers to and parse those. [01:15:39] And the question is, how much did these components really help? And it turns out, if you're writing a research paper, you know, sometimes your result is to say, hey, look, I built a great spam classifier, and that's okay, that's a nice result to have. But if you can explain to your reader, either in a research paper or in a class project report, like a term project, what actually made the difference, that conveys
a lot of insight as well. [01:16:02] So, um, say simple logistic regression without all of these clever features gets ninety-four percent performance, and with the addition of all these clever features you get ninety-nine percent accuracy. [01:16:19] So in ablative analysis, what you do is remove the components one at a time to see how it breaks, right? So just now we were adding to the system by making components perfect; with error analysis it's how it improves. Here we're going to remove things one at a time. (I did not mean to erase that; let me figure out what's going on with PowerPoint, all right.) Remove things one at a time to see how it breaks. So let's say you remove spelling correction, and with that set of features the accuracy goes down a bit. Then let's remove the sender host features, remove the email header features, and so on, until, when you've removed all of these features, you end up
there. [01:17:03] And again, you could do this cumulatively, or remove one and put it back, remove one and put it back; you could do it both ways and see if they give you slightly different insights. And so the conclusion from this particular analysis is that the biggest gap is from the text parser features, because when you removed those, the accuracy went down by four percent. So, you know, that's strong evidence. If you want to publish a paper, you can say, right, text parser features significantly improve spam filter accuracy, and that's a useful level of insight. [01:17:38] And then if you're working on a spam filter for many years, right, and there are really important applications out there where the same team will work on one for many years, this type of analysis gives you intuition about what's important and what's not, and helps you decide what to work on.
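The ablative loop itself can be sketched in a few lines. This is an invented stand-in, not the lecture's actual experiment: `toy_eval` fakes a train-and-evaluate step, and only the 94%-to-99% range and the roughly four-percent text-parser contribution echo the numbers mentioned above; the other contributions are made up.

```python
# Sketch of ablative analysis: start from the full system and remove
# feature groups one at a time (cumulatively), re-evaluating each time.
def ablative_analysis(feature_groups, train_and_eval):
    """Remove feature groups cumulatively; return accuracy after each removal."""
    remaining = list(feature_groups)
    results = []
    for group in feature_groups:
        remaining.remove(group)
        results.append((group, train_and_eval(remaining)))
    return results

# Pretend each feature group adds a fixed bump over a 94% baseline
# (illustrative numbers; in reality you would retrain the classifier).
CONTRIB = {"spelling correction": 0.5, "sender host": 0.25,
           "email header": 0.25, "text parser": 4.0}

def toy_eval(groups):
    return 94.0 + sum(CONTRIB[g] for g in groups)

for removed, acc in ablative_analysis(list(CONTRIB), toy_eval):
    print(f"after also removing {removed:20s} accuracy = {acc:.2f}%")
```

The component whose removal causes the largest drop (here the stand-in text parser features) is the one carrying the most weight, which is exactly the conclusion the lecture draws.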
Maybe you even double down on the text parser features; or maybe the sender host features are computationally expensive to compute, and this tells you that you can just get rid of them without too much harm. And also, if you're publishing a paper or writing a report, this gives much more insight into your results. Okay, all right. [01:18:12] Um, so that's it for error analysis and ablative analysis; I hope this will be useful for your class projects as well. Take one last question. Oh, right. [01:18:21] Oh yeah, there was no systematic way; if you had a systematic way, you'd do that. The other way, the non-cumulative way, would be to remove one component and then put it back, then remove another and put it back. So either way it works. [01:18:37] All right, let's break. The problem set is due tonight, a friendly reminder, and problem set three will be posted in the next, like, several tens of minutes. Okay, thanks everyone.
================================================================================ LECTURE 014 ================================================================================ Lecture 14 - Expectation-Maximization Algorithms | Stanford CS229: Machine Learning (Autumn 2018) Source: https://www.youtube.com/watch?v=rVfZHWTwXSA --- Transcript [00:00:03] All right, um, let's get started. So, um, let's see, a logistical reminder: the class midterm is this Wednesday, and the logistical details you can find at this Piazza post, right? So the midterm will start Wednesday evening; you'll have a fixed window to do it and then submit it online through Gradescope. And because of the midterm, there won't be a section this Friday, okay? [00:00:33] Oh, and the midterm will cover everything up to and including EM, which we'll spend most of today talking about. Don't look so stressed, it'll be fun. All right. [00:00:47] Um, so what I'd like to do today is start our foray into unsupervised learning. So far we've spent a lot of time on supervised learning algorithms, including advice on how to
apply them: algorithms in which you'd have, you know, positive examples and negative examples, and you run logistic regression or an SVM or something to find the line, find the decision boundary between them. [00:01:14] In unsupervised learning, you're given unlabeled data. So rather than being given data with x and y, you're given only x, and so your training set now looks like x(1), x(2), up through x(m), and you're asked to find something interesting about the data. [00:01:37] So the first unsupervised learning algorithm we'll talk about is clustering, in which, given a data set like this, hopefully we can have an algorithm that can figure out that the data set has two separate clusters. And so one of the most common uses of clustering is market segmentation: if you have a website, you know, selling things online, you have a huge database of many different users, and you can run clustering to
decide what the different market segments are. [00:02:05] Right, so there may be, you know, people of a certain age range and a certain gender, people of a different age range or a different level of education, people on the East Coast versus the West Coast versus elsewhere in the country; by clustering, you can group people into different groups, right? [00:02:22] So I want to show you an animation of really the most commonly used clustering algorithm, called k-means clustering. Let me show you an animation of what k-means does, and then we'll write out the math and how you can implement it. So, um, say you're given a data set like this. All of these are unlabeled examples, so they're just x's plotted here, and we want an algorithm to try to find maybe the two clusters here. [00:02:50] The first step of k-means is to pick two points, denoted by the two crosses, called cluster centroids, and the cluster centroids are your
best guess for where the centers of the two clusters you're trying to find are. [00:03:03] And then k-means is an iterative algorithm, and repeatedly you do two things. The first thing is to go through each of your training examples... oh, I'm sorry... oh, okay, thank you; let me know if it happens again. Okay, right, so you have two cluster centroids. So the first thing you do is go through each of your training examples, the green dots, and for each of them, you color it either red or blue depending on which is the closer cluster centroid. So here we've taken every dot and colored it, you know, red or blue depending on which cluster centroid it is closer to. [00:03:37] And then the second thing you do is look at all the blue dots and compute the average, right, just find the mean of all the blue dots, and move the blue cluster centroid there; and similarly, look at all
the red dots, look at only the red dots, and find their mean. (Oh, what's wrong with this? Oh, this thing is being very strange. All right, apparently if I keep moving my mouse it doesn't do that. All right, thank you.) [00:04:02] And then find the mean of all the red dots and move your red cluster centroid there. So let me do that, right: the cluster centroids move as follows, to the means of the red and the blue dots, since it's just the standard arithmetic mean. [00:04:18] And then you repeat again, where you look at each of the dots and color it either red or blue depending on which cluster centroid is closer. So we recolor every point based on, you know, what's closer, and that's the new set of colors. And then the second part of the algorithm was, again: look at the blue dots, find their mean; look at the red dots, find their mean; and then move the cluster centroids over, excuse me,
to that mean, okay. [00:04:50] And so it turns out, if you keep running the algorithm, nothing changes, so the algorithm has converged. So if you look at this picture and you repeatedly color each point red or blue depending on which cluster centroid is closer, nothing changes; and if you repeatedly look at each of the two clusters of colored points, compute their mean, and move the cluster centroid there, nothing changes. So this algorithm has converged, even if you keep on running these two steps, okay? [00:05:19] So, um, let's see, let's write down in math what we just did. [00:05:35] All right, so this is, um, a clustering algorithm, and specifically this is the k-means clustering algorithm. So your data set now does not come with any labels, and so in k-means, step one is: initialize the cluster centroids, right; I'm going to call them mu_1 up through mu_K, randomly. [00:06:13] So this was the step where you plopped down the red cross and the blue cross, and when I did it on the
PowerPoint, you know, I did it as if we'd just chosen these as random vectors. [00:06:23] In practice, the good way, actually the most common way, to select the random initial cluster centroids isn't quite what I showed: it's to actually pick K examples out of your training set and just set the cluster centroids to be equal to those K randomly chosen examples. Right, so in a low-dimensional space, like the 2D plot you can do in a diagram, it doesn't really matter, but when you work with very high-dimensional data sets, the more common way to initialize is to just pick, you know, K training examples and set the cluster centroids to be at exactly the locations of those examples; but in low-dimensional spaces, it doesn't make a big difference. [00:07:00] And then next, you repeat until convergence: one is
[00:08:17] So the two steps you alternate between: the first one is, set c(i) for every value of i. So for every example, set c(i) equal to, you know, either 1 or 2, depending on whether that example x(i) is closer to cluster centroid 1 or cluster centroid 2, right? So this is the step of taking a point and coloring it either red or blue, and we represent that by setting c(i) equal to 1 or 2, if you have two clusters, if K is equal to 2. [00:08:59] (Oh, the notes say L1 norm squared? The ones from this morning, sent out this morning? Oh, that's weird; it shouldn't be the L1 norm, and if it says the L1 norm, that's a mistake. And it turns out whether you use the L2 norm or the L2 norm squared, they give you the same answer, because the arg min is the same either way, but you'd usually do one of those. Oh, I see... okay, looks like the notes do say that, okay, cool.) [00:09:36] But by default, when we write that
norm, we actually mean the L2 norm, yeah. By default this is the L2 norm of x if it's unspecified; if it's the L1 norm, we usually write it with a subscript. So the L2 norm is more common, and with or without the square you get the same answer. Okay, thank you. [00:09:55] All right, so that's coloring the dots: painting each dot either red or blue. And then for the second step, for each cluster, take all the examples assigned to a certain cluster, right, assigned to cluster j, and set mu_j to be the average of all the points assigned to that cluster. Yeah? [00:10:31] (Oh, you know... all right, none of the black markers are working. Is this better? All right, let me try to use this one. Is this part unclear? If you can't see this part, oh, I'll write it out more clearly, sure. [00:11:00] Got it. Let there be light. All right, awesome, great, that was the easy request.)
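The two alternating steps just described can be sketched directly in code. This is a minimal illustration under my own assumptions (the function name, the random seed, and the convergence check are mine, not from the lecture), using NumPy and the initialize-at-K-random-training-examples scheme mentioned above:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means sketch: X is an (m, d) array of unlabeled examples."""
    rng = np.random.default_rng(seed)
    # Initialize centroids at k randomly chosen training examples.
    mu = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iters):
        # Step 1: c(i) = argmin_j ||x(i) - mu_j||^2  ("color each point")
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        c = dists.argmin(axis=1)
        # Step 2: mu_j = mean of the points assigned to cluster j
        new_mu = np.array([X[c == j].mean(axis=0) if np.any(c == j) else mu[j]
                           for j in range(k)])
        if np.allclose(new_mu, mu):  # nothing changes: converged
            break
        mu = new_mu
    return c, mu
```

Running this on two well-separated blobs of points recovers the two clusters regardless of which pair of examples the initialization happens to pick, mirroring the animation in the lecture.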
Okay, I'll let you look at it for another minute. All right, okay, thank you. [00:11:26] (Go for it... and this wasn't positive? Okay.) All right, now I can move it up. [00:11:44] All right, um, so it turns out that this algorithm can be proven to converge. Exactly why is written out in the lecture notes, but it turns out that if you write this as a cost function, [00:12:10] so the cost function for a certain set of assignments of examples to cluster centroids and for a certain set of positions of the cluster centroids, so c, these are the assignments, and mu, these are the centroids, [00:12:27] right, so this cost here is the sum over your training set of the squared distance between each point and the cluster centroid it is assigned to. So it turns out, I won't prove this, a little bit more detail is written out in the lecture notes, but it turns out that on every iteration, k-means will drive this cost
function down, and so, you know, beyond a certain point this cost function can't go any lower; look, it just can't go below zero, right? And so this shows that k-means must converge, or at least this function must converge, because there's a non-negative function that's going down on every iteration, so at some point it has to stop going down, and then you could declare k-means to have converged. [00:13:10] In practice, if you're running k-means on a very, very large dataset, then as you plot J against the number of iterations, J may go down, and, you know, just because of lack of compute or lack of patience, you might just stop running it after a while if it's going down too slowly. So that's sort of k-means in practice: maybe it hasn't totally converged, but you just cut it off and call it good enough. [00:13:33] Now, the most frequently asked question I get about k-means is how do you choose K?
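The cost function described above, sometimes called the distortion, can be written out explicitly (this is a transcription into symbols of the sum-of-squared-distances description given in the lecture):

```latex
J(c, \mu) \;=\; \sum_{i=1}^{m} \left\lVert x^{(i)} - \mu_{c^{(i)}} \right\rVert^{2}
```

Step 1 minimizes $J$ with respect to the assignments $c$ while holding the centroids $\mu$ fixed, and step 2 minimizes $J$ with respect to $\mu$ while holding $c$ fixed; since $J \ge 0$ and neither step can increase it, $J$ must converge, which is the convergence argument sketched above.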
It turns out that when I use k-means, I still usually choose K by hand. [00:13:47] And why? Because in unsupervised learning, sometimes it's just ambiguous how many clusters there are. [00:13:54] With this dataset, some of you will see two clusters and some of you will see four clusters, and it's just inherently ambiguous what the right number of clusters is. So there are some formulas you can find online, with criteria like AIC and BIC, for automatically choosing the number of clusters; in practice I tend not to use them, [00:14:22] because I usually look at the downstream application of what you actually want to use k-means for in order to make a decision on the number of clusters. So for example, if you're doing market segmentation because your marketers want to design different marketing campaigns for different groups of users, then your marketers might have the bandwidth to design four separate marketing campaigns but not a hundred marketing campaigns, so there'd be good reason to choose four clusters rather than a hundred clusters. [00:14:49] So often you look at the purpose of what you're doing this for. I think in the programming exercise in the homework you'll see an image compression exercise where you want to cluster colors into a smaller number of clusters; you implement this, and it's actually one of the most fun exercises, I think. But there you'd be asking how much you want to compress the image in order to decide how many clusters to use. [00:15:16] Okay, so I usually pick the number of clusters either manually or by looking at what you want to use k-means clustering for. Are you trying to cluster news articles, like the Google News example I think I showed in the first lecture? Then you'd say, well, how many clusters kind of make sense for news articles? Okay.
[00:15:37] All right, so... oh, sure, you're asking whether it can get stuck in local minima? Oh yes, k-means does get stuck in local minima sometimes. And so if you're worried about local minima, one thing you can do is run k-means, say, ten times or a hundred times or a thousand times, from different random initializations of the cluster centroids, and then pick whichever run resulted in the lowest value for this cost function. [00:16:12] All right, so you'll play with this more in the programming exercise. [00:16:22] Now, there's a problem that seems closely related but is actually quite different, for which we'll derive different algorithms, which is density estimation. So let me motivate this. Some time back I had some friends working on a problem, which I've simplified a little bit, like this.
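The random-restart remedy for local minima mentioned a moment ago (run k-means many times from different random initializations and keep the lowest-cost run) can be sketched directly. A minimal sketch; the helper names, restart count, and iteration cap are my own, not the course's reference code:

```python
import numpy as np

def kmeans_once(X, k, rng, iters=20):
    """One k-means run from a random initialization.
    Returns centroids, assignments, and the final distortion J."""
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        c = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
        for j in range(k):
            if np.any(c == j):
                centroids[j] = X[c == j].mean(axis=0)
    J = float(((X - centroids[c]) ** 2).sum())
    return centroids, c, J

def kmeans_restarts(X, k, n_restarts=100, seed=0):
    """Run k-means from many random initializations and keep the run with
    the lowest cost, as a guard against bad local minima."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):
        centroids, c, J = kmeans_once(X, k, rng)
        if best is None or J < best[2]:
            best = (centroids, c, J)
    return best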
line alright and every time an aircraft engine comes on the [00:16:51] time an aircraft engine comes on the assembly line you measure some features [00:16:53] assembly line you measure some features of this engine so you measure some [00:16:54] of this engine so you measure some features about the vibration and you [00:16:56] features about the vibration and you measure some features of all the heat [00:16:58] measure some features of all the heat that the aircraft engine is producing [00:17:01] that the aircraft engine is producing and let's say that you gathered a set [00:17:09] and the anomaly detection problem is if [00:17:21] and the anomaly detection problem is if you get a new aircraft engine that comes [00:17:23] you get a new aircraft engine that comes off the assembly line and if the [00:17:25] off the assembly line and if the vibration feature it takes on this value [00:17:27] vibration feature it takes on this value and the heat feature takes on this value [00:17:29] and the heat feature takes on this value is that aircraft engine an anomalous one [00:17:32] is that aircraft engine an anomalous one this is your [00:17:33] this is your right and so the application of this is [00:17:36] right and so the application of this is that as your aircraft engine comes off [00:17:39] that as your aircraft engine comes off the assembly line if you see a very [00:17:40] the assembly line if you see a very unusual signature in terms of the [00:17:42] unusual signature in terms of the vibrations and heat the aircraft engine [00:17:44] vibrations and heat the aircraft engine is generating then probably something's [00:17:46] is generating then probably something's wrong with this aircraft engine if your [00:17:48] wrong with this aircraft engine if your people have you have your team inspected [00:17:50] people have you have your team inspected further or tested further before you [00:17:52] further or tested further before you should the airplane before you ship the 
[00:17:54] should the airplane before you ship the engine tort or airplane may occur and [00:17:56] engine tort or airplane may occur and then something goes around the air and [00:17:57] then something goes around the air and there's a there's a major accident a [00:17:59] there's a there's a major accident a major disaster right [00:18:01] major disaster right and so anomaly detection is most [00:18:04] and so anomaly detection is most commonly done or one of the common ways [00:18:06] commonly done or one of the common ways to implement anomaly detection is the [00:18:10] to implement anomaly detection is the model P of X which is given all of these [00:18:13] model P of X which is given all of these blue examples given all these thoughts [00:18:16] blue examples given all these thoughts can you model what is the density from [00:18:19] can you model what is the density from which X was drawn so then if P of X is [00:18:24] which X was drawn so then if P of X is very small then you flag an anomaly [00:18:29] very small then you flag an anomaly meaning that gee I think something's [00:18:31] meaning that gee I think something's funny here and maybe someone should [00:18:34] funny here and maybe someone should inspect this aircraft engine a little [00:18:37] inspect this aircraft engine a little bit further sonar detection is used for [00:18:40] bit further sonar detection is used for tasks like this for inspection tossed [00:18:43] tasks like this for inspection tossed like this is used for many years ago as [00:18:46] like this is used for many years ago as su work of some telecoms providers with [00:18:49] su work of some telecoms providers with you know helping out telecoms company e [00:18:51] you know helping out telecoms company e on anomaly detection to figure out if [00:18:54] on anomaly detection to figure out if something's gone wrong with part of this [00:18:56] something's gone wrong with part of this cells her network right so if one day [00:18:58] 
cells her network right so if one day one of the South Tower starts throwing [00:19:00] one of the South Tower starts throwing off network patterns that seem very [00:19:02] off network patterns that seem very unusual then maybe something's wrong [00:19:03] unusual then maybe something's wrong with that cell tower like that [00:19:05] with that cell tower like that something's gone wrong it sent out the [00:19:06] something's gone wrong it sent out the technician to fix it it's also used a [00:19:09] technician to fix it it's also used a computer security of a computer save [00:19:11] computer security of a computer save computer Stanford start sending are very [00:19:14] computer Stanford start sending are very strange [00:19:14] strange you know network traffic there's very [00:19:17] you know network traffic there's very unusual relative their views on the four [00:19:19] unusual relative their views on the four browser what was this is a very [00:19:21] browser what was this is a very anomalous network traffic then maybe IT [00:19:23] anomalous network traffic then maybe IT stops you have a look to see if that [00:19:25] stops you have a look to see if that good computer has been hacked so these [00:19:27] good computer has been hacked so these are some of the applications that were [00:19:29] are some of the applications that were an all new section and what good way to [00:19:31] an all new section and what good way to do this is given the unlabeled data set [00:19:33] do this is given the unlabeled data set model P of X and then if you have very [00:19:35] model P of X and then if you have very low probability examples you flag that [00:19:37] low probability examples you flag that as a possible anomaly for further study [00:19:40] as a possible anomaly for further study now given this data sets how do you [00:19:45] now given this data sets how do you model this one distinct thing about this [00:19:47] model this one distinct thing about this green dots is 
that neither the vibration [00:19:50] green dots is that neither the vibration no the heat signature is actually out of [00:19:52] no the heat signature is actually out of range right you know like there are a [00:19:53] range right you know like there are a lot of aircraft engines with vibrations [00:19:56] lot of aircraft engines with vibrations in that range they're long of aircraft [00:19:57] in that range they're long of aircraft engines with heat in that range so [00:19:59] engines with heat in that range so neither feature by itself is actually [00:20:01] neither feature by itself is actually data unusual it's actually the [00:20:02] data unusual it's actually the combination of the two that is unusual [00:20:04] combination of the two that is unusual and so that's less what I want to do is [00:20:07] and so that's less what I want to do is uh come up with an algorithm to model [00:20:10] uh come up with an algorithm to model this and in fact welcome of an algorithm [00:20:12] this and in fact welcome of an algorithm they can model you know maybe maybe your [00:20:15] they can model you know maybe maybe your data density looks like this made more [00:20:16] data density looks like this made more of an L shape like that but how do you [00:20:18] of an L shape like that but how do you model P of X with the data coming from [00:20:22] model P of X with the data coming from an L shape and it turns out that there [00:20:24] an L shape and it turns out that there is no textbook distribution right you [00:20:27] is no textbook distribution right you know there isn't you know if you look at [00:20:28] know there isn't you know if you look at this simple and there's no exponential [00:20:30] this simple and there's no exponential family model the types of distributions [00:20:32] family model the types of distributions there is no distribution for modeling [00:20:35] there is no distribution for modeling very very complex distributions like [00:20:37] very very 
complex distributions like this so what I'm going to talk about is [00:20:40] this so what I'm going to talk about is the mixture of gaussians volatile which [00:20:42] the mixture of gaussians volatile which would look for data like this and say it [00:20:44] would look for data like this and say it looks like this data actually comes from [00:20:46] looks like this data actually comes from two Gaussian there's one Gaussian maybe [00:20:48] two Gaussian there's one Gaussian maybe that's one type of aircraft engine that [00:20:50] that's one type of aircraft engine that you know it's drawn from a Gaussian like [00:20:52] you know it's drawn from a Gaussian like the one below and a separate aircraft [00:20:54] the one below and a separate aircraft type of aircraft engine that's drawn [00:20:57] type of aircraft engine that's drawn from a Gaussian like that above and this [00:21:00] from a Gaussian like that above and this is why there's a lot of probably Mars in [00:21:02] is why there's a lot of probably Mars in just O'Shea region by very low [00:21:04] just O'Shea region by very low probability outside that O'Shay region [00:21:07] probability outside that O'Shay region right oh and these ellipses I'm drawing [00:21:09] right oh and these ellipses I'm drawing other contours of these two gaussians [00:21:11] other contours of these two gaussians right and so what I'd like to do next is [00:21:16] right and so what I'd like to do next is develop the mixture of gaussians model [00:21:19] develop the mixture of gaussians model which is useful for an audience section [00:21:22] which is useful for an audience section and and and then those this will lead us [00:21:26] and and and then those this will lead us to our second unsupervised so in order [00:21:34] to our second unsupervised so in order to make the mixture of gaussians model a [00:21:38] to make the mixture of gaussians model a bit easier to develop let me just use a [00:21:40] bit easier to develop let me 
just use a one-dimensional example where so let's [00:21:48] one-dimensional example where so let's see so let's say that we gather data set [00:21:51] see so let's say that we gather data set that looks like [00:21:52] that looks like this so it's just one roll number [00:22:05] this so it's just one roll number searches online I've plotted a few dots [00:22:07] searches online I've plotted a few dots um so looks like this day there maybe [00:22:10] um so looks like this day there maybe comes from two gaussians or it looks [00:22:12] comes from two gaussians or it looks like you know there's some data from [00:22:13] like you know there's some data from this Gaussian and there's some data from [00:22:17] this Gaussian and there's some data from that Gaussian on the right um and it's [00:22:21] that Gaussian on the right um and it's and if only we knew right which example [00:22:25] and if only we knew right which example had come from which Gaussian if if we [00:22:28] had come from which Gaussian if if we knew that these examples that come from [00:22:31] knew that these examples that come from Gaussian one we wanted to know with [00:22:33] Gaussian one we wanted to know with crosses and if only we knew what the [00:22:37] crosses and if only we knew what the actually this finally fell over if only [00:22:40] actually this finally fell over if only we knew that these examples that come [00:22:43] we knew that these examples that come from Gaussian to which I'm willing to [00:22:45] from Gaussian to which I'm willing to draw with oles then we just fake calcium [00:22:47] draw with oles then we just fake calcium 1/2 the crosses figure out into the O's [00:22:49] 1/2 the crosses figure out into the O's and then we'd be pretty much done right [00:22:51] and then we'd be pretty much done right oh and sorry and so these are the two [00:22:54] oh and sorry and so these are the two gaussians and so the overall density [00:22:56] gaussians and so the overall density would 
be something like this right [00:22:58] would be something like this right that's the probability of all the party [00:23:01] that's the probability of all the party muscle left while probably must know [00:23:02] muscle left while probably must know very low less probably mass on so the [00:23:08] very low less probably mass on so the overall density just told again would be [00:23:10] overall density just told again would be no high no high something like that [00:23:13] no high no high something like that right but the reason and then if you [00:23:19] right but the reason and then if you actually had these labels if you knew [00:23:20] actually had these labels if you knew that these examples came from gaussian [00:23:22] that these examples came from gaussian one those examples come from gaussian [00:23:24] one those examples come from gaussian two then you can actually use an [00:23:26] two then you can actually use an algorithm very similar to GD a gaussian [00:23:28] algorithm very similar to GD a gaussian difference to fit this model the problem [00:23:32] difference to fit this model the problem with this density estimation problem is [00:23:34] with this density estimation problem is you just see this data and maybe the [00:23:37] you just see this data and maybe the data came from two different gaussians [00:23:39] data came from two different gaussians but you don't know which example [00:23:40] but you don't know which example actually came from which coliseum okay [00:23:42] actually came from which coliseum okay so the e/m algorithm or the expectation [00:23:45] so the e/m algorithm or the expectation maximization algorithm will allow us to [00:23:47] maximization algorithm will allow us to fit a model [00:23:50] fit a model despite not knowing which Gaussian each [00:23:54] despite not knowing which Gaussian each example [00:24:07] so let me first write down the young [00:24:10] so let me first write down the young mixture of gaussians model and 
then [00:24:20] mixture of gaussians model and then we'll describe the EML room for this so [00:24:24] we'll describe the EML room for this so let's imagine let's suppose that there's [00:24:28] let's imagine let's suppose that there's a so the term we sometimes use this [00:24:36] a so the term we sometimes use this latent but latent just means hidden [00:24:40] observed [00:25:39] so so let's imagine that there's some [00:25:44] so so let's imagine that there's some hidden random variable Z and the term [00:25:47] hidden random variable Z and the term latent just means hidden on observe it [00:25:49] latent just means hidden on observe it means that it exists but you don't get [00:25:50] means that it exists but you don't get to see the value directly so I say later [00:25:53] to see the value directly so I say later it just means hidden on observe so let's [00:25:56] it just means hidden on observe so let's imagine that this hidden or latent [00:25:57] imagine that this hidden or latent random variable Z and Xin Z I had this [00:26:02] random variable Z and Xin Z I had this joint distribution and this this this is [00:26:04] joint distribution and this this this is very very similar to the model you saw [00:26:05] very very similar to the model you saw in Gaussian destroyers but Zi is [00:26:09] in Gaussian destroyers but Zi is multinomial with some set of parameters [00:26:11] multinomial with some set of parameters Phi for a mixture of two gaussians this [00:26:14] Phi for a mixture of two gaussians this would just be Bernoulli with two values [00:26:16] would just be Bernoulli with two values but if you're a mixture of K calcium's [00:26:17] but if you're a mixture of K calcium's then Z you know can take on values from [00:26:20] then Z you know can take on values from 1 through K and it was two gaussians it [00:26:25] 1 through K and it was two gaussians it just before nearly and then once you [00:26:28] just before nearly and then once you know that one 
example comes from [00:26:30] know that one example comes from Gaussian number J then X condition that [00:26:34] Gaussian number J then X condition that Zi is equal to J that is drawn from a [00:26:37] Zi is equal to J that is drawn from a Gaussian distribution with some mean and [00:26:39] Gaussian distribution with some mean and some coherence Sigma okay so the two [00:26:44] some coherence Sigma okay so the two unimportant ways this is different than [00:26:46] unimportant ways this is different than GTA one well I set Z to be one of K [00:26:50] GTA one well I set Z to be one of K values instead of one of two values and [00:26:52] values instead of one of two values and GDA god-centered from analysis we had Z [00:26:56] GDA god-centered from analysis we had Z know why the labels Y took on one of two [00:26:58] know why the labels Y took on one of two values and then second is I have Sigma J [00:27:02] values and then second is I have Sigma J instead of Sigma so by convention when [00:27:05] instead of Sigma so by convention when we feed mixture of gaussians models we [00:27:07] we feed mixture of gaussians models we let each gaussian have his own [00:27:08] let each gaussian have his own covariance matrix Sigma we could [00:27:10] covariance matrix Sigma we could actually force it to be the same way you [00:27:11] actually force it to be the same way you want but these are the trivial [00:27:12] want but these are the trivial differences the most significant [00:27:15] differences the most significant difference is that in Gaussian districts [00:27:20] difference is that in Gaussian districts I Y I whereas Y was observed and the [00:27:25] I Y I whereas Y was observed and the main difference between this and [00:27:27] main difference between this and Gaussian disappearing analysis is now we [00:27:29] Gaussian disappearing analysis is now we have replaced that with this latent or [00:27:31] have replaced that with this latent or hidden random variables Z are 
they [00:27:33] hidden random variables Z are they do not get to see in the training set [00:27:34] do not get to see in the training set okay so all right that was better [00:28:03] all right so if we knew the sea-ice [00:28:12] all right so if we knew the sea-ice right then we can use maximum likelihood [00:28:18] right then we can use maximum likelihood estimation right so if only we knew the [00:28:20] estimation right so if only we knew the value of the Z is which we don't but if [00:28:22] value of the Z is which we don't but if only we did then we could use maximum [00:28:24] only we did then we could use maximum likelihood estimation or mo e to [00:28:26] likelihood estimation or mo e to estimate everything you know so we were [00:28:28] estimate everything you know so we were right the log likelihood other [00:28:30] right the log likelihood other parameters equals some log P of X our Zi [00:28:39] parameters equals some log P of X our Zi you know given the parameters right and [00:28:44] you know given the parameters right and then you take the river to set the ders [00:28:46] then you take the river to set the ders equal to zero and you guys did this in [00:28:48] equal to zero and you guys did this in problem set one right and then you find [00:28:50] problem set one right and then you find that Phi J is equal to 1 over m [00:29:22] okay so if only you knew the values of [00:29:25] okay so if only you knew the values of the sea-ice then you could use maximum [00:29:28] the sea-ice then you could use maximum likelihood estimates and this is what [00:29:31] likelihood estimates and this is what you get and this is pretty much the [00:29:32] you get and this is pretty much the formulas actually these two are exactly [00:29:35] formulas actually these two are exactly the formulas we had for Gaussian Tuscon [00:29:38] the formulas we had for Gaussian Tuscon analysis except we'll replace Y with Z [00:29:42] analysis except we'll replace Y with Z and then 
there's some other formula for [00:29:44] and then there's some other formula for Sigma just written in the lecture notes [00:29:46] Sigma just written in the lecture notes but I won't that one right down here [00:29:47] but I won't that one right down here okay um [00:29:50] okay um but the reason we can't use this use [00:29:54] but the reason we can't use this use these formulas we don't actually know [00:29:55] these formulas we don't actually know whether the values of Z so what we will [00:30:00] whether the values of Z so what we will do in the e/m algorithm is two steps in [00:30:13] do in the e/m algorithm is two steps in the first step we will guess the value [00:30:18] the first step we will guess the value of the Z's and in the second step we [00:30:21] of the Z's and in the second step we will use these equations using the [00:30:24] will use these equations using the values of disease we just guessed so let [00:30:27] values of disease we just guessed so let me so sometimes in machine learning [00:30:29] me so sometimes in machine learning something to call this a bootstrap [00:30:31] something to call this a bootstrap procedure where you get something they [00:30:33] procedure where you get something they run an algorithm you're using your [00:30:35] run an algorithm you're using your guesses and then you update your guesses [00:30:37] guesses and then you update your guesses and then run the algorithm okay let me [00:30:38] and then run the algorithm okay let me let me make that concrete by writing [00:30:40] let me make that concrete by writing this down [00:30:54] so the e/m algorithm has two steps the a [00:30:59] so the e/m algorithm has two steps the a step also called the expectation step is [00:31:13] step also called the expectation step is set w IJ so W IJ is going to be the [00:31:30] set w IJ so W IJ is going to be the probability that Zi is equal to J okay [00:31:36] probability that Zi is equal to J okay given all the parameters and and 
much as [00:31:39] given all the parameters and and much as we did with generative learning [00:31:41] we did with generative learning algorithms right with generative [00:31:44] algorithms right with generative learning algorithms we'll use Bayes rule [00:31:45] learning algorithms we'll use Bayes rule to estimate the probability of Y given X [00:31:50] to estimate the probability of Y given X and so to compute this you use a similar [00:31:52] and so to compute this you use a similar Bayes rule type of calculation and so [00:31:56] Bayes rule type of calculation and so disappear [00:32:18] right where for example this term here P [00:32:27] right where for example this term here P of X i given Z I equals J this would be [00:32:30] of X i given Z I equals J this would be a Gaussian density right this comes from [00:32:32] a Gaussian density right this comes from a Gaussian density with mean mu J and [00:32:36] a Gaussian density with mean mu J and covariance Sigma J right and so this [00:32:39] covariance Sigma J right and so this term here would be a 1 over you know 2 [00:32:43] term here would be a 1 over you know 2 pi it's an N over 2 Sigma J and then [00:33:02] pi it's an N over 2 Sigma J and then this term here I guess this would be a [00:33:04] this term here I guess this would be a Phi J that's just a Bernoulli [00:33:06] Phi J that's just a Bernoulli probability remember Z is multinomial [00:33:08] probability remember Z is multinomial right Suzy this multinomial we're [00:33:14] right Suzy this multinomial we're parameters Phi so I guess the parameters [00:33:17] parameters Phi so I guess the parameters v for multinomial distribution tell you [00:33:19] v for multinomial distribution tell you what's the chance of Z B 1 2 3 4 and so [00:33:22] what's the chance of Z B 1 2 3 4 and so on up to K so the chance of Zi being for [00:33:25] on up to K so the chance of Zi being for the K is just this chance of Zi pee [00:33:28] the K is just this chance of Zi pee 
[00:33:30] And p(z = j) is just phi_j, right — you just read it off one of the parameters of your multinomial probability, for all the different values of j. And similarly for the terms of the denominator: this term here is the Gaussian density, and that second term is the multinomial probability that you have for z. And so that's how you plug in all of these numbers and use Bayes' rule — use this equation — to compute, given the positions of all these Gaussians, the chance of z_i taking on a certain value, which is what you store in w_ij.

[00:34:06] And to make this really concrete, remember the sigmoid function from logistic regression: if you scan through the examples, right, the sigmoid gives the chance of the same point being a positive or a negative example. In the same way, w_ij is just the chance of each of these examples coming from either the z = 1 or the z = 0 Gaussian, and you store all of these numbers in the variables w_ij, okay? So w_ij is just the posterior chance of this example coming from the j-th Gaussian. [00:35:09] So that's the E-step, and you compute the w_ij for every single training example. And next, the M-step is this.

[00:35:37] [Student question.] Sorry — this one? Okay. So the E-step tells us, you know — we're trying to guess the values of the z's, right? We figure out the probability of z being one, two, three, four for each training example. And then in the M-step, what we're going to do is use the formulas we have for maximum likelihood estimation, and I want you to compare these with the equations I had above. [00:36:31] So these equations are a lot like the equations above, except that instead of the indicator 1{z_i = j} we've replaced it with w_ij — which, by the way, is the expected value of this indicator function, because the expected value of an indicator function is just equal to the probability of the thing in the middle being true. And then there's a formula for Sigma_j as well, which you can get from the lecture notes, but I won't write it down here, okay?

[00:37:19] So one intuition for this mixture of Gaussians algorithm is that it's a little bit like k-means, but with soft assignments. So in k-means, in the first step, we take each point and just assign it to one of the k cluster centroids, right? And if it was even a little bit closer to the red cluster centroid than the blue cluster centroid, we would just assign it to the red cluster. [00:37:43] So even if one centroid is just a little bit closer than another, k-means makes what's called a hard assignment — meaning, you know, whichever cluster centroid is closest, we just assign the point a hundred percent to that cluster centroid. EM, you can think of as implementing a softer way of assigning points to the different cluster centroids, because instead of just picking the single closest Gaussian center and assigning the point there, it uses these probabilities and gives each point a weighting in terms of how much to assign it to Gaussian one versus Gaussian two. [00:38:15] And then the second step updates the means accordingly, right: sum over all the x_i's, weighted by the extent to which they're assigned to that cluster centroid, divided by the number of examples assigned to that cluster centroid. Okay, so that's one intuition connecting EM and k-means.
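The E-step and M-step just described can be sketched in a few lines of numpy. This is a minimal illustration for a 1-D mixture of two Gaussians, not the lecture's code; the function and variable names (`em_two_gaussians`, `normal_pdf`, the initialization scheme) are my own choices.

```python
import numpy as np

def normal_pdf(x, mu, var):
    # Density of N(mu, var) evaluated at each point of x.
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_two_gaussians(x, n_iter=100):
    # Rough initialization, like randomly initializing centroids in k-means.
    phi = np.array([0.5, 0.5])          # multinomial parameters phi_j
    mu = np.array([x.min(), x.max()])   # component means
    var = np.array([x.var(), x.var()])  # component variances
    for _ in range(n_iter):
        # E-step: w[i, j] = P(z_i = j | x_i; phi, mu, var) by Bayes' rule.
        w = np.stack([phi[j] * normal_pdf(x, mu[j], var[j]) for j in (0, 1)],
                     axis=1)
        w /= w.sum(axis=1, keepdims=True)  # each row sums to 1: soft assignment
        # M-step: the ML formulas with the indicator 1{z_i = j} replaced by w_ij.
        nj = w.sum(axis=0)                 # effective count of points per Gaussian
        phi = nj / len(x)
        mu = (w * x[:, None]).sum(axis=0) / nj
        var = (w * (x[:, None] - mu) ** 2).sum(axis=0) / nj
    return phi, mu, var
```

On well-separated data the recovered means approach the true component means; with overlapping components, convergence to a local optimum of the likelihood is all that's guaranteed.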
[00:38:40] But when you run this algorithm, it turns out that it will converge — with some caveats I'll get to later — and it will find a pretty decent estimate of the parameters, you know, say, fitting a mixture of two Gaussians model. [00:39:00] And so, if you are given a data set of, say, airplane engines, you can run this algorithm to fit the mixture of two Gaussians, and then when a new airplane engine rolls off the assembly line — so after fitting with the EM algorithm, you now have a joint density p(x, z), and so the density for x is just the sum over all the values of z of p(x, z).

[00:39:39] And so a mixture of Gaussians can fit distributions that look like this, and it can fit distributions that look like this — these are both mixtures of two Gaussians — so this gives you a very rich family of models to fit very complicated distributions. [00:39:58] And you can also fit, you know, something like this: this is a mixture of two Gaussians where, I guess, one is a thin, narrow Gaussian and one is a much wider, fatter Gaussian. So a mixture of two Gaussians can fit a lot of different things, and a mixture of more than two Gaussians can fit even richer models. And so, by doing this, you can now model p(x) for many complicated densities, including this one — the example I drew just now. This will allow you to fit a probability density function that puts almost all of the probability mass on a region that looks like this. [00:40:31] And so, when you have a new example, you can evaluate p(x), and if p(x) is large, then you can say, you know, this looks okay; and if p(x) is less than epsilon, you can flag it and say, oh, take another look at this airplane engine, okay?
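The p(x) < epsilon anomaly check can be sketched like this, given mixture parameters (say, from a fitted EM run). The names `mixture_density` and `flag_engine`, and the epsilon value, are illustrative choices of mine, not from the lecture.

```python
import numpy as np

def mixture_density(x, phi, mu, var):
    # p(x) = sum over z of p(x, z) = sum_j phi_j * N(x; mu_j, var_j),
    # i.e. the joint density with z marginalized out.
    comps = [phi[j] * np.exp(-(x - mu[j]) ** 2 / (2 * var[j]))
             / np.sqrt(2 * np.pi * var[j]) for j in range(len(phi))]
    return sum(comps)

def flag_engine(x_new, phi, mu, var, epsilon=1e-4):
    # Flag the example for a second look when its density falls below epsilon.
    return mixture_density(x_new, phi, mu, var) < epsilon
```

For example, with two unit-variance components centered at -5 and 5, a point at 0 sits in a low-density valley and gets flagged, while a point near either mean does not.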
[00:40:50] So, um, I kind of just wrote down this algorithm with a little bit of a hand-wavy explanation of how to derive it, right? I said: if only you knew the values of z, you could just use maximum likelihood estimation — so let's guess the values of z and then plug those into the formulas for maximum likelihood estimation. It turns out that hand-wavy explanation works in the particular case of mixtures of Gaussians, but there is a more formal way of deriving the EM algorithm that shows that this is a maximum likelihood estimation algorithm, and that it converges, at least to a local optimum. [00:41:29] And in particular, what we'll do is show that if you are given a model p(x, z) parameterized by theta, and your goal is to maximize p(x), right — so this is what maximum likelihood is supposed to do — then EM is exactly trying to do that, okay? So in a minute I'll present this more general derivation — the full, more rigorous derivation of the EM algorithm [00:42:06] that doesn't rely on this hand-wavy argument of guessing the z's and using maximum likelihood with those guessed values. So I'll do the rigorous derivation of EM in a minute, but before I do that, let me just pause and check if there are any questions.

[00:42:41] [Student: maybe it would help to not think of them as weights?] Let's see — I think this actually is the weighting you assign to a certain Gaussian, so that's one intuition, and hence "weights." But one way to think of this is that w_ij is how much x_i is assigned to — [00:43:28] so w_ij is the strength of how strongly you want to assign that training example x_i to that cluster, or to that particular Gaussian. And so this is a number between 0 and 1, right, and every point is assigned with a total strength equal to 1, because all these probabilities must sum up to 1.
[00:43:50] So when I take this point and assign it, you know, 0.8 to the closer Gaussian and 0.2 to the more distant one, this is our guess that, well, there's an 80% chance it came from that Gaussian and a 20% chance it came from the second Gaussian. Does this make sense?

[00:44:10] Oh, I see. So, let's see — when you're running the EM algorithm, you never know the true values of z, all right? You're given the data set, so you're only told the x's. And suppose we know these airplane engines were generated from, you know, two different Gaussians — maybe there are two separate assembly processes, one from plant number one and one from plant number two, and maybe they actually operate a little bit differently. But by the time the two supplies of aircraft engines get to you, [00:44:42] they've been mixed together, and so you can't tell anymore which aircraft engine came from plant one and which aircraft engine came from plant two. You don't even know there are two plants — you just see the stream of aircraft engines, and you're hypothesizing there are two types. And so, in every iteration of EM, you're taking each aircraft engine and guessing: you know, for this one, I think there's an 80% chance it came from process one and a 20% chance it came from process two — so that's the E-step. [00:45:12] And then in the M-step, you look at all the engines that you're kind of guessing were generated by process one, and you update your Gaussian to be a better model for all of the things that you kind of think were generated by process one. And if there's something that you're absolutely sure came from process one, then it has a weight of one, or close to one; and if there's something that you think had only a 10% chance of coming from process one, then that example is given a lower weight when you update the mean for that Gaussian.
given a lower weight and how [00:45:38] example is given a lower weight and how you update the meaning for that [00:45:44] all right so [00:46:31] well I still remember when I was an [00:46:34] well I still remember when I was an undergrad doing a summer internship at [00:46:36] undergrad doing a summer internship at AT&T Bell Labs and then someone the few [00:46:39] AT&T Bell Labs and then someone the few offices down had learned about diem for [00:46:41] offices down had learned about diem for the mixture of gaussians her first time [00:46:42] the mixture of gaussians her first time was running on his computer and he's [00:46:44] was running on his computer and he's going around to every single office [00:46:46] going around to every single office saying oh my god you gotta check this [00:46:48] saying oh my god you gotta check this out this is unbelievable look at what [00:46:49] out this is unbelievable look at what this elephant can do Tiffany makes is a [00:46:51] this elephant can do Tiffany makes is a Gaussian so it shows you those other [00:46:54] Gaussian so it shows you those other people I hang out with all right um so [00:47:06] people I hang out with all right um so in order to derive you know so slightly [00:47:09] in order to derive you know so slightly hand wavy arguments that oh let's get to [00:47:11] hand wavy arguments that oh let's get to let's guess the values of the Z's let's [00:47:13] let's guess the values of the Z's let's just have these ways and plug them into [00:47:14] just have these ways and plug them into maximum likelihood um what I like to do [00:47:17] maximum likelihood um what I like to do is give a more rigorous derivation for [00:47:20] is give a more rigorous derivation for ye M algorithm is a reasonable algorithm [00:47:22] ye M algorithm is a reasonable algorithm and Y is a massive likely estimation [00:47:25] and Y is a massive likely estimation algorithm and why we can expect it to [00:47:26] algorithm and why we can 
expect it to converge and it turns out there rather [00:47:29] converge and it turns out there rather than just proving you know that this is [00:47:31] than just proving you know that this is a sound algorithm what we'll see on [00:47:33] a sound algorithm what we'll see on Wednesday is that this view of p.m. [00:47:35] Wednesday is that this view of p.m. allows us to derive em in a in a more [00:47:39] allows us to derive em in a in a more correct way for other models as well [00:47:41] correct way for other models as well they make sense of gaussians on [00:47:42] they make sense of gaussians on Wednesday we'll talk about a model [00:47:46] Wednesday we'll talk about a model called factor analysis unless you model [00:47:48] called factor analysis unless you model gaussians an extremely high dimensional [00:47:49] gaussians an extremely high dimensional spaces where if you have a thousand [00:47:51] spaces where if you have a thousand dimensional data but only thirty [00:47:52] dimensional data but only thirty examples how do you for the girls into [00:47:54] examples how do you for the girls into that so we talked about that on [00:47:55] that so we talked about that on Wednesday and it turns out this [00:47:57] Wednesday and it turns out this derivation that yeah we're gonna go [00:47:58] derivation that yeah we're gonna go about through now is crucial for [00:48:01] about through now is crucial for applying M accurately in problems like [00:48:05] applying M accurately in problems like that so in order to lead up to that [00:48:10] that so in order to lead up to that derivation let me describe Jensen's [00:48:14] derivation let me describe Jensen's inequality so let F be a convex function [00:48:25] to do yeah we're actually going to need [00:48:27] to do yeah we're actually going to need concave functions so be all - of [00:48:29] concave functions so be all - of everything but what gets it done in a [00:48:31] everything but what gets it done in a second 
but so a convex function means [00:48:39] second but so a convex function means the second derivative is greater than 0 [00:48:42] the second derivative is greater than 0 or in other words it looks like that [00:48:43] or in other words it looks like that right so that's a convex function that X [00:48:48] right so that's a convex function that X be a random variable then F of the [00:48:59] be a random variable then F of the expected value of x is less than equal [00:49:01] expected value of x is less than equal to the expected value of x [00:49:25] maybe young here's an example right so [00:49:32] maybe young here's an example right so here's a let's see that's the function f [00:49:38] here's a let's see that's the function f of X and let's say that these are the [00:49:40] of X and let's say that these are the values 1 2 3 4 5 and suppose that X is [00:49:47] values 1 2 3 4 5 and suppose that X is equal to 1 with probability 1/2 is equal [00:49:53] equal to 1 with probability 1/2 is equal to 5 probably just an illustration then [00:50:03] here is the F of 1 here is F of 5 here [00:50:15] here is the F of 1 here is F of 5 here is f of 3 and F of 3 is f of the [00:50:19] is f of 3 and F of 3 is f of the expected value of x right because so the [00:50:22] expected value of x right because so the expected value of x and sometimes I [00:50:25] expected value of x and sometimes I write 2 so called the square brackets [00:50:27] write 2 so called the square brackets it's the average of X is equal to 3 and [00:50:30] it's the average of X is equal to 3 and so the expected value seems to be F of [00:50:34] so the expected value seems to be F of the expected value of x is equal to this [00:50:37] the expected value of x is equal to this value whereas the expected value of f of [00:50:42] value whereas the expected value of f of X is the mean of F of 1 and F of 5 right [00:50:53] X is the mean of F of 1 and F of 5 right so the expected value of f of X f of X [00:50:55] so 
the expected value of f of X f of X is a 50% chance of being F of 1 and a [00:50:57] is a 50% chance of being F of 1 and a 50% chance of being a 4/5 and so the [00:51:01] 50% chance of being a 4/5 and so the expected value of f of X is equal to [00:51:03] expected value of f of X is equal to this value in the middle let's really [00:51:05] this value in the middle let's really take these two take this value and this [00:51:08] take these two take this value and this value and take the mean so it's this [00:51:09] value and take the mean so it's this value up here and and this value [00:51:14] expensive value and so in this example [00:51:19] expensive value and so in this example the expected value of f of X is greater [00:51:22] the expected value of f of X is greater than F of the expected value of x as [00:51:25] than F of the expected value of x as predicted by Jensen's inequality I'm [00:51:28] predicted by Jensen's inequality I'm going to just draw one illustration that [00:51:30] going to just draw one illustration that may or may not help is some of my [00:51:31] may or may not help is some of my friends like it I sometimes use it but [00:51:33] friends like it I sometimes use it but it was confusing then don't worry about [00:51:34] it was confusing then don't worry about it but it turns out that if you draw a [00:51:37] it but it turns out that if you draw a line that connects these two then the [00:51:40] line that connects these two then the midpoint of this line is the height of F [00:51:43] midpoint of this line is the height of F of expected value of x right so the [00:51:46] of expected value of x right so the height of this you know so given these [00:51:48] height of this you know so given these two points this point in this point if [00:51:50] two points this point in this point if you draw this line it's called a chord [00:51:52] you draw this line it's called a chord then the height of this point is [00:51:57] then the height of this point is 
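The two-point example can be checked numerically. The lecture leaves f generic; f(t) = t² here is my own choice of a concrete convex function (f'' = 2 > 0) to make the numbers explicit.

```python
import numpy as np

# X = 1 with probability 1/2, X = 5 with probability 1/2, as in the example.
x_vals = np.array([1.0, 5.0])
p = np.array([0.5, 0.5])

def f(t):
    return t ** 2  # a concrete convex function: f'' = 2 > 0

f_of_E = f(np.dot(p, x_vals))   # f(E[X]) = f(3) = 9
E_of_f = np.dot(p, f(x_vals))   # E[f(X)] = (f(1) + f(5)) / 2 = 13

assert f_of_E <= E_of_f         # Jensen: f(E[X]) <= E[f(X)]
```

Here f(E[X]) = 9 while E[f(X)] = 13: the chord's midpoint sits strictly above the curve, matching the picture.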
[00:52:06] expected value of f(X), and this point down on the curve is f of the expected value of X. And in any convex function — really, take any convex function, also called a bowl-shaped function — if you draw any chord, its midpoint is always higher [00:52:30] than that green point on the curve, which is another way of seeing why Jensen's inequality holds true, okay? If this visualization doesn't help, don't worry about it. But actually, what a lot of my friends and I do — you know, we keep forgetting which direction Jensen's inequality goes, which is not great — so all of my friends who don't remember draw this picture, draw that chord, and then we can quickly figure out which way the inequality goes.

[00:53:21] All right, so one addendum: if, further, the second derivative is strictly greater than zero, then we say f is strictly convex. [00:53:52] So, let's see — a straight line is also a convex function, right? It satisfies the first condition; it turns out a straight line is also a convex function. But what this addendum is saying is that if f is a strictly convex function — meaning, roughly, that it's not a straight line, that its curvature is always bending upward — then the only way [00:54:15] for the left- and right-hand sides to be equal is if X is a constant, meaning it's a random variable that always takes on the same value, okay? So Jensen's inequality says — sorry, I think I reversed the order of those two in that equation, but that doesn't matter — Jensen's inequality says the left-hand side is always less than or equal to the right-hand side, and the only way they're equal is if X, you know, is a random variable that always takes on the same value.
[00:54:58] [Student question.] So it turns out — what if the function has a flat part, but X nonetheless does vary? So, let's see: one way that could happen would be if the function were like this, and then if you draw the chord, its midpoint is no higher than this point. [00:55:22] If you have a flat part here, then the function is not strictly convex, and so you still have less-than-or-equal-to, but equality can hold even when X is random. So, um — and we'll actually end up using this in a little bit. And again, for the strict case, properly stated — you know, for those of you that have taken classes in advanced probability, the technical way of saying "X is a constant" is "X is equal to E[X] [00:55:52] with probability one." [00:56:01] You know, I think for all practical human purposes you do not need to worry about this, but if you've taken a course in measure theory, the professor in measure theory will be happy if you say it that way when you say X is a constant. But okay — don't worry about it.

[00:56:21] Okay, now, um, just one more addendum to this: the form of Jensen's inequality we're going to use is actually the form for a concave function. So instead of convex, I'm going to say concave. And, you know, a concave function is just the negative of a convex function, right — if you take a convex function and take the negative of that, it becomes concave — and so the whole thing works with everything flipped around the other way. [00:57:01] Okay, so the form of Jensen's inequality we're going to use is actually the concave form, and we're actually going to apply it to the log function. So the log function, right — log x looks like this — that's a concave function, and so the inequality we'll use goes in this direction. All right.
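A quick numeric check of the flipped, concave form with the log function, reusing the same two-point random variable as the earlier convex example (the variable names are mine):

```python
import numpy as np

# X = 1 with probability 1/2, X = 5 with probability 1/2.
x_vals = np.array([1.0, 5.0])
p = np.array([0.5, 0.5])

log_of_E = np.log(np.dot(p, x_vals))  # log(E[X]) = log 3
E_of_log = np.dot(p, np.log(x_vals))  # E[log X] = (log 1 + log 5) / 2

assert E_of_log <= log_of_E           # concave form: E[log X] <= log(E[X])
```

With log the chord now lies below the curve, so the inequality direction is reversed relative to the convex case, which is exactly the form the EM derivation will use.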
so just the density estimation problem [00:57:59] so just the density estimation problem meaning density estimation means you [00:58:02] meaning density estimation means you want to estimate P of X all right so we [00:58:04] want to estimate P of X all right so we have a model of a P of X comma Z with [00:58:13] have a model of a P of X comma Z with parameters theta and so you know instead [00:58:16] parameters theta and so you know instead of writing out Mu Sigma Nu Sigma Phi [00:58:21] of writing out Mu Sigma Nu Sigma Phi like we did for the mixture of gaussians [00:58:22] like we did for the mixture of gaussians I'm just gonna capture all the [00:58:24] I'm just gonna capture all the parameters you have whatever your [00:58:25] parameters you have whatever your parameters are obviously capture them in [00:58:27] parameters are obviously capture them in one variable theta and you only observe [00:58:34] one variable theta and you only observe thanks so your training set looks like [00:58:40] thanks so your training set looks like that so the UM log likelihood of the [00:58:48] that so the UM log likelihood of the parameters theta is equal to some of [00:58:53] parameters theta is equal to some of your training examples log hearings i [00:58:57] your training examples log hearings i franchised by theta and this in turn is [00:59:03] franchised by theta and this in turn is log of sum over Z P of X I see I [00:59:13] franchise by theta right because P of X [00:59:20] franchise by theta right because P of X you know is just taking the Joint [00:59:22] you know is just taking the Joint Distribution and summary notes [00:59:24] Distribution and summary notes marginalizing out Zi [00:59:28] and so what we want is maximum [00:59:34] and so what we want is maximum likelihood estimation which is to find [00:59:36] likelihood estimation which is to find the value of theta that maximizes is [00:59:42] the value of theta that maximizes is long likelihood and what well like 
And what we'd like to do is derive — we're now going to derive — an algorithm, which will turn out to be the EM algorithm, an iterative algorithm for finding the maximum likelihood estimate of the parameters theta. [01:00:05] So let me draw a picture that you can keep in mind as we go through the math. The horizontal axis is the space of possible values of the parameters theta, and there's some function l(theta) that you're trying to maximize. [01:00:35] And so what EM does is this: you initialize theta at some value, maybe randomly initialized — so similar to k-means clustering, where we just, you know, randomly initialized the mu's for the mixture of Gaussians. What the EM algorithm does in the E-step is construct a lower bound, shown in green here, for the log-likelihood, and this lower bound — this green curve — has two properties. One is that it is a lower bound: everywhere you look, over all values of theta, the green curve lies below the blue curve, so it's a lower bound. And the second property the green curve has is that it is equal to the blue curve at the current value of theta. So what the E-step does — which you'll see later on; just keep this picture in mind as we go through the E-step of EM — is, um, it'll construct a lower bound that looks like this, right. Oh, and also, to foreshadow a part of the derivation: there was that addendum to Jensen's inequality, where we said that under certain conditions it holds with equality — if X is a constant, then f(E[X]) = E[f(X)], the two things are equal. We want things to be equal — we want the green curve to be equal to the blue curve at the old value of theta — so we'll use that addendum to Jensen's inequality when we get to that. So that's the E-step: draw the green curve. [01:02:17] And then what the M-step does is take the green curve and find its maximum, and one step of EM will then move theta from this value to this value, okay. So the E-step constructs the green curve and the M-step finds the maximum of the green curve, and that's one iteration of EM. On the second iteration of EM, now that you're at this red point, it will construct a new lower bound — a red curve, say; again, everywhere the red curve is below the blue curve, and the two values are equal at this new value of theta; that's the E-step — and the M-step will maximize this red curve, and so on. Now you're here, construct another bound, maximize it, and you can kind of tell that if you keep running EM, it is constantly trying to increase l(theta), trying to increase the log-likelihood, until it converges to a local optimum.
The EM algorithm does converge, but only to a local optimum — so if there was another, even bigger peak over there, it may never find its way over to that other, better optimum. But the EM algorithm, by repeatedly doing this, will hopefully converge to a pretty good local optimum. All right, so that's roughly how we do that. [01:04:10] So, I've already said that our goal is to find the parameters theta that maximize this. And the equation we wrote just now is l(theta) = sum over i of log of sum over z^(i) of p(x^(i), z^(i); theta) — okay, so this is just what we had written down, I guess, on the left. What I'm going to do next is multiply and divide by the same thing: l(theta) = sum over i of log of sum over z^(i) of Q_i(z^(i)) * [ p(x^(i), z^(i); theta) / Q_i(z^(i)) ], [01:05:11] where Q_i(z^(i)) is a probability distribution — i.e., sum over z^(i) of Q_i(z^(i)) equals one. So we're multiplying and dividing by some probability distribution — we'll decide later how to come up with this distribution Q_i — but, you know, I'm allowed to construct a probability distribution and multiply and divide by the same thing, right. [01:05:50] Now if you look at this — all right, let's put square brackets here — if these Q_i's are a probability distribution, meaning that sum over z^(i) of Q_i(z^(i)) sums to one, then the thing inside the log is equal to an expected value, with z^(i) drawn from the Q_i distribution. Let me use colors to make this clearer. [01:06:42] Right, so the way you compute the expected value of, you know, some function of z is you sum, over all the possible values of z^(i), the probability of z^(i) times what that function is. So this equation is just l(theta) = sum over i of log of the expectation, with respect to z^(i) drawn from the Q_i distribution, of the thing in the purple square brackets — of p(x^(i), z^(i); theta) / Q_i(z^(i)). [01:07:10] Now, using the concave form of Jensen's inequality, we have that this is greater than or equal to sum over i of the expectation, over z^(i) drawn from Q_i, of log [ p(x^(i), z^(i); theta) / Q_i(z^(i)) ]. So this is the form of Jensen's inequality where f(E[X]) >= E[f(X)], where here f is the logarithm — the log function is a concave function, it looks like that — and so, using, I guess, the form of Jensen's inequality with the signs reversed, f(E[X]) >= E[f(X)], you get that the log of an expectation is greater than or equal to the expectation of the log. [01:08:35] And then finally, let me just take this expectation and unpack it one more time: this is now sum over i of sum over z^(i) of Q_i(z^(i)) log [ p(x^(i), z^(i); theta) / Q_i(z^(i)) ] — so I just took that expected value and turned it back into a sum over the random variable's probabilities times that thing. Okay, so if you remember the picture from the middle, what we wanted to do was construct a function — construct this green curve that's a lower bound for the blue curve. And if you view this formula here as a function of theta — right, so your x's are just your data, and z is a variable you sum over — then this whole thing is a function of theta.
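The inequality just derived — the lower bound never exceeds the log-likelihood, for any choice of the Q_i's — can be spot-checked numerically. A sketch of mine, with hypothetical data and an arbitrary random choice of Q:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Hypothetical 1-D two-component mixture and data.
x = np.array([-2.1, -1.9, 1.8, 2.2])
phi, mu, sigma = np.array([0.5, 0.5]), np.array([-2.0, 2.0]), np.array([1.0, 1.0])

joint = phi * gauss_pdf(x[:, None], mu, sigma)   # p(x_i, z = j; theta), shape (n, k)
log_lik = np.sum(np.log(joint.sum(axis=1)))      # sum_i log sum_z p(x_i, z; theta)

rng = np.random.default_rng(1)
Q = rng.random(joint.shape)
Q /= Q.sum(axis=1, keepdims=True)                # each Q_i is a distribution over z

# Jensen lower bound: sum_i sum_z Q_i(z) log( p(x_i, z; theta) / Q_i(z) )
elbo = np.sum(Q * np.log(joint / Q))
print(elbo, log_lik)
assert elbo <= log_lik + 1e-12                   # the bound holds for any Q
```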
So this whole formula here is a function of the parameters theta, and what we're showing is that this formula is a lower bound for the log-likelihood l(theta). [01:10:07] [Student question] Oh, how did we get to this? Sure, sure. So let's say that z takes on values one through ten, right — say z is a ten-sided die — and I want to compute, you know, the expected value of some function g of z. Then the expected value of g(Z) is the sum, over all the possible values z, of the probability that Z takes that value, times g(z) — right, so that's the expected value of a function of a random variable. And similarly, the expected value of Z is sum over z of P(Z = z) times z — right, that's the average of the random variable.
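The ten-sided-die example can be written out directly; this tiny check is mine, with g(z) = z**2 chosen arbitrarily as the function:

```python
import numpy as np

# E[g(Z)] = sum_z P(Z = z) g(z) for a fair ten-sided die.
z = np.arange(1, 11)
p = np.full(10, 0.1)          # P(Z = z) = 1/10 for each face

e_g = np.sum(p * z**2)        # E[Z^2] = 38.5
e_z = np.sum(p * z)           # E[Z]   = 5.5, the plain average of the die
print(e_g, e_z)
```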
And in the notation we have here, the probability of z taking on different values is denoted Q_i(z), which is why we wind up with that formula. Does that make sense? Okay. [01:11:33] All right — if one of these steps doesn't make sense, then, you know... any other questions? Okay. All right. [01:12:04] Now, one of the things we want when constructing this green lower bound is for that green lower bound to be equal to the blue function at this point, right — this is actually how you guarantee that when you optimize the green function, by improving on the green function you're improving on the blue function. So we want this lower bound to be tight, right — meaning the two functions are equal, or tangent to each other. So, in other words, we want this inequality to hold with equality: we want the left-hand side and the right-hand side to be equal at the current value of theta. [01:13:03] So on a given iteration, with the current parameters equal to theta, we want — I know this is a lot of math, but, you know, we want the left- and right-hand sides to be equal to each other, because that's what it means for the lower bound to be tight, for the green curve to be exactly touching the blue curve as we construct that lower bound. [01:14:11] And so for this to be true, we need the random variable inside to be a constant: we need p(x^(i), z^(i); theta) / Q_i(z^(i)) to be equal to a constant, meaning that no matter what value of z^(i) you plug in, this should evaluate to the same value — you know, in other words, the ratio between the numerator and the denominator must always be the same. And fortunately, so far we have not yet specified how we'll choose this distribution for z^(i), right — so far the only constraint we have is that Q_i has to be a probability distribution over z^(i). We could choose whatever distribution we want for z^(i), and it turns out that
we can set Q_i(z^(i)) to be proportional to p(x^(i), z^(i); theta). And this means that for any value of z — you know, where z indicates whether the example is from Gaussian one or Gaussian two, right — the chance of z^(i) taking on one or two is proportional to this. And I don't want to prove it — this is proven in the lecture notes — but it turns out that the Q_i's need to sum to one, so one way to ensure that this is proportional to the right-hand side is to just take the right-hand side. So, let's see. [01:16:16] Right, so the Q_i's have to sum to one, and so one way to ensure the proportionality is to just take the right-hand side and normalize it so it sums to one. And after a couple of steps that, honestly, I don't want to do here, you can show that this results in setting Q_i(z^(i)) equal to the posterior probability p(z^(i) | x^(i); theta), okay. And so — sorry, I skipped a couple of steps here, which you can get from the lecture notes — but it turns out that if you want this to be constant, meaning whether you plug in z^(i) = 1 or z^(i) = 2 or whatever, it evaluates to the same constant, the only way to do that is to make sure the numerator and the denominator are proportional to each other. And because Q_i(z^(i)) is a density that must sum to one, one way to make it
proportional is to just set it to the right-hand side, normalized to sum to one, okay — and we derive this a little bit more carefully in the lecture notes. [01:17:36] So, just to summarize, this gives us the EM algorithm — let's take everything we were just doing and wrap it up into an algorithm. In the E-step, we're going to set Q_i(z^(i)) equal to p(z^(i) | x^(i); theta) — and previously these were the w_ij's, right; previously we stored these probabilities in the variables we called the w_ij's. And then in the M-step, we're going to take that lower bound that we constructed — which is this function, sum over i of sum over z^(i) of Q_i(z^(i)) log [ p(x^(i), z^(i); theta) / Q_i(z^(i)) ] — and maximize it with respect to theta. Okay, and so remember, in the E-step we constructed this thing on the right-hand side, which is a lower bound for the log-likelihood, and so for a fixed value of Q you can maximize this with respect to theta, and that updates theta — you know, maximizing the green lower bound, that's what the M-step does.
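The two steps just summarized can be sketched end to end for a 1-D mixture of two Gaussians. This is my own minimal illustration — the data, initial values, and variable names are made up, and the closed-form M-step updates are the standard mixture-of-Gaussians ones the lecture alludes to — and it also checks that each EM iteration never decreases l(theta):

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Synthetic data drawn from two Gaussians, and a deliberately poor initialization.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])
phi, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

def log_likelihood():
    return np.sum(np.log((phi * gauss_pdf(x[:, None], mu, sigma)).sum(axis=1)))

prev = -np.inf
for _ in range(50):
    # E-step: w_ij = Q_i(z_i = j) = p(z_i = j | x_i; phi, mu, sigma), by Bayes' rule.
    joint = phi * gauss_pdf(x[:, None], mu, sigma)
    w = joint / joint.sum(axis=1, keepdims=True)
    # M-step: maximize the lower bound in closed form (standard GMM updates).
    phi = w.mean(axis=0)
    mu = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)
    sigma = np.sqrt((w * (x[:, None] - mu) ** 2).sum(axis=0) / w.sum(axis=0))
    cur = log_likelihood()
    assert cur >= prev - 1e-9   # EM never decreases the log-likelihood
    prev = cur

print(phi, mu, sigma)           # should approach ~[0.4, 0.6], ~[-2, 3], ~[1, 1]
```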
And if you iterate these two steps, then you'll find that this should converge to a local optimum. Okay. Oh, and there's maybe one obvious question: why don't we just try to maximize over theta directly — why don't we try to maximize the log-likelihood directly? It turns out that if you take the mixture of Gaussians model and try to take derivatives of this and set the derivatives equal to zero, there's no known way to solve in closed form for the value of theta that maximizes the log-likelihood. But you'll find that for the mixture of Gaussians model — and for many models, including factor analysis, which we'll talk about on Wednesday — if you actually plug in the Gaussian density, if you actually plug in the mixture of Gaussians model for p, and, you know, take the derivatives, set them equal to zero, and solve, you will be able to find an analytic solution for maximizing this M-step, and it'll be exactly what we had worked out. [01:19:52] Okay, but so this derivation shows that the EM algorithm, you know, is a maximum likelihood estimation algorithm, with the optimization solved by constructing lower bounds and optimizing those bounds. Okay, all right, that's it for today — the midterm only covers up to here, right, so this stuff will be on the midterm, but what we talk about on Wednesday on factor analysis will not. Okay, so let's break for today, and I'll see you guys on Wednesday.

================================================================================
LECTURE 015
================================================================================
Lecture 15 - EM Algorithm & Factor Analysis | Stanford CS229: Machine Learning
Andrew Ng - Autumn 2018
Source: https://www.youtube.com/watch?v=tw6cmL5STuY
---
Transcript
[00:00:03] All right, hey everyone, welcome back. So what we'll see today is additional elaborations on the EM — the expectation-maximization — algorithm. And so what you'll see today is: we'll go over, you know, a quick recap of what we talked about with EM on Monday, and then
describe how you can monitor whether EM is converging. [00:00:39] And on Monday we talked about the mixture of Gaussians model and started deriving EM for that, and what I'll do is take these two equations and map them back to, specifically, the E and M steps that you saw for the mixture of Gaussians model — to see exactly how these map to, you know, updating the weights w_ij and so on, and how you should derive the M-step. And then most of what I'll spend today talking about is a model called the factor analysis model, and this is a model useful for data that can be very high-dimensional, even when you have very few training examples. So what I want to do is talk a bit about properties of Gaussian distributions, then describe the factor analysis model — some more about Gaussian distributions — and then we'll derive EM for the factor analysis model. [00:01:31] And I wanted to talk about factor analysis for two reasons, I guess. One is, you know, it's a useful algorithm in its own right, and second, the derivation of EM for factor analysis is actually one of the trickier ones, and there are key steps in how you actually derive the E and M steps that I think you learn better — or master better — by going through the factor analysis example. Okay. [00:01:56] Um, so just to recap: last Monday — or, on Monday — we talked about the EM algorithm, and we wound up figuring out this E-step and this M-step. Remember that if this is the log-likelihood that you're trying to maximize, what the E-step does is construct a lower bound — this is a function of theta, so this thing on the right-hand side is a function of the parameters theta. What we proved last time was that that function is a lower bound of the log-likelihood, right, and depending on what you choose for Q, you get different lower bounds. So for
one choice of Q you might get this little bound, for a different choice of Q you might get that lower bound, and for yet another choice you might get that lower bound. And what the E-step does is use Q to get a lower bound that is tight — that just touches the log-likelihood at the current value of theta — and what the M-step does is choose the parameters theta that maximize that bound. All right, so that's the EM algorithm that we saw. [00:02:59] Now, um, I want to step through how you would take this, you know, slightly abstract mathematical definition of EM and derive a concrete algorithm that you would implement, right — in, you know, in Python. And so let's just step through this for the mixture of Gaussians model. So for the mixture of Gaussians model, we had a model p(x^(i), z^(i)) = p(x^(i) | z^(i)) p(z^(i)), and the model was that z is multinomial with some set of parameters phi. Oh, and so, you know, the
probability that z^(i) = j is equal to phi_j, right? So phi is just a vector of numbers that sum to one, specifying what's the chance of z being each of the k possible discrete values. And then we have that x^(i), given z^(i) equals j, is Gaussian with some mean mu_j and covariance Sigma_j. And what we said last time was that, um, this is a lot like the Gaussian discriminant analysis model, and the trivial difference is that this is Sigma_j instead of Sigma — right, Gaussian discriminant analysis had the same Sigma for every class — but that's not the key difference. The key difference is that in this density estimation problem z is not observed; z is a latent random variable, which is why we have all this machinery of EM. [00:04:44] So now that you have this model, this is how you would derive the E and the M steps, right? So the E-step is, you know, you have Q_i of z^(i), right, but let me just write this as
Q_i(z^(i) = j). This is sort of the probability of z^(i) equals j — I know this notation is a little bit strange, but under the Q_i distribution, what do you want the chance of z being equal to j to be, right? And so in the E-step you set that to p(z^(i) = j | x^(i)), parameterized by all of the parameters, and we actually saw with Bayes' rule, right, how you would work this out, okay? And what we do in the E-step is solve for this number, which is what we wrote as w^(i)_j last time, okay? [00:05:54] And so, you remember, if you have a mixture of two Gaussians — maybe that's the first Gaussian and that's the second Gaussian — and you have an example x^(i) here, it looks like it's more likely to have come from the first than the second Gaussian, and so this would be reflected in w^(i)_j: that example is assigned more to the first Gaussian than to the second Gaussian. So what you implement in code is, you know, you write
code to compute this number and store it in w^(i)_j. [00:06:34] And then for the M-step, you will want to maximize, over the parameters of the model — right, phi, mu, and Sigma, these are the parameters of the mixture of Gaussians — a sum over i and a sum over the z^(i). And the way you actually derive this is you write this as a sum over i — z^(i), you know, takes on certain discrete values, so you turn the z^(i) into a j; z^(i) can be, I guess, one or two if you have a mixture of two Gaussians — so you sum over all the indices of the different clusters, of w^(i)_j times the log of a ratio. The numerator is going to be the Gaussian density times phi_j — that's the numerator — and so, you know, this term is equal to this first Gaussian term times that second term, right, because this term is p(x^(i) | z^(i); parameters) and this is just Q. And then you take this and divide it by w^(i)_j, okay? So I'm going to step you through the steps you would go through.
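The E-step just described — use Bayes' rule to compute this posterior and store it as w^(i)_j — can be sketched in a few lines of Python. This is a minimal 1-D illustration with function names of my own choosing, not code from the course (the lecture's model uses multivariate Gaussians with full covariance matrices):

```python
import math

def gaussian_pdf(x, mu, var):
    # Density of a 1-D Gaussian N(mu, var) at the point x.
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def e_step(xs, phi, mus, variances):
    # For each example x^(i), compute w[i][j] = p(z^(i) = j | x^(i); theta)
    # by Bayes' rule: p(x | z = j) p(z = j) / sum_l p(x | z = l) p(z = l).
    w = []
    for x in xs:
        joint = [gaussian_pdf(x, mus[j], variances[j]) * phi[j]
                 for j in range(len(phi))]
        total = sum(joint)  # this is p(x^(i); theta)
        w.append([p / total for p in joint])
    return w
```

Each row of `w` sums to one, since its entries are posterior probabilities over the k possible values of z^(i).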
If you're deriving, um, using that E-step and M-step we wrote up above — if you're deriving this for the mixture of Gaussians model — these are the steps of algebra. [00:08:56] So in order to perform this maximization, what you will do is, you want to maximize this formula, right, this big double summation, with respect to each of the parameters phi, mu, and Sigma. And so what you would do is, you know, take this big formula, right, and take the derivatives with respect to each of the parameters. So you take the derivative with respect to mu_j of that big formula on the left, set it to zero, right — and then it turns out, if you do this, you will derive that mu_j should be equal to the sum over i of w^(i)_j x^(i), divided by the sum over i of w^(i)_j. And this is what we said is how you update the means mu, right? The w^(i)_j's are the strength with which x^(i) — so w^(i)_j is, informally, the strength
with which x^(i) is assigned, right, to Gaussian j; and more formally, this is really p(z^(i) = j | x^(i); parameters). And so you end up with this formula — but the rigorous way to show that this is the right formula for updating mu_j is to look at this objective, take the derivative, set it equal to zero to maximize, and therefore derive that equation for mu_j, you know, by solving for the value of mu_j that maximizes this expression. And similarly, you know, you take derivatives of this thing with respect to phi and set them to zero, and take derivatives of this thing with respect to Sigma and set those to zero, and that's how you would derive the update equations in the M-step for phi and for Sigma as well, okay? [00:11:15] Um, and so, for example, when you do this, you find that the optimal value for phi_j is 1/m times the sum over i of w^(i)_j — we had this near the start of Monday's lecture as well, okay.
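In code, these M-step updates come out as simple weighted averages. Again a minimal 1-D sketch with illustrative names of my own (the course's model would use vector means and covariance matrices rather than scalar means and variances); `w` holds the E-step quantities w^(i)_j:

```python
def m_step(xs, w):
    # Re-estimate phi, the means, and the variances of a 1-D mixture of
    # Gaussians from the E-step weights w[i][j].
    m = len(xs)        # number of examples
    k = len(w[0])      # number of mixture components
    # phi_j = (1/m) * sum_i w[i][j]
    phi = [sum(w[i][j] for i in range(m)) / m for j in range(k)]
    mus, variances = [], []
    for j in range(k):
        wj = sum(w[i][j] for i in range(m))
        # mu_j = sum_i w[i][j] x^(i) / sum_i w[i][j]
        mu_j = sum(w[i][j] * xs[i] for i in range(m)) / wj
        # var_j is the corresponding weighted average squared deviation
        var_j = sum(w[i][j] * (xs[i] - mu_j) ** 2 for i in range(m)) / wj
        mus.append(mu_j)
        variances.append(var_j)
    return phi, mus, variances
```

With hard 0/1 weights this reduces to ordinary per-cluster means and variances, which is a quick sanity check on the formulas.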
[00:11:38] Um, so this is the process of how you would take the E-step and M-step I wrote up and apply them to a specific model, such as the mixture of Gaussians model, and that's how you, you know, solve for the maximization in the M-step, okay? And so what I'd like to do today is describe the application of EM to a more complex model called the factor analysis model, and so it's important — I hope you understand the mechanics of how you do this, because we're going to do this today for a different model. [00:12:10] Questions about this before I move on? [00:12:22] Oh — so in order to, you know, foreshadow a little bit what we'll see: when it comes down to the mixture of Gaussians model — excuse me, the factor analysis model, which is what we're going to spend most of today talking about — in the factor analysis model, instead of z^(i) being discrete, z^(i) will be continuous, right, and there z^(i) will be distributed Gaussian. So in the
mixture of Gaussians model we had a joint distribution for x and z where z was a discrete random variable, and in the factor analysis model we'll describe a different model, you know, for p of x and z where z is continuous, and so instead of a sum over z^(i) there will just be an integral over z^(i), d z^(i), right? So the sum becomes an integral. And it turns out that if you go through the derivation of the EM algorithm that we worked out on Monday — all of the steps with Jensen's inequality — all of those steps work exactly as before; many of you can check every single step for whether, if z^(i) were continuous, it works the same as before once you change the sum to an integral, all right? [00:14:16] So I want to mention one other view of EM that's equivalent to everything we've seen up until now, which is, um, let me define J(theta, Q) — that is, J(theta, Q) = sum over i, sum over z^(i), of Q_i(z^(i)) log [ p(x^(i), z^(i); theta) / Q_i(z^(i)) ] — okay, it's that formula that you've
seen a few times now. What we proved on Monday was, um, that l(theta) is greater than or equal to J(theta, Q), right, and this is true for any theta and any choice of Q, okay? So using Jensen's inequality, you can show that, you know, J, for any choice of theta and Q, is a lower bound for the log likelihood of theta. So it turns out that an equivalent view of EM, to everything we've seen before, is that in the E-step what you're doing is maximizing J with respect to Q, and in the M-step you maximize J with respect to theta, right? [00:15:46] So in the E-step you're picking the choice of Q that maximizes this, and it turns out that the choice of Q we have will set J equal to l, and then the M-step maximizes this with respect to theta and pushes the value of l even higher. So this algorithm is sometimes called coordinate ascent: if you have a function of two variables, and you maximize with
respect to this one, then with respect to that one, and you go back and forth and optimize with respect to one at a time — that's a procedure that's sometimes called coordinate ascent, because you're maximizing with respect to one coordinate at a time. And so EM is a coordinate ascent algorithm relative to this cost function J, right? And, you know, on every iteration J ends up being set to l, which is why, as the algorithm increases J, you know that the log likelihood is increasing on every iteration. And if you want to track whether the EM algorithm is converging, or how it's converging, you can plot, you know, the value of J or the value of l on successive iterations and see that this value is going up monotonically, and then when it plateaus and isn't improving anymore, then you might have a sense that the algorithm is converging. [00:17:12] All right. [00:17:17] Okay, so that's the basic algorithm for EM and the mixture of Gaussians.
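That diagnostic is easy to sketch in Python: run EM and record the log likelihood after every iteration; the recorded values should only go up, then plateau. A self-contained 1-D toy version — the names and data here are my own, not from the course:

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_with_history(xs, phi, mus, variances, n_iters):
    # Run EM on a 1-D mixture of Gaussians, recording l(theta) after
    # every iteration as a convergence diagnostic.
    k, m = len(phi), len(xs)
    history = []
    for _ in range(n_iters):
        # E-step: w[i][j] = p(z^(i) = j | x^(i); theta) by Bayes' rule
        w = []
        for x in xs:
            joint = [phi[j] * normal_pdf(x, mus[j], variances[j]) for j in range(k)]
            s = sum(joint)
            w.append([p / s for p in joint])
        # M-step: closed-form updates for phi_j, mu_j, var_j
        phi = [sum(w[i][j] for i in range(m)) / m for j in range(k)]
        mus = [sum(w[i][j] * xs[i] for i in range(m)) /
               sum(w[i][j] for i in range(m)) for j in range(k)]
        variances = [sum(w[i][j] * (xs[i] - mus[j]) ** 2 for i in range(m)) /
                     sum(w[i][j] for i in range(m)) for j in range(k)]
        # l(theta) = sum_i log p(x^(i); theta): should never decrease
        history.append(sum(math.log(sum(phi[j] * normal_pdf(x, mus[j], variances[j])
                                        for j in range(k))) for x in xs))
    return phi, mus, variances, history
```

Plotting `history` against the iteration number gives exactly the convergence curve described above.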
[00:17:20] What I want to do now is, um, start to talk about the factor analysis model, all right? So, um, before the factor analysis algorithm, let me actually, uh, compare and contrast the mixture of Gaussians with factor analysis and talk about that a little bit. For the mixture of Gaussians, let's say n equals 2 and m equals 100, right? Say you have a dataset with two features x1 and x2 — so n is two — and maybe you have a dataset that looks like this. You know, then a mixture of Gaussians would be a pretty good model for this dataset, right? With, say, one Gaussian there and a second Gaussian there, you can kind of capture a distribution like this with a mixture of two Gaussians, and this is one illustration of when you apply mixtures of Gaussians: in this picture m is much bigger than n, right? You have a lot more examples than you have dimensions. Where I would not use a mixture of
Gaussians — and where you'll see in a minute factor analysis will apply — is maybe if m is about similar to n, or even if m is much less than n, okay? And so, um, just for purposes of illustration, let's say m equals 30 and n equals 100, right? So let's say you have hundred-dimensional data but only thirty examples. [00:19:26] And to make this more concrete: you know, many years ago there was a Stanford PhD student that was placing temperature sensors around different Stanford buildings, and so what you do is you measure the temperature at many different places, right, around campus. But if you have a hundred sensors, you know, taking a hundred temperature readings around campus, but only thirty days of data, or maybe thirty examples, then you would have hundred-dimensional data, because each example is a vector of a hundred temperature readings, you know, at different points around this building,
say — but you may have only thirty examples, say thirty such vectors. And so the application that the Stanford PhD student at the time was working on was, he wants to model p(x), right? So this is x as a vector of a hundred temperature readings. Because if something goes wrong — for example, a bad case is if there's a fire in one of the rooms — then there'll be a very anomalous temperature reading in one place, and if you can model p(x), then if you ever observe a value of p(x) that is very small, you would say, oh, looks like there's an anomaly there, right? [00:20:49] And, well, rather than worry about fires, at Stanford the use case was actually energy conservation: if someone unexpectedly leaves a window open in the building you were studying, you know, and it's winter and it's warmer inside the building, and cool air blows in and the
temperature of one room drops, then you want to realize that something was going wrong with the windows, or with the temperature in part of the building, okay? So for an application like that, you need to model p(x) as a joint distribution over, you know, all of the different sensors, right? Actually, if you imagine, maybe just in this room, let's say we have thirty sensors in this room — then the temperatures at the thirty different points in this room will be highly correlated with each other. But how do you model this vector — a hundred-dimensional vector — with a relatively small training set? [00:21:48] So it turns out there's a problem with applying a Gaussian model, right? One thing you could do is model this as a single Gaussian and say that x is distributed N(mu, Sigma), right? And if you look at your training set of thirty examples and find the maximum likelihood estimates of the parameters, you
find that the maximum likelihood estimate of mu is just the average, and the maximum likelihood estimate of Sigma is this. But it turns out that if m is less than or equal to n, then Sigma — this covariance matrix — will be singular, and singular just means non-invertible. I'll show an illustration in a second. [00:23:15] But if you look at the formula for the Gaussian density, right — so the Gaussian density kind of looks like this, abstracting away some details — when the covariance matrix is singular, then this term, this determinant term, will be zero, so you end up with 1 over 0, and then Sigma inverse is also undefined, or blows up to infinity, depending on how you think about it, right? So, you know, the inverse of a matrix like, um, diag(1, 10), right, would be, I guess, diag(1, 1/10), and an example of a non-invertible matrix — a singular matrix — would be this, and you
can't actually calculate the inverse of that matrix, right? So it turns out that, um, if your number of training examples is less than the dimension of the data, and you use the usual formula to derive the maximum likelihood estimate of Sigma, you end up with a covariance matrix that is singular — singular just means non-invertible — which means the covariance matrix would look like this, and so in the Gaussian density, when we try to compute p(x), you get infinity over 0 — oh, sorry, not actually, zero over zero — doesn't matter, it's all bad. [00:24:45] Um, and I think, let me just illustrate what this looks like, which is: let's say m equals 2 and n equals 2, right? So you have two-dimensional data x1 and x2, so n equals two, and the number of training examples is also two. So you've seen me draw contours of Gaussian densities like this, right, like ellipses like that.
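Here's a tiny numerical check of that m = n = 2 case, with toy numbers of my own: the maximum likelihood covariance of two points in two dimensions always comes out singular, so the determinant term in the density is zero.

```python
def mle_gaussian_2d(xs):
    # Maximum likelihood Gaussian fit in 2-D: mu is the sample mean and
    # Sigma = (1/m) * sum_i (x^(i) - mu)(x^(i) - mu)^T.
    m = len(xs)
    mu = [sum(x[d] for x in xs) / m for d in (0, 1)]
    sigma = [[sum((x[a] - mu[a]) * (x[b] - mu[b]) for x in xs) / m
              for b in (0, 1)] for a in (0, 1)]
    return mu, sigma

def det_2x2(s):
    return s[0][0] * s[1][1] - s[0][1] * s[1][0]

# Two training examples (m = 2) in two dimensions (n = 2): all the mass
# sits on the line through the two points, and Sigma is singular.
mu, sigma = mle_gaussian_2d([(0.0, 0.0), (2.0, 2.0)])
# det_2x2(sigma) == 0, so the 1/|Sigma| factor in the density blows up.
```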
It turns out that if you have two examples in the two-dimensional space and you compute the maximum likelihood estimate of the parameters of the Gaussian fit to that data, then it turns out that these contours will look like that, except that instead of being very thin, as I'm drawing it, it'll be infinitely skinny, see? And you'll end up with a Gaussian density where — I can't draw lines, you know, of zero width on the whiteboard, right — but it turns out that the contours will be squished infinitely thin. So you end up with a Gaussian density all of whose mass is on the straight line over there, with infinitely thin contours, with, you know, this Gaussian centered on the plane, I guess, or on the line connecting these two points. [00:25:59] And so, first, there are practical numerical problems, right — as in, you'll get zero over zero if you try to compute p(x) for any
example. And second, this very poorly conditioned Gaussian density puts all the probability mass on this line segment, and so any example right over there, just a little bit off, has no probability mass — has a probability density of zero — because the Gaussian is squished infinitely thin, you know, on that line, okay? But, you know, this is just not a good model, right, for this data. [00:26:41] So what we're going to do is, uh, come up with a model that will work even for these applications, even for a dataset like this, right? There's actually — I think one of the very early applications, the origins of the factor analysis model, was actually in psychological testing, where you, you know, administer a psychology exam to people to measure different personality attributes, right? So you might measure — you might have a
[00:27:14] So you might have a hundred questions measuring a hundred psychological attributes, but a dataset of only thirty persons. Doing psych research, collecting survey data, is hard, so maybe you have a sample of 30 people, and each person answers a hundred questions. Each person gives you one example x whose dimension is a hundred, and you have only thirty of these. So if you want to model p(x) — to model how correlated the different psychological attributes of people are: is intelligence correlated with math ability, is that correlated with language ability, is that correlated with other things — then how do you build a model for p(x)?

[00:28:11] All right, so if the standard Gaussian model doesn't work, let's look at some alternatives. One thing you could do is constrain Sigma to be diagonal. Sigma, the covariance matrix, is an n-by-n matrix — in this case a hundred-by-hundred matrix — but let's say we constrain it to have just diagonal entries and zeros on the off-diagonals. So the diagonal entries of the square matrix take these values, and all of the off-diagonal entries are set to zero. That's one thing you could do, and it turns out to correspond to constraining your Gaussian to have axis-aligned contours. So this is a Gaussian with zero off-diagonals, this would be another one, and this would be another one — these are examples of contours of Gaussian densities with zero off-diagonals, with the axes here being x1 and x2. Whereas you cannot model something like this [a tilted ellipse] if your off-diagonals are 0.

[00:29:37] And if you do this, the maximum likelihood estimates of the parameters are pretty much what you'd expect: the maximum likelihood estimate of the mean vector mu is the same as before, and the maximum likelihood estimate of Sigma_jj is (1/m) times the sum over i of (x_j^(i) − mu_j)², the average squared deviation of feature j — not a huge surprise.

[00:30:06] And it turns out the covariance matrix now has n parameters, instead of n squared — or, by symmetry, about n²/2 — parameters: just the n diagonal entries. Now, the problem with this is that this modeling assumption assumes that all of your features are uncorrelated — any two features are completely uncorrelated. If you have temperature sensors in this room, it's just not a good assumption that the temperatures at all points of the room are completely uncorrelated, completely independent of each other.
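The diagonal-Σ option can be sketched in a few lines. This uses synthetic stand-in data (there is no real survey dataset here) with the lecture's n = 100 features and m = 30 examples, and shows the resulting covariance estimate is invertible even though m < n:

```python
import numpy as np

# Synthetic stand-in for the 30 survey responses (m = 30, n = 100).
rng = np.random.default_rng(0)
m, n = 30, 100
X = rng.normal(size=(m, n))

# MLE under the diagonal constraint:
#   mu_j      = (1/m) sum_i x_j^(i)          (same as the unconstrained case)
#   Sigma_jj  = (1/m) sum_i (x_j^(i)-mu_j)^2 (per-feature variance)
mu = X.mean(axis=0)
sigma_diag = ((X - mu) ** 2).mean(axis=0)   # n numbers instead of ~n^2/2
Sigma = np.diag(sigma_diag)                 # axis-aligned Gaussian contours
```

All n diagonal variances are positive (almost surely), so this Σ is full-rank — the singularity problem is gone, at the cost of assuming every pair of features is uncorrelated.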
[00:30:45] Or if you measure the psychological attributes of people, it's just not a great assumption that the different psychological measures are completely independent. So while this model takes care of the technical problem of the covariance matrix being singular — you can't fit the full model on a hundred-dimensional dataset with 30 examples, but you can fit this one, and you won't run into numerical or singular-covariance-matrix problems — it's just not a very good model. You're assuming nothing is correlated with anything else.

[00:31:36] Something else you can do is make an even stronger assumption. This is an even worse model, but I go through it because it'll be a building block for what we'll actually do later, which is: constrain Sigma to be lowercase sigma squared times I, the identity. So constrain Sigma to be not only diagonal, but to have the same entry in every diagonal element. Now you've gone from n parameters to just one parameter, and this means you're constraining the Gaussians to have circular contours. So this is an example of what you can model, this would be another example, and this is another example — you can model things like this where not only is every feature uncorrelated with every other feature, but every feature further has the same variance as every other feature.

[00:32:58] And the maximum likelihood estimate — not a huge surprise — is the average over the previous per-coordinate values: sigma² = (1/(mn)) times the sum over i and j of (x_j^(i) − mu_j)². So what we'd like to do is not quite use either of these options, which both make the really big assumption that the features are uncorrelated. What we'd like to do is build a model that you can fit even when you have very high-dimensional data and a relatively small number of examples.
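The σ²I option is a one-parameter special case of the diagonal model above; its MLE is just the average of the per-coordinate variance estimates. A minimal sketch on the same kind of synthetic stand-in data:

```python
import numpy as np

# Synthetic stand-in data again (m = 30 examples, n = 100 features).
rng = np.random.default_rng(1)
m, n = 30, 100
X = rng.normal(size=(m, n))

mu = X.mean(axis=0)
per_coord_var = ((X - mu) ** 2).mean(axis=0)   # the diagonal-model estimates
# MLE of the single shared parameter:
#   sigma^2 = (1/(m*n)) * sum_{i,j} (x_j^(i) - mu_j)^2
sigma2 = per_coord_var.mean()
Sigma = sigma2 * np.eye(n)                     # circular (spherical) contours
```

This is the "even worse" model from the lecture: one variance shared by all features, kept here only because it reappears later as a building block of factor analysis.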
[00:33:22] But we want a model that allows you to capture some of the correlations. So if we have 30 temperature sensors in this room, probably there are some correlations: maybe the ambient temperature in this whole building, or in this room, goes up and down as a whole, but maybe the lamps on one side heat up that side of the room a bit more. So there are correlations, but maybe you don't need the full covariance matrix either. What factor analysis will do is give us a model that you can fit even when you have, you know, a hundred dimensions and only thirty examples, that captures some of the correlations, but that doesn't run into the non-invertible covariance matrices that the naive Gaussian model does.

[00:34:22] All right, so let me describe the model — actually, let me check, any questions from anyone? [Student question] Oh sure, yes. Yes, there is one thing you can do: a common thing to do is apply a Wishart prior, and what that boils down to is adding a small diagonal value to the maximum likelihood estimate. In a technical sense it takes away the non-invertible-matrix problem, but it's actually not the best algorithm for a lot of types of data. With the Wishart, or inverse Wishart, prior, you basically take the maximum likelihood Sigma and add some constant to the diagonal. It takes care of the problem in a technical way, but it's not the best model for a lot of datasets, I see.

[00:35:20] [Student question] Oh yes — why go through option two when it's even worse than option one? Um, yes, option two is not a good option, but I need to use it as a building block for factor analysis.
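The questioner's fix — an (inverse-)Wishart prior that "adds a constant to the diagonal" — can be sketched directly on the earlier two-point example. The value of `eps` is an arbitrary illustrative choice, not anything prescribed in the lecture:

```python
import numpy as np

# Same made-up singular case as before: m = 2 points in n = 2 dims.
X = np.array([[1.0, 2.0],
              [3.0, 5.0]])
mu = X.mean(axis=0)
Sigma_mle = (X - mu).T @ (X - mu) / X.shape[0]   # singular MLE

# "Add some constant to the diagonal" (the MAP-style regularization
# the Wishart prior boils down to). eps is an arbitrary choice here.
eps = 1e-2
Sigma_reg = Sigma_mle + eps * np.eye(2)

print(np.linalg.det(Sigma_mle), np.linalg.det(Sigma_reg))
```

The regularized Σ is invertible, so the density can be evaluated — but, as the lecture says, this only fixes the technical problem; it is still a poor model of the data's actual correlation structure.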
[00:35:32] You'll see — it shows up as a small component of Psi. I actually plan these things out. [Student question] I see — yeah, actually, the machine learning world evolves all the time, which I find fascinating. If you look at the big tech companies, a lot of the large tech companies are all working on exactly the same problems: every large software/AI tech company works on machine translation, every one of them works on speech recognition, every one of them works on face recognition — and I've been part of these teams myself. And I think it's great that we have so much progress in machine translation, because so many people in so many large companies work on it; it's actually really gratifying to see so much progress on these problems that every single large tech company works on.

[00:36:24] One of the fascinating things I see is that, because all this work in the large tech companies targets very similar problems, one of the really overlooked parts of the machine learning world is small data problems. They're all working on big data — representing English and French and Chinese and Spanish sentences — but on small data, does it work? I think there's a disproportionately small amount of attention on small data problems, where instead of a hundred million images you maybe have 100 images. With some of the teams I work with these days at Landing AI, I actually spend a lot of my time thinking about small data problems, because a lot of the practical applications of machine learning — including the ones you see in your class projects — are actually small data problems. When one works with a healthcare system or a hospital, for a lot of the problems you only have 100 examples, or a thousand, or 10,000; you don't have a million patients with the same medical condition. And so — earlier this week I was using a slightly modified version of factor analysis on a manufacturing problem at Landing AI — I think a lot of these small data problems are where a lot of the exciting work in machine learning is to be done, and somehow it feels like a blind spot, like a gap, in a lot of the work done in the AI world today.

[00:37:51] [Student question] Yeah — why don't we use the same algorithms as with big data? It turns out that if you look at the computer vision world, there's a dataset everyone was working on — now we've moved past it and don't really use it much anymore — called ImageNet, which had a million images. Tons of computer vision architectures have been heavily designed for the use case of having exactly 1 million training examples. It turns out that the algorithm that works best when you have maybe 100 training examples looks different from the best learning algorithm for a million. So I think right now the machine learning world is not very good at understanding the scaling: the best algorithm for one training example, as far as we as a community have been able to invent algorithms, is different from the best algorithm for a thousand, and the best for a million is different again — and Facebook recently published a paper using 3.5 billion images, which is very large.

[00:39:04] So I think we don't actually have a good understanding of how to modify our algorithms so that one algorithm works on every single point of the spectrum from one example to a billion examples. There's a lot of work optimizing for different points of the spectrum, and there's been a lot of work optimizing for big data, which is great — we've built large systems that handle, whatever, petabytes of data a day, and that's great. But I feel that, relative to the number of application opportunities, there's a lot of work on small data still to be done, which I find very exciting. And I think of this as an example: the reason I was literally using a modified version of this model earlier this week on a manufacturing problem is that there isn't much data in those scenarios. All right, that was a bit off topic, but let's go on and describe the model.
[00:40:02] Well, hopefully — yeah, so this stuff does get used. So let's talk about the model. Similar to the mixture of Gaussians, I'm going to define a model with p(x, z) equal to p(x | z) times p(z), where z is hidden, okay? So that's the framework, same as the mixture of Gaussians. Let me now define the factor analysis model.

[00:40:50] First, z will be distributed according to a Gaussian density, z ~ N(0, I), where z is in R^d with d less than n. To make it concrete, maybe think of d = 3, n = 100, m = 30 — just a concrete example to keep in mind. And what we're going to assume is that x is equal to mu plus Lambda z — this is the capital Greek letter Lambda — plus epsilon, where epsilon is distributed Gaussian with mean 0 and covariance Psi.

[00:41:51] So the parameters of this model are mu, which is n-dimensional; Lambda, which is n by d; and Psi, which is n by n — and we're going to assume that Psi is diagonal, okay? And, let's see, an equivalent way to write that second equation is that, given the value of z, the conditional distribution of x — x given z — is Gaussian with mean mu plus Lambda z and covariance Psi. So this is p(z), and this is p(x | z). Once you've sampled z, x is computed as mu plus Lambda z — which is just some constant given z — and then you add Gaussian noise to it. So an equivalent way of defining that equation is to say that the mean of x conditioned on z is this first term, mu plus Lambda z, and the covariance of x given z is Psi, coming from that additional noise term epsilon that you add, okay?
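The two-stage generative process just defined — z ~ N(0, I_d), then x = μ + Λz + ε with ε ~ N(0, Ψ), Ψ diagonal — can be sampled directly. This sketch uses the lecture's running dimensions d = 3, n = 100, m = 30; the specific μ, Λ, Ψ values are arbitrary placeholders, not anything from the lecture:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, m = 3, 100, 30                 # latent dim, observed dim, sample size

# Placeholder parameters for illustration only.
mu = np.zeros(n)                               # n-dimensional mean
Lam = rng.normal(size=(n, d))                  # Lambda: n x d factor loadings
Psi = np.diag(rng.uniform(0.1, 1.0, size=n))   # diagonal noise covariance

# Generative process: z ~ N(0, I), x = mu + Lambda z + eps, eps ~ N(0, Psi).
Z = rng.normal(size=(m, d))
eps = rng.multivariate_normal(np.zeros(n), Psi, size=m)
X = mu + Z @ Lam.T + eps                       # m x n data matrix

# Marginally x ~ N(mu, Lambda Lambda^T + Psi): full-rank even though m < n,
# since the diagonal Psi contributes positive variance in every direction.
Sigma_implied = Lam @ Lam.T + Psi
```

Note the implied covariance ΛΛᵀ + Ψ has only nd + 2n free parameters rather than ~n²/2, which is exactly why this model remains fittable with m = 30 examples in n = 100 dimensions.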
[00:43:26] So let me go through a few examples. I think the intuition behind this model is this: suppose there are three powerful forces driving temperatures across this room. Maybe one force is just the temperature here in Palo Alto, here at Stanford; another is how bright the lights on the left side of the room are and how much they heat up that side; and another is how much the lights heat up the right side of the room. So say there are three main driving factors affecting the temperature of this room — that's when d would be equal to three. You assume there are three things in the world that drive the temperature of this room, and z is three-dimensional: the temperature in Palo Alto, kind of around this area; how bright the lights are on one side; and how bright they are on the other — and you try to capture that with three numbers.

[00:44:15] Given those three numbers — given z — the actual temperatures for the sensors we scatter around this room will be determined sensor by sensor. So plant 30 temperature sensors all over this room; each sensor will measure an actual temperature that is a linear function of those three powerful forces. If a sensor is on that side of the room, it will be affected more by how bright the lights on that side are; if the sensor is near the door, it will be more affected by the outside temperature, the temperature here in Palo Alto. So x will be a linear function — this first term, mu plus Lambda z, that I underlined — but beyond that term there's a little noise: each sensor has its own noise term, governed by this additional term epsilon.

[00:45:11] And the assumption that the matrix Psi is diagonal is saying that, after you compute the mean, the noise you observe at your sensor is independent of the noise at every other sensor. Maybe the sensor up there is just noisier, or catches a gust of wind or something, but you assume that the noise observed at different sensors is independent: the additional epsilon error term has a diagonal covariance matrix given by Psi, okay? So you can think of that as what factor analysis is trying to model.

[00:45:54] So let me go through a couple of examples of the types of data factor analysis can model. Oh, and again, bound by the constraints of the whiteboard, I'm going to have to go low-dimensional here. So let's say z is in R^1 and x is in R^2 — so in this example, I guess, d is equal to 1 and n is equal to 2 — and let's say m is 7.
[00:46:42] So what would be a typical sample generated by this model? What would be an example of the type of data this can model? [00:46:57] Well, this would be a typical sample of Z_i: Z is just drawn from a standard Gaussian, so Z is Gaussian with mean 0 and unit variance. So that's a number line, and if you draw seven points from the Gaussian, maybe you get a sample like that, okay? [00:47:18] And now let's say lambda is (2, 1), and let's just say mu is (0, 0). [00:47:33] So now let's compute lambda Z plus mu. Given a sample like that, if you compute lambda Z plus mu, this will now be in R^2; so here's x_1, here's x_2, and I'm gonna take those examples and map them to a line as follows. These examples are in R^1, so each Z is just a real number, and lambda Z plus mu is now two-dimensional, because lambda is a 2-by-1 matrix. [00:48:24] So you end up with this: this would be a typical sample of lambda Z plus mu, and it's a two-dimensional data set, but all of the examples lie perfectly on a straight line, okay? [00:48:36] Then finally, let's say that Psi, the covariance matrix, is equal to this diagonal covariance matrix, and this covariance matrix corresponds to x_2 having a bigger variance than x_1. So the density of epsilon has ellipses that look a little bit like this, taller than wide (the aspect ratio should technically be 1 over root 2, since the standard deviation would be root 2). [00:49:05] And so, in the last step, with x equals lambda Z plus mu plus epsilon, we're going to take each of these points we have and put a little Gaussian contour (I'm just drawing one contour, but yes, it is a 2D shape) on top of each of them, [00:49:30] and if you sample one point from each of these Gaussians, then maybe you get this example, this example, this example, and so on. So what I just did was look at each of those Gaussian contours and sample a point from that Gaussian, and so the red crosses here are a typical sample drawn from this model, okay? [00:49:52] And so if you have data that looks like this, that looks like the red crosses, then the Z's are latent random variables: when you get the data, you can't actually see Z. What you actually see is just the red crosses; that's your training set. And if you apply the factor analysis model, then by EM and so on hopefully you can find parameters that model this data set pretty well. But hopefully this gives you a sense of the type of data set this model could generate.
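The sampling process just described can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the lecture; in particular Psi = diag(1, 2) is an assumed value chosen so that x_2 has the bigger variance, since the exact board numbers aren't in the transcript.

```python
import numpy as np

rng = np.random.default_rng(0)

# d = 1 latent dimension, n = 2 observed dimensions, m = 7 samples.
m = 7
Lambda = np.array([[2.0], [1.0]])   # the 2-by-1 "lambda" from the example
mu = np.zeros(2)
Psi = np.diag([1.0, 2.0])           # assumed diagonal noise covariance

z = rng.standard_normal((m, 1))                  # z ~ N(0, 1): seven points on a number line
on_line = z @ Lambda.T + mu                      # lambda z + mu: lies exactly on a line in R^2
eps = rng.multivariate_normal(np.zeros(2), Psi, size=m)
x = on_line + eps                                # the "red crosses": the line plus Gaussian fuzz

print(x.shape)   # (7, 2)
```

Plotting `on_line` against `x` reproduces the board picture: seven points on a straight line, each perturbed by an axis-aligned Gaussian.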
[00:50:37] One way to think of this data is: you have two-dimensional data, but most of the data lies on a one-dimensional subspace. That's how to think about it: you have two-dimensional data, since n is 2, but most of the data lies on a roughly one-dimensional subspace, meaning it lies up there on the line, and then there's a little bit of noise off that line, okay? [00:50:58] All right, let me quickly do one more example, because these are high-dimensional spaces and I think this is useful for building intuition. [00:51:08] So let's go through the example where Z is in R^2, X is in R^3, and let's use m equals 5. So with a different set of parameters, let's look at the type of data you can generate with factor analysis. So here's z_1, here's z_2; Z is distributed as a standard Gaussian, you know, a circular Gaussian, so maybe this is what a typical sample looks like if you sample z_1 and z_2 from a standard Gaussian. [00:51:49] That would be a typical sample in z_1 and z_2. So now, all right, I'm gonna do a demo: let me take these five examples and just copy them to this piece of paper. Okay, great, so we've transferred this from the whiteboard to this piece of paper, this brown cardboard. So now you have z_1 and z_2 in a two-dimensional space. [00:52:23] What we're going to do is compute lambda Z plus mu, where lambda will be 3 by 2 and mu will be 3 by 1. [00:52:34] So what this computation will do, as you map from Z in two dimensions to lambda Z plus mu, is map from two-dimensional data to three-dimensional data. In other words, you're going to take the two-dimensional data lying on the plane of the whiteboard and map it (check out the cool animation) into the three-dimensional space of our classroom.
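The whiteboard-to-classroom map on its own can be sketched as below. Lambda and mu here are made-up values (the transcript gives none); the point is that any 3-by-2 Lambda plus a 3-vector mu sends the five 2D points onto a two-dimensional plane sitting inside R^3.

```python
import numpy as np

rng = np.random.default_rng(1)

m = 5
Lambda = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])     # assumed 3-by-2 matrix
mu = np.array([0.0, 0.0, 1.0])      # assumed 3-by-1 offset

z = rng.standard_normal((m, 2))     # five draws from a circular standard Gaussian in R^2
x_plane = z @ Lambda.T + mu         # the 2D "whiteboard" points mapped into 3D

# Before any noise is added, every point lies exactly on a 2D affine plane:
# after centering, the data matrix has rank at most 2.
centered = x_plane - x_plane.mean(axis=0)
print(x_plane.shape, np.linalg.matrix_rank(centered))
```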
[00:53:04] And then the last step is: for each of these points in this three-dimensional space, x_1, x_2, x_3, we'll have a little Gaussian bump that is axis-aligned, because the components of epsilon are uncorrelated. We take each of these five points and add the fuzziness, a little bit of Gaussian noise, to each one, and so what you end up with is a set of red crosses: a few examples lying near the plane, except that they'll have a bit of noise off this plane as well. [00:53:42] So what the factor analysis model can capture is data in 3D, in this 3D space, where most of the data set lies on this maybe roughly two-dimensional pancake, but with a little bit of fuzziness off the pancake. So this would be an example of the type of data that factor analysis can model. [00:54:07] And the intuition, really, is that factor analysis can take very high-dimensional data, say 100-dimensional data, and model the data as roughly lying on a three-dimensional or five-dimensional subspace, with a little bit of noise off that low-dimensional subspace. [00:54:55] So let's talk about... oh, right, the question: does this work as well if the data is not lying on a low-dimensional subspace? Let's see. Even in 2D, if you have this data set, you still have the freedom to choose the Gaussian noise, in which case you can actually model things that lie quite far off a subspace. But yeah, with a very high-dimensional data set it's actually very difficult to know what's going on, because you can't visualize these very high-dimensional data sets, and you also don't have enough data to fit very complex models. So I feel like, yes, if the data actually does not roughly lie on a subspace, then this model may not be the best model.
But when you have such high-dimensional data and such a small data set, you can't fit very complex models to it anyway, so this might be pretty reasonable. [00:56:11] So, it turns out that the derivation of EM for factor analysis is actually one of the trickiest EM derivations, in terms of how you calculate the E-step and how you calculate the M-step. The whole algorithm, every single step, is stepped through in great detail in the lecture notes, but what I want to do is give you the flavor of how to do the derivation, and especially draw attention to the trickiest steps, so that if you ever need to derive an algorithm like this yourself, or maybe a different Gaussian model, you know how to do it. But I won't do every step of the algebra here. [00:56:46] So in order to set ourselves up to derive EM for factor analysis, I want to describe a few properties of multivariate Gaussians. [00:56:56] Let's say that X is a vector, and I'm gonna write it as a partitioned vector: there are r components in the first part and s components in the second, so x_1 is in R^r and x_2 is in R^s. [00:57:26] If X is Gaussian with mean mu and covariance Sigma, then similarly let mu be written as this sort of partitioned vector, just broken up into two sub-vectors corresponding to the first r components and the second s components, and similarly let the covariance matrix be partitioned into these four blocks, where this is r components, this is s components, this is r components, this is s components. So all this means is: you take the covariance matrix and take the top-left r-by-r elements and call that Sigma_11, and similarly for the other sub-blocks of this covariance matrix. [00:58:22] So in order to derive factor analysis, one of the things you need to do is compute marginal and conditional distributions of Gaussians. [00:58:31] The marginal is, you know: what is p(x_1)? If you were to derive this, the way you compute the marginal is to take the joint density p(x), which you can write as p(x_1, x_2) because X can be partitioned into x_1 and x_2, and then integrate out x_2: the integral of p(x_1, x_2) dx_2 gives you p(x_1). And if you plug in the formula for the Gaussian density, you know, 1 over ((2 pi)^(n/2) |Sigma|^(1/2)) times e to the minus 1/2 (x minus mu)^T Sigma^{-1} (x minus mu), if you plug this into p(x_1, x_2) and actually do the integral, [00:59:49] then you will find that the marginal distribution of x_1 is given by: x_1 is Gaussian with mean mu_1 and covariance Sigma_11. So it's kind of a not-shocking result that the marginal distribution is given just by that, [01:00:12] and again the way to show it rigorously is to do this calculation, but it's actually not shocking, I guess, that that's what you would get, okay? [01:00:24] And then the other property you will need to use is the conditional, which is: given the value of x_2, what is the conditional distribution of x_1? The way to do that, in theory, is to take p(x_1, x_2) divided by p(x_2) and then simplify, and it turns out you can show that x_1 given x_2 is itself Gaussian, with some mean and some covariance, which I'll write as mu_{1|2} and Sigma_{1|2}. And mu_{1|2} is... well, this is one of those long formulas that I actually don't manage to remember; every time I need it I just look up what's written in the lecture notes, and I recommend you do that as well. [01:01:40] So that's how you compute marginals and conditionals of a Gaussian distribution.
[01:01:59] So, using these properties of the multivariate Gaussian density, let's go through the high-level steps of how you derive the EM algorithm. [01:02:34] Step one: let's derive the joint distribution p(x, z). In particular, it turns out that if you take Z and X and stack them up into a vector like so, then (Z, X) viewed as a vector will be Gaussian, with some mean and some covariance, because X and Z jointly have a Gaussian density. So let's try to quickly figure out what that mean and that covariance matrix are. [01:03:18] So that was the definition of these terms. The expected value of Z is equal to 0, because Z is Gaussian with mean 0 and covariance the identity, and the expected value of X is equal to the expected value of mu plus lambda Z plus epsilon; but Z has zero expected value and epsilon has zero expected value, so that just leaves you with mu. And so this mean vector mu_{zx} is going to be equal to (0, mu) stacked, [01:03:59] where the zero part is d-dimensional and the mu part is n-dimensional. [01:04:22] And it turns out that you can similarly compute the covariance matrix Sigma, where the first block is d dimensions and the second is n dimensions. If you take this partitioned vector and compute the covariance matrix, the four blocks of the covariance matrix can be written as follows, and you can derive one at a time what each of these different blocks looks like. Let me just derive one of them, Sigma_22, the lower-right block; the rest are derived similarly and are also fleshed out in the lecture notes. [01:05:43] The way you derive this block is you say: Sigma_22 is E[(X - E[X])(X - E[X])^T]. If I plug in the definition of X, X minus E[X] is lambda Z plus mu plus epsilon minus mu, [01:06:21] because the expected value of X is mu, so the mus cancel out, leaving lambda Z plus epsilon. [01:06:34] Then if you do the quadratic expansion, this becomes the expected value of (lambda Z + epsilon)(lambda Z + epsilon)^T; it's (a + b)(a + b)^T, so you get four terms as a result: the first term is lambda Z (lambda Z)^T, plus lambda Z epsilon^T, plus epsilon (lambda Z)^T, plus epsilon epsilon^T. [01:07:25] The cross terms have zero expected value, because epsilon and Z both have zero expected value and are uncorrelated, so those are 0 in expectation, and you're just left with the expected value of lambda Z Z^T lambda^T plus the expected value of epsilon epsilon^T. [01:07:56] By the linearity of expectation you can take the expectation inside the matrix multiplication, so the first term is lambda E[Z Z^T] lambda^T, plus the second term, which is just the covariance of epsilon, which is Psi. And then, because Z is drawn from a standard Gaussian with identity covariance, that expectation in the middle is just the identity, so this is lambda lambda^T plus Psi. [01:08:26] Okay, so that's how you work out this lower-right block of the covariance matrix. I know I did that a little bit quickly, but every step is written out more slowly in the lecture notes as well. [01:08:42] And it turns out that if you go through a similar process, deriving one at a time what the other blocks of this covariance matrix are, you find that the other blocks are the identity and lambda^T (and its transpose, lambda), and the lower-right one we just worked out; so the full covariance is [[I, lambda^T], [lambda, lambda lambda^T + Psi]]. So that is the covariance matrix.
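One can sanity-check the block derivation numerically: sample many (z, x) pairs from the generative model (with made-up Lambda, mu, Psi values) and compare the empirical covariances against the derived blocks Lambda^T and Lambda Lambda^T + Psi. A sketch:

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up parameters, d = 1 and n = 2.
Lambda = np.array([[2.0], [1.0]])
mu = np.zeros(2)
Psi = np.diag([1.0, 2.0])

N = 200_000
z = rng.standard_normal((N, 1))
eps = rng.multivariate_normal(np.zeros(2), Psi, size=N)
x = z @ Lambda.T + mu + eps

# Lower-right block: Cov(x) should approach Lambda Lambda^T + Psi.
emp_xx = np.cov(x, rowvar=False)
print(emp_xx, Lambda @ Lambda.T + Psi)

# Off-diagonal block: Cov(z, x) should approach Lambda^T (a 1-by-2 block here).
emp_zx = (z - z.mean(0)).T @ (x - x.mean(0)) / (N - 1)
print(emp_zx, Lambda.T)
```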
[01:09:41] So where we are is: we've figured out that the joint distribution, the joint density, of (Z, X) is Gaussian, with mean given by that vector and covariance given by that matrix. [01:10:04] And so what you could do is write down p(x_i), which will be this Gaussian density (the marginal), take derivatives of the log-likelihood with respect to the parameters, set them to 0, and solve. And you find that there is no known closed-form solution; there is actually no closed-form solution for finding the values of lambda, Psi, and mu that maximize the likelihood. [01:10:32] So in order to fit the parameters of the model, we're instead going to resort to EM. [01:11:02] So let's first derive the E-step, in which you need to compute Q_i(z_i) = p(z_i | x_i). Now, z_i here is a continuous random variable. When we were fitting a mixture-of-Gaussians distribution, z_i was discrete, and so you could have a list of numbers, represented by w_ij, that just stores in a vector the probability of each of the discrete values of z_i. But in this case z_i has a continuous density, so how do you represent q_i(z_i) in a computer? [01:11:37] It turns out that, using the formulas we have for the marginal, excuse me, for the conditional distribution of a Gaussian, if you compute this right-hand side you'll find that z_i given x_i is going to be Gaussian with some mean and some covariance, [01:12:01] where it's basically those formulas: mu of z_i given x_i is equal to (taking that conditional-mean formula and applying it, with the mean term here being 0) lambda^T (lambda lambda^T + Psi)^{-1} (x_i minus mu), and Sigma of z_i given x_i is I minus lambda^T (lambda lambda^T + Psi)^{-1} lambda. [01:12:44] Okay, so these equations are exactly those two conditional-Gaussian equations, applied to that big Gaussian density that we have, okay?
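The E-step formulas just stated can be sketched directly, with assumed toy parameter values; note that the posterior covariance is the same for every example, so only the per-example mean depends on x_i:

```python
import numpy as np

def e_step(X, Lambda, mu, Psi):
    """Posterior q_i(z_i) = p(z_i | x_i) = N(mu_post[i], Sigma_post) under factor analysis.

    mu_{z|x}    = Lambda^T (Lambda Lambda^T + Psi)^{-1} (x - mu)
    Sigma_{z|x} = I - Lambda^T (Lambda Lambda^T + Psi)^{-1} Lambda
    """
    n, d = Lambda.shape
    G = np.linalg.inv(Lambda @ Lambda.T + Psi)      # n x n, symmetric
    mu_post = (X - mu) @ G @ Lambda                 # m x d: one posterior mean per example
    Sigma_post = np.eye(d) - Lambda.T @ G @ Lambda  # d x d: shared posterior covariance
    return mu_post, Sigma_post

# Toy parameters (d = 1, n = 2) just to exercise the formulas.
Lambda = np.array([[2.0], [1.0]])
mu = np.zeros(2)
Psi = np.diag([1.0, 2.0])
X = np.array([[2.0, 1.0], [0.0, 0.0]])
m_post, S_post = e_step(X, Lambda, mu, Psi)
print(m_post.shape, S_post.shape)   # (2, 1) (1, 1)
```

These are exactly the quantities you would store as the representation of Q_i before moving on to the M-step.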
[01:13:01] You compute this vector and compute this matrix, and, you know, store these as variables, and your representation of Q_i is that Q_i is a Gaussian density, right, with this mean and this covariance. So this is what you actually compute to represent Q_i.

[01:13:32] All right, so step two was to write down the E-step, and step three is to derive the M-step. The derivation of the M-step is quite long and complicated, but I want to mention just a key algebraic trick you need to use when deriving the M-step. So, you know, we know from the E-step that Q_i(z(i)) is that Gaussian density, right, so it's 1 over (2 pi)^{d/2} times that thing, times e to the negative one-half of the quadratic form; so that's the formula for Q_i. [01:14:21] It turns out that in the M-step there will be a few places in the derivation where you need to compute something like this integral of Q_i(z(i)) times z(i), and one way to approach this would be to plug in
the density for Q_i, so you end up with this 1 over (2 pi)^{d/2} |Sigma|^{1/2}, you know, and so on, times z(i) dz(i), and then try to compute this integral. [01:15:03] It turns out there's a much simpler way to compute this; anyone know what it is?

[01:15:13] All right, cool, awesome. Right, expected value. So the other way to compute this integral is to note that this is the expected value of z(i) when z(i) is drawn from Q_i. Right, so you know the definition of the expected value of a random variable: the expected value of z is equal to the integral over z of p(z) times z, dz. That's what the expected value of a random variable is, and so this integral is the expected value of z with respect to z drawn from the Q_i distribution. But we know that Q_i is Gaussian with a certain mean and a certain variance, and so the expected value of this is just mu_{z(i)|x(i)}, that thing that you've already computed. [01:16:09] And so when students derive the M-step, you know, for your own implementations of this, one of the key things to notice is: when are you actually taking an expected value with respect to a random variable, in which case it's just a value you've computed already, and when do you need to plug in this big complicated integral, which can lead to very complicated, very intractable calculations. Okay, so whenever you see this, think about whether you need to be expanding a big complicated integral or whether it can be interpreted as an expectation.

[01:16:46] And so for the M-step, it's really, you know, the M-step is... right, so that's the M-step. And if you rewrite this term as a sum over i of the expected value of z(i) drawn from Q_i, it turns out that, um, if you go ahead and plug in the Gaussian density here... [01:18:02] Actually, I want to give one rule of thumb for whether or not you should plug in the complicated formula for a Gaussian density.
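(A quick numerical illustration of that trick; the mean and variance below are made-up toy values, not anything from the lecture. The integral of Q(z) times z is just the mean of Q, so sampling confirms it without ever expanding the Gaussian integral.)

```python
import numpy as np

# Toy Q(z) = N(m, s^2).  Then  integral Q(z) * z dz = E[z] = m,  and
# integral Q(z) * z^2 dz = E[z^2] = m^2 + s^2, both known in closed form.
m, s = 1.5, 0.7
rng = np.random.default_rng(0)
z = rng.normal(m, s, size=1_000_000)  # samples drawn from Q

print(z.mean())         # Monte Carlo value of the first integral, approx 1.5
print((z ** 2).mean())  # Monte Carlo value of the second, approx m**2 + s**2 = 2.74
```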
This is just a rule of thumb after doing this type of math a long time: see if there's a log in front. If there's a log in front of a Gaussian density, then because the Gaussian density is an exponentiation, right, the Gaussian density is, you know, one over something times e to the something, the log and the exponentiation cancel out and the equations simplify. So one trick as you're doing these derivations is just to see if there's a log in front of a Gaussian density, and when there is, go ahead and plug in the formula for the Gaussian density; the log will simplify it, and what you end up with is the log of a Gaussian density being a quadratic function of the parameters. [01:18:46] And if you take the expected value, with respect to a Gaussian density, of a quadratic function, this whole thing ends up being a quadratic function, and then you can take derivatives of that equation with respect to the parameters, with respect to mu, say, set the whole thing to zero, and then solve, and it'll be roughly the level of complexity of maximizing a quadratic function. [01:19:11] Okay, hope that makes sense. Um, the actual formulas are a little bit complicated, so I'll leave you to read through them in the lecture notes, but I think the takeaway is: don't expand this integral, and when you are deriving this, plug in the Gaussian densities here, because they'll all be simplified. Okay, and the details are in the lecture notes. So let's break for today. Best of luck with the midterm, I hope you guys do well. All right, I'll see you guys in a few days.

================================================================================ LECTURE 016 ================================================================================
Lecture 16 - Independent Component Analysis & RL | Stanford CS229: Machine Learning (Autumn 2018)
Source: https://www.youtube.com/watch?v=YQA9lLdLig8
---
Transcript

[00:00:03] Hey everyone, let's get started. So, um,
[00:00:12] let's see, the plan for today is: we'll go over the rest of ICA, independent component analysis, and in particular talk about CDFs, cumulative distribution functions, and then derive the ICA model. And in the second half of today we'll start on the final of the four major topics of the course, which is reinforcement learning; we'll talk about MDPs, or Markov decision processes.

[00:01:07] So to recap briefly: you remember the overlapping-voices demo. We said that in the ICA problem, the independent components problem, we have sources s which are in R^n if you have n speakers. So for example, if this is speaker one's audio, then at time t, s, you know, superscript parenthesis t, subscript 1, is the sound emitted by speaker 1 at time t. [00:01:47] And we're sometimes using i to index training examples, and so the training examples sweep over time; usually I use i, sometimes I use t, I guess in the case where the different examples come from different points in time in your recording. And what your microphones record is x(i) = A s(i). So just for now let's say you have two speakers and two microphones, in which case A will be a 2x2 matrix, or a harder problem you might face is five speakers and five microphones, in which case A will be a 5x5 matrix; we'll talk later about what happens when the number of speakers and the number of microphones is not the same. And the goal is to find a matrix W, which should hopefully be A inverse, so that s(i) = W x(i) recovers the original sources. And we're going to use w_1 up to w_n to represent the rows of this matrix W. Oh, yes, you're right, thank you.

[00:02:59] So last time we had, all right, just remember, this is a picture of the cocktail party problem, and last time I showed these pictures about, you know, why is ICA even possible? Given two overlapping voices, how is it even possible to separate them out? How is there enough information to know, you know, what the two overlapping voices are? And so one picture we saw was this one, where if s_1 and s_2 are uniform between minus 1 and plus 1, then the distribution of the data will look like this. If you pass this data through the mixing matrix A, then your observations, now the axes have changed to x_1 and x_2, may look like this, and your job is to find an unmixing matrix W that maps this data back to the square. [00:03:57] Okay, now, this example is possible because the sources s_1 and s_2 were distributed uniformly between minus 1 and plus 1.
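(That square-to-parallelogram picture is easy to reproduce numerically; the 2x2 mixing matrix A below is an arbitrary made-up example, not one from the lecture.)

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.uniform(-1, 1, size=(10_000, 2))  # two sources, each uniform on [-1, 1]

A = np.array([[1.0, 0.6],                 # a hypothetical mixing matrix:
              [0.4, 1.0]])                # microphones observe x = A s
X = S @ A.T                               # the square becomes a parallelogram

W = np.linalg.inv(A)                      # the ideal unmixing matrix W = A^{-1}
S_rec = X @ W.T                           # s = W x maps the data back to the square

assert np.allclose(S_rec, S)              # exact recovery with the true W
```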
It turns out human voices, you know, the recordings at each moment in time, are not distributed uniformly between minus 1 and plus 1, and it turns out that, um, if the data were Gaussian, then ICA is actually not possible. [00:04:19] Here's what I mean. So the uniform distribution is a highly non-Gaussian distribution, right; uniform between minus 1 and plus 1, you know, this is not Gaussian, and that makes ICA possible. What if s_1 and s_2 came from Gaussian densities? Right, if that were the case, then this distribution of s_1 and s_2 would be rotationally symmetric, and so there'd be a rotational ambiguity, right: any axis could be s_1 and s_2, and you can't map, you know, this type of parallelogram back to this square. [00:04:56] Right, so if I drew in this parallelogram, you can sort of read off, you know, that maybe one axis should look like that (sorry, I'm drawing with the mouse and not doing very well), and the second axis should maybe look like that, right, and by inverting that you can get the data back to the square. But if the data looks like this, then you actually don't know, because maybe this should be s_1 and that should be s_2, right. So there's this rotational ambiguity: because the Gaussian distribution is rotationally symmetric, if s_1 and s_2 are standard Gaussian, then this distribution is rotationally symmetric, and you don't have enough information to recover the directions that correspond to the original sources.

[00:05:44] Okay, so it turns out that there is some ambiguity in the output of ICA. In particular, last time we talked about two sources of ambiguity: you don't know which is speaker 1 and which is speaker 2, right, you don't know which one to number speaker 1 and which one to number speaker 2; and you might take this data and flip it horizontally, reflect it, you know, rename s_1 to negative s_1, or reflect it on the vertical axis, so we don't know positive s_2 from negative s_2. And in the case of this example, where s_1 and s_2 are uniform between minus 1 and plus 1, those are the only sources of ambiguity. But if the data were Gaussian, there's the additional rotational ambiguity, which actually makes it impossible to separate out the sources. [00:06:43] Okay, so it turns out that the Gaussian density is the only distribution that is rotationally symmetric: if s_1 and s_2 are independent and the distribution is rotationally symmetric, meaning that the distribution has sort of circular contours, then it must be a Gaussian density. And so there is a theorem, which I'm just stating informally, that ICA is possible only if your data is not Gaussian.
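(The rotational ambiguity can also be seen numerically: samples from N(0, I) look statistically identical after any rotation, so no statistic can pin down the original axes. The rotation angle below is arbitrary.)

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(200_000, 2))             # s ~ N(0, I), rotationally symmetric

theta = 0.7                                    # an arbitrary rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
Z_rot = Z @ R.T                                # rotate every sample

# N(0, I) rotated is N(0, R R^T) = N(0, I): the covariance is unchanged,
# which is exactly why the source directions cannot be recovered.
print(np.cov(Z_rot.T))                         # approximately the identity matrix
```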
But once your data is not Gaussian, then it is possible to recover the independent sources, okay; I'm just stating that informally. [00:07:27] So let's see, so what I'd like to do is develop the ICA algorithm assuming that the data is non-Gaussian. Okay, now, in order to derive the ICA model, we need to figure out what is the density of s, right, and I'm going to use p subscript s, you know, of the random variable s, to represent the density of s. An equivalent way to represent the probability, the density, of a continuous random variable is via its CDF, which stands for cumulative distribution function. [00:08:17] And the cumulative distribution function of a random variable, F(s), in probability, is defined as the chance that the random variable is less than that value: F(s) = P(S <= s). So, I guess my notation has been inconsistent, sorry, but this capital S I'm using to denote the random variable, and this is some constant, right, and it's that same constant that is that lowercase s, okay. [00:08:50] And so for example, if this is the PDF of the random variable s, maybe of a Gaussian, right, the CDF is a function that increases from 0 to 1, where the height of the CDF at a certain point is a probability. So if you take the curves at the same point: the height of the CDF at a certain point, lowercase s, is the probability that the random variable takes on a value equal to this value or lower, which means that the height of this function is equal to, you know, the probability mass, the area under the curve of your PDF, to the left of that point. Okay, so that's, you know, something statistics courses sometimes teach, I guess. But so there's a mapping between the PDF and the CDF of a continuous random variable, and the relation between the PDF and the CDF is that the density is equal to the first derivative, right, F prime: so if you take the derivative of the CDF, then you should recover the PDF. [00:10:23] Okay. But so, I think, in order to specify, you know, some random variable, we could either specify the PDF, right, the probability density function, or you could specify the CDF, which just, you know, tells me what's the chance of the random variable taking on a value less than any particular value s. And by taking the derivative of this you can always recover the PDF, and by integrating this you can always go back to the CDF, okay. And so what we're going to do in ICA is, instead of specifying a PDF for how speakers' voices sound, we're instead going to specify a CDF, and we have to choose a CDF that is not the Gaussian CDF, because we have assumed that the data is non-Gaussian. And the CDF, you know, is a function that always goes from, right, zero to one.
[00:11:45] All right, so in a little bit we'll specify some CDF for the density of the sources, for what human voices sound like, let's say, and if you differentiate this you will get the PDF, or the density, equal to that. Now, um, we're going to derive a maximum likelihood estimation algorithm in a minute, but our model is that x = As, which is equal to, I guess, W^{-1} s, and s = Wx, right. So that's the model, and in order to derive a maximum likelihood estimate for the parameters, well, this is going to be the density of x. [00:12:43] So this is the relationship between x and s: x = As = W^{-1} s, and s = Wx, right. So this is the model, and what I'd like to do is, let's say you know the density of s: what is the density of x, if x is computed as the matrix A times s? [00:13:15] So one step that's tempting to take is to just say, well, s = Wx, so the probability of x is just equal to the probability of s taking on that certain value, right? So, I mean, this is s, and so the probability of seeing a certain value of x is equal to the probability of s taking on that corresponding value, because, assuming W is an invertible matrix, there's a one-to-one mapping between x and s; so to find the probability of x, just find the paired s and compute the corresponding probability. It turns out this is incorrect. This works for probability mass functions, for discrete probability distributions that take on discrete values, but it is actually incorrect for continuous probability densities. [00:14:02] So let me, um, show an illustration and then go back to derive what is the correct way of computing the density of x. Oh, and we want the density of x because when you get the training set you only get to observe x, and so for finding the maximum likelihood estimate of the parameters you need to know the density of x, so you can, you know, choose the parameters, choose the parameters W, that maximize the likelihood; so that's why we want to compute the density of x. But, um, let's use a simple example. Let's say the density of s is p_s(s) = 1{0 <= s <= 1}, okay, so this is s distributed uniformly from 0 to 1, and let's say x = 2s; so, in our notation, A = 2 and W = 1/2, and this is an n = 1, one-dimensional example. [00:15:02] So this is the density of s, right, the uniform distribution from 0 to 1, and if x = 2s, then it seems like x should be, x is distributed uniformly from 0 to 2, right? Because if s is uniform from 0 to 1 and you multiply by 2, x is distributed uniformly from 0 to 2, and so the density of x is equal to this, and it's now half as
tall, because probability density functions need to integrate to one, right? So this is the uniform-from-zero-to-two probability density function, and so the correct formula is p(x) = 1/2 times the indicator that x is between 0 and 2, okay? [00:16:27] And more generally, the correct formula for this has an extra factor: the determinant of the matrix W. In the case of a real number, the determinant of a real number is just its absolute value, which is why we have the density of x equal to 1/2 — you know, the absolute value of the determinant of W — times the indicator that Wx, which here is x/2, is within 0 to 1, okay? Right, so I guess this is the indicator that 0 ≤ x/2 ≤ 1, okay? [00:17:28] So this is an illustration showing why this is the right formula, with the absolute value of the determinant of W in there, as the way to compute the density of x.
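As a quick sanity check, the 1-D example above can be sketched in a few lines of Python (my own illustration; the function names are not from the lecture): s ~ Uniform[0, 1], x = 2s, so A = 2, W = 1/2, and the change-of-variables formula gives p_x(x) = |W| · p_s(Wx).

```python
def p_s(s):
    """Density of the source: uniform on [0, 1]."""
    return 1.0 if 0.0 <= s <= 1.0 else 0.0

def p_x(x, W=0.5):
    """Density of x = A s implied by p_x(x) = |W| * p_s(W x)."""
    return abs(W) * p_s(W * x)

print(p_x(1.0))   # 0.5 -- x = 1 lies inside [0, 2]
print(p_x(3.0))   # 0.0 -- x = 3 lies outside [0, 2]
```

Without the |W| factor, the implied density would integrate to 2 rather than 1, which is exactly the normalization issue the determinant term fixes.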
If you're not familiar with determinants: the determinant is a function you can call in NumPy to compute, but the intuition is that the determinant measures how much a linear map stretches out a local volume, and so you need to divide by the determinant of A, or multiply by the determinant of W, in order to make sure these distributions normalize to one, right? So that's where that comes from. [00:18:09] So we're nearly done; just one more decision and then we can derive the maximum likelihood estimate of the parameters. The last thing we need to do is choose the density of what, you know, speakers' voices sound like, and as I said just now, what we're going to do is choose a non-Gaussian distribution, right? And so, well, F(s) is equal to the chance of this person's voice — the random variable s — being less than a certain value, and we need a smooth function that goes between, you know, 0
and 1, right? We need a smooth function that has that shape. And so, well, what functions do we know that have that shape? Let's take the sigmoid function, and it turns out this will work, okay? There are many choices that actually work fine. It turns out that if you choose the sigmoid function to be the CDF, then look at the PDF this induces if you take the derivative — so take p(s) equal to the derivative of the CDF. [00:19:27] It turns out that, compared with taking the Gaussian's CDF, the PDF that this choice induces is something with fatter tails, by which I mean, look at how it goes to zero: the Gaussian density goes to zero very quickly, right? It's like e to the negative s squared — the Gaussian has a square in the exponent of the density — and it turns out that this density, obtained by computing the derivative of the sigmoid, goes to zero more slowly.
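To see the fatter tails concretely, here is a small comparison (my own illustration, not from the lecture) of the density induced by the sigmoid CDF, g'(s) = g(s)(1 − g(s)), against the standard Gaussian density:

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def logistic_pdf(s):
    # the PDF induced by using the sigmoid as the CDF: g'(s) = g(s) (1 - g(s))
    g = sigmoid(s)
    return g * (1.0 - g)

def gaussian_pdf(s):
    # standard normal density: the square in the exponent kills the tails fast
    return math.exp(-s * s / 2.0) / math.sqrt(2.0 * math.pi)

for s in [0.0, 2.0, 5.0, 8.0]:
    print(s, logistic_pdf(s), gaussian_pdf(s))
# near 0 the Gaussian is taller, but a few standard deviations out the
# logistic density is orders of magnitude larger (fatter tails)
```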
And this captures human voices and many natural phenomena better than the Gaussian density, because there is a larger number of extreme outliers that are more than one or two standard deviations away. But there are actually multiple distributions that work: you could have used a double-sided exponential distribution — so take an exponential distribution, make it symmetric on both sides, and that's the PDF — and that will also work quite well for ICA. But I think in the early history of ICA, you know, researchers — I think one of them might have been Terry Sejnowski — thought that you just needed a function with these properties, and you picked the sigmoid and plugged it in, and it works just fine. It's been a good enough default that it's still widely used, right? But people have also used this double-sided exponential, sometimes also called the Laplacian
distribution; this works fine as well as a choice of p(s). [00:21:17] So the next step: the density of s is equal to the product — rather, from i equals 1 through your n sources — of the probability of each of the speakers emitting that sound, right? Because the n speakers are speaking independently, right. [00:22:12] Wait, say that again? Oh yes, you're right, sorry about that — yes, this should have been up here, right: you go from the CDF to the PDF by taking derivatives. Oh, cool. [00:22:47] So s is the vector of all, you know, two speakers' or all five speakers' voices at one moment in time. So the density of s — s is in R^n — is the product of the individual speakers' probabilities, and this is the key assumption of ICA: that, you know, your two speakers or your five speakers are having independent conversations, and so at every moment in time they choose independently of each other what sound to emit. And so, using the formulas you
[00:23:20] teammate and so using the formulas you worked out just now the density of X is [00:23:24] worked out just now the density of X is equal to well as we did the density of W [00:23:36] equal to well as we did the density of W x times the determinant of W so and this [00:23:44] x times the determinant of W so and this is equal to [00:23:59] Oh in this notation WI transpose X this [00:24:04] Oh in this notation WI transpose X this is um right because WI is the I've row [00:24:08] is um right because WI is the I've row of the matrix W and so you know I guess [00:24:13] of the matrix W and so you know I guess s SJ is equal to W J transpose X right [00:24:19] s SJ is equal to W J transpose X right so you take a corresponding row and [00:24:20] so you take a corresponding row and multiply it by X to get the [00:24:22] multiply it by X to get the corresponding source actually sorry I [00:24:25] corresponding source actually sorry I think that's right yeah let me use J [00:24:27] think that's right yeah let me use J there okay and so um this writes out so [00:24:42] there okay and so um this writes out so this shows what is the density of X [00:24:46] this shows what is the density of X expressed as a function of P of s which [00:24:51] expressed as a function of P of s which have assumed which affects as a CDF of [00:24:53] have assumed which affects as a CDF of the sigmoid as a as the derivative of [00:24:56] the sigmoid as a as the derivative of the sigmoid and as a function of the [00:24:58] the sigmoid and as a function of the parameter W right so this is a model [00:25:02] parameter W right so this is a model that given a setting of the parameters W [00:25:05] that given a setting of the parameters W which square matrix allows us to write [00:25:09] which square matrix allows us to write down what's the density of banks [00:25:20] so the final step is we could use [00:25:25] so the final step is we could use maximum likelihood estimation to [00:25:28] maximum 
[00:25:20] So the final step is we can use maximum likelihood estimation to estimate the parameters W. So the log likelihood of W is equal to the sum over the training examples of the log of this density, and you can use stochastic gradient ascent: take the derivative of the log likelihood with respect to W — it turns out this is derived in the lecture notes; I'll just write it out here. [00:26:31] I hope I got that right — yeah, okay, right. [00:26:41] And it turns out that if you use this formula — don't worry about the form of the derivative; the full derivation is given in the lecture notes — but if you use the derivative of the log likelihood with respect to the parameter matrix W, and use stochastic gradient ascent to maximize the log likelihood, and run this for a while, then you can get ICA to find a pretty good matrix W for unmixing the sources, okay? So just to recap the whole algorithm, right: you would have a training set of x^(1) up through x^(m), where each of your training examples is
the microphone recordings at one moment in time, and so time goes from 1 through m. What you do is initialize the matrix W, say randomly, and use gradient ascent with this formula for the derivative in order to maximize the log likelihood of the data. And after gradient ascent converges, you then have a matrix W, and you can then recover the sources as s = Wx, and now that you have the sources, you can take, say, s_1^(1) through s_1^(m) and play that through your, you know, laptop speaker in order to hear what source 1 sounds like. So that's how you would take, you know, overlapping voices and try to unmix them. [00:28:25] [Student question, partly inaudible, about whether the sigmoid is a wise choice and about rotational ambiguity.] How to visualize that? Try plotting it in NumPy and matplotlib. I guess if you plot the contours of the density — so it turns out that if this is s_1 and this is s_2, what you do not want is a density whose contours look like that [rotationally symmetric].
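The whole recipe just recapped can be sketched in a few lines of NumPy. This is my own toy illustration, not the course code: the sources, mixing matrix, step size, and number of passes are all made-up choices, and the per-example update is the stochastic gradient ascent rule W := W + α[(1 − 2g(Wx))x^T + (W^T)^{-1}] from the lecture notes.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# two independent fat-tailed (Laplacian) sources, m moments in time
m = 5000
S = rng.laplace(size=(m, 2))
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])              # "unknown" mixing matrix
X = S @ A.T                             # observed recordings, x^(i) = A s^(i)

W = np.eye(2)                           # initialize W (here: identity)
alpha = 0.01                            # step size (toy choice)
for _ in range(10):                     # a few passes over the data
    for i in rng.permutation(m):
        x = X[i]
        g = sigmoid(W @ x)
        W += alpha * (np.outer(1.0 - 2.0 * g, x) + np.linalg.inv(W.T))

S_hat = X @ W.T                         # recovered sources, s = W x
print(np.round(W @ A, 2))               # roughly a scaled permutation of I
```

If it has worked, each recovered source should track one true source up to the usual ICA ambiguities of scaling and permutation.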
I haven't done this for a while, but I believe if you take this distribution, the contours will look like that — it's been a while since I've thought about this, but I think it'll look like that — so this is not rotationally symmetric. Do you know the Laplacian? Yeah, okay — yes, the Laplacian looks like that, I think, and the sigmoid-derived density looks a bit like that too, yeah. To double-check this, even — right, post on Piazza if one of you plots it, because as you can see I haven't done it in a while. [00:29:28] [Student asks about interpreting it differently.] Oh — actually, yes, the plot should be like this, I think. Oh, sorry — G is the sigmoid function, yes. [00:30:15] Sure — what's the closest nonlinear extension of this? We don't have a great answer to that right now, frankly. So a bunch of people, including, you know, my former students and me, have done research to try to extend this to nonlinear versions, and there's some stuff that kind of works, but I don't think there's like a tried-and-true
algorithm that I'm ready to say is the right way to do it. [00:30:50] Yeah — actually, maybe I should say a little bit more, if other people find it interesting; yeah, let me try. [00:31:36] So from several years ago — and kind of ongoing — there's been research, some done by my collaborators and me, and some by others, trying to build nonlinear versions of ICA. And so some of you might have seen the slightly infamous Google cat result, right? This was done in the Google Brain project — one of the first projects we did, a few years ago now — where we trained a neural network on, I think it was, many, many hours of YouTube videos, and eventually it learned to detect cats, because apparently there are a lot of cats in YouTube videos. And it turns out that the algorithm we used was sparse coding, which is actually very closely related to ICA, and so this rough
algorithm was attempting to build a nonlinear version of ICA, where you train one layer of sparse coding, let's say, to extract low-level features, and then recursively apply this on top, to learn not just edge detectors but object-part detectors, and then eventually, you know, the somewhat infamous Google cat. But I think this is actually still ongoing research. Some of the most interesting research has been on hierarchical versions of sparse coding — sparse coding is a different algorithm that turns out to be very closely related to ICA, and you can show that they're optimizing for very similar things, so I'd say sparse coding is very similar to ICA — and there are hierarchical versions of this that try to turn it into a multi-layer neural network, and it kind of works,
in the sense that you can show it learns these features. But what happened was that supervised learning really took off, and the whole world shifted its attention to supervised learning and building deep supervised-learning neural networks, and so the hierarchical sparse coding idea — running ICA over and over to learn nonlinear versions — gets rather less attention from researchers than it really deserves. So maybe someone in this class will go back and do more research on that; I still think it's a promising area. [00:33:45] All right, so let me wrap up with some ICA examples. So there's actually [work by] a former TA from the class, Katie Chang, and it turns out that ICA is routinely used to clean up EEG data today. So what's an EEG? Right — you place many electrodes on your scalp to measure little electrical recordings on the surface of your scalp. So
you know, what does the human brain do? Right — the neurons in your brain, right now, fire and generate little pulses of electricity, and if you place an electrode on your scalp, you can get a very weak measurement of the voltage of the electrical activity at, you know, a certain point on your scalp. [00:34:38] So the analogy to — oh, excuse me, something's wrong [with the slides] — all right, so the analogy to the cocktail party problem, the overlapping speakers' voices, is that, you know, your brain does a lot of things at the same time, right? Your brain helps regulate your heartbeat — part of your brain does that — and a part of your brain, you know, makes your eyes blink every now and then, another part of your brain is responsible for making sure that you breathe, and then part of your brain is responsible for thinking about machine learning and stuff like that, right? So your brain
actually handles many tasks at the same time. And as your brain — sorry, I'm still not sure what's wrong with this — okay, and as your brain carries out these different tasks in parallel, different parts of your brain generate different electrical impulses. So think of it as — imagine that you have, you know, a cocktail party in your head, right? So many overlapping voices — these are now voices in your head — but one part of your brain is saying, all right, heart, go and beat, heart, go and beat harder, and another part of your brain is saying, breathe in and breathe out, breathe in and breathe out. Now if only I knew, you know, what's wrong with this PowerPoint — right. And what each electrode on the surface of your scalp does is measure an overlapping combination of all of these voices, because as the different parts of the brain send out these
electrical impulses, they add up, and so any one point on the surface of your scalp reflects a sum — or a mixture, really a sum — of these different voices, of these different things your brain is doing. And so, just zooming into the EEG plot: each line is the voltage measured at a single electrode, right, on, say, your scalp, and these signals are quite correlated. You see that when there's a massive voice in your brain shouting, you know, like, right, "beat your heart" or "blink your eyes", that signal can get through to all of the different electrodes, which is why you can see these artifacts reflected in all of these electrodes. All right — [00:36:51] it turns out a pretty good way to clean up this data is to take all of these time series, pretty much exactly as we learned about with the ICA algorithm, and separate them out into the independent components. And so it turns out in this example there are
two components, one corresponding to driving the heartbeat, and one that's actually the eye-blink component. And so one way to clean up this data — sorry, I should really wonder what's wrong with this; all right, let me try something. [00:37:44] [Pointing at the components:] this says "heartbeat", this says "eye blink". All right — and if you run ICA, and then a person can say, oh, that's the heartbeat, that's the eye blink, and you remove — subtract out — those components, then you can end up with a much more cleaned-up EEG signal, which you can then use for downstream processing. Sorry — over there, yes? [Student question.] There's a lot of research on using EEG readings to try to guess, at a high level, what you're thinking, right? It turns out that you can train a, you know, supervised learning algorithm to try to decide: are you thinking of a noun or a verb, or are you thinking of something edible or something inedible?
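The clean-up step described here — unmix, zero out the components a person has labeled as artifacts, and map back to the electrode space — can be sketched like this. Everything in this snippet (the data, the unmixing matrix, and which components are artifacts) is a placeholder of mine, not real EEG:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4                                          # number of electrodes / components
X = rng.normal(size=(1000, n))                 # stand-in for EEG, one row per time step
W = np.eye(n) + 0.1 * rng.normal(size=(n, n))  # stand-in for an ICA-learned unmixing matrix
artifact_components = [0, 2]                   # say, heartbeat and eye-blink components

S = X @ W.T                                    # independent components, s = W x
S[:, artifact_components] = 0.0                # subtract out the artifact sources
X_clean = S @ np.linalg.inv(W).T               # back to electrode space, x = W^{-1} s

print(X_clean.shape)                           # (1000, 4)
```

The key point is that zeroing a component removes that source's contribution from every electrode at once, which is why this works better than filtering each channel separately.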
There's been very interesting research trying to use EEG to figure out — just at a very coarse level; no, not quite reading every thought you are thinking, but — can we categorize very coarse-level thoughts, like are you thinking of a person or are you thinking of an object? You can actually do that to some extent using EEG readings, and cleaning up the data to remove the eye-blink and heartbeat artifacts is a very useful pre-processing step to get cleaner data to feed into the learning algorithm, to try to categorize, you know, some coarse category of what you're thinking, okay? [00:39:04] And then, one more result — it turns out that — I mentioned that Google cat thing just now — it turns out that if you train ICA — the font is messed up — if you train ICA on natural images, ICA will say that the natural independent components of natural images are these edges. That is, you know, when you
edges and as in that you know when you see a little image patch in the world we [00:39:30] see a little image patch in the world we see you know look somewhere in there one [00:39:32] see you know look somewhere in there one looked just a tiny little piece of the [00:39:34] looked just a tiny little piece of the image right like 10 pixels by 10 pixels [00:39:36] image right like 10 pixels by 10 pixels and if you take that data and model as [00:39:39] and if you take that data and model as ICA I say we'll say that the world is [00:39:42] ICA I say we'll say that the world is made up of edges or made up of patches [00:39:44] made up of edges or made up of patches like these and that the way you end up [00:39:47] like these and that the way you end up with images in the world is by each of [00:39:49] with images in the world is by each of these patches [00:39:50] these patches you know independently saying is there [00:39:51] you know independently saying is there reservations or horizontal insurers [00:39:53] reservations or horizontal insurers is there this type of light on the left [00:39:57] is there this type of light on the left dark on the right is that this type of [00:39:58] dark on the right is that this type of lighter on top doctor the bottom and so [00:40:01] lighter on top doctor the bottom and so on and it's by adding all of these [00:40:03] on and it's by adding all of these voices there you get a typical image [00:40:04] voices there you get a typical image passionate world so there are there [00:40:06] passionate world so there are there interesting theories in neuroscience [00:40:07] interesting theories in neuroscience about whether this is how you know the [00:40:09] about whether this is how you know the human brain learns to see as well so so [00:40:12] human brain learns to see as well so so very very same work on them I see and [00:40:14] very very same work on them I see and sparse coding to try to use these [00:40:16] sparse coding to try to 
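The generative story told here — each patch is a sum of independently activated edge components — can be sketched directly. The two 4x4 edge bases below are hand-made for illustration, not learned; running an actual ICA implementation (e.g. FastICA) on many such patches is what would recover filters close to these bases.

```python
import numpy as np

# Generative sketch of "images = independent edge components added up".
rng = np.random.default_rng(1)

edge_v = np.array([[-1, -1, 1, 1]] * 4, dtype=float)  # dark left / light right
edge_h = edge_v.T.copy()                              # dark top / light bottom

def sample_patch():
    # Independent, heavy-tailed activations for each edge component,
    # mirroring ICA's non-Gaussian independence assumption.
    s = rng.laplace(size=2)
    return s[0] * edge_v + s[1] * edge_h

patches = np.stack([sample_patch() for _ in range(1000)])
# Running ICA on patches.reshape(1000, 16) would recover filters
# close to edge_v and edge_h — the "edges" mentioned in the lecture.
```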
These mechanisms have been used to try to explain how, you know, the human brain learns to perceive images, for example. Okay, so that's it for the algorithms of ICA. [00:40:36] Just a few more comments. I think on Monday someone asked: do the number of speakers and the number of microphones need to be equal? It turns out that if the number of microphones is larger than the number of speakers, that's actually fine. If you run ICA, or a slightly modified version of it, you find that some of the speakers are just silent speakers. So if you have ten microphones and five speakers and you run this algorithm on the ten microphone recordings, you may find that five of the sources are just silent — or there are ways to just not model those extra sources at all, if you think that they're just sources of silence. So this slightly modified version works quite well when the number of microphones is larger than the number of speakers. [00:41:38] If the number of microphones is smaller than the number of speakers, then that's still very much a cutting-edge research problem. So for example, if you have two speakers and one microphone, it turns out that if you have one male and one female speaker — so one relatively high pitch and one much lower pitch — then you can sometimes have algorithms that separate out the two voices with one microphone, but it doesn't work that reliably; it's a little bit finicky. There have been research papers published showing that you can make a reasonable attempt at separating out two voices with just one microphone when the pitches are quite different.
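The easy direction — more microphones than speakers — can be illustrated with linear algebra alone. With 3 microphones and 2 speakers the mixing matrix A is 3x2 (tall), and no information is lost: its pseudoinverse recovers the sources exactly. The matrix below is an assumed toy mixing; real ICA estimates the unmixing from the recordings without knowing A.

```python
import numpy as np

# 3 microphones, 2 speakers: the tall mixing matrix A (3x2) is not
# invertible, but it has full column rank, so a left inverse exists.
rng = np.random.default_rng(2)
n = 200
S = rng.laplace(size=(2, n))     # 2 independent, non-Gaussian sources
A = np.array([[1.0, 0.3],        # assumed 3x2 mixing matrix
              [0.2, 1.0],
              [0.7, 0.6]])
X = A @ S                        # 3 microphone recordings

S_hat = np.linalg.pinv(A) @ X    # pseudoinverse recovers both sources
```

This is why running ICA with "extra" microphones is fine: the extra channels are redundant, and a slightly modified model just reports the surplus sources as silent.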
With, say, one male voice and one female voice that can sometimes work, but separating out two male voices or two female voices is still very hard, and there's ongoing research in those settings. So that's ICA, and you get to play with it more in your homework problem as well. Okay, any last questions about ICA? [00:42:42] [Student question] Oh wait, sorry, where would it be — yeah, so I think if you actually go through the math it just breaks down, because there you can have two independent sources but W is now no longer a square matrix, right? It'll be — what is it — so we write x = As, right, and if x is a real number and s is two-dimensional, then A would be 1 by 2, s would be 2 by 1, and x is 1 by 1. Then, you know, A inverse kind of doesn't exist, right? So you'd need to come up with some other way to formulate the model.
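The degenerate case just described — one microphone, two speakers — is a two-line dimension check: with x scalar and s in R^2, A is 1x2, and a non-square matrix has no inverse, which is exactly why plain ICA breaks down here.

```python
import numpy as np

# One microphone, two speakers: x = A s with A a 1x2 matrix.
A = np.array([[1.0, 0.7]])       # two sources summed into one channel
s = np.array([[0.5], [-1.2]])
x = A @ s                        # a single 1x1 observation

try:
    np.linalg.inv(A)             # inv is only defined for square matrices
    invertible = True
except np.linalg.LinAlgError:
    invertible = False
# invertible is False: you cannot unmix two sources from one channel
# without extra knowledge (e.g. the pitch difference mentioned above).
```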
Where you have one microphone, the problem is just: how do you separate out two overlapping voices? It takes much higher-level knowledge to separate out the two voices. [00:44:42] [Student question] Oh, I see, right, let's see — so if you don't know how many speakers there are: you have all these microphones, or with EEG the number of electrodes you have is fixed, so that's just your data set. And it turns out that if you run ICA with a large number of assumed speakers, you find some of the speakers are silent. There are also some versions of ICA that — let's see — it turns out that if you think there is a relatively small number of speakers, then you don't need to explicitly model a large number of speakers. Instead, what you would model — so again, suppose you set this up as a maximum likelihood estimation problem — let's say that x is in R^10, so ten recordings, but you suspect that there are only five speakers. Then in this case the matrix A would be — what is it — 10 by 5, right, to mix the five sources into ten recordings. And you could formulate the maximum likelihood estimation problem assuming the existence of only five speakers, without modeling a large number of speakers and then finding later that some are silent. So if you parameterize the model like this, using A instead of W, then you can formulate the maximum likelihood estimation problem where you just assume there are five speakers, and x is generated by the five sources mixing through a linear map, plus noise. [00:46:35] [Student question] Oh, I see, sure, right — how do you know how many speakers you have? So I think it's one of those things a little bit like k-means, I guess, where you try it and see what works: if you find that the first few speakers capture most of the variance and the additional speakers are quite silent — quite small — then you could just cut off there. I don't want to go too much into the different numbers of speakers and microphones for ICA. Let me just take a couple of questions — only one question, yeah. [00:47:11] [Student question, inaudible] Um, I'm sure you can — it's not usually done in this version of the algorithm, but I would not be surprised if there are some versions where you do. I've not seen that done myself, actually. All right, cool. [00:47:48] Good, um, so — [00:48:39] all right, so that wraps up our chapter on unsupervised learning. So you learned about k-means clustering, the
EM algorithm for mixtures of Gaussians — really the mixture of Gaussians model — the factor analysis model, and also PCA, and then today the ICA, independent components analysis, algorithm. And all of these were algorithms that could take as input an unlabeled training set — just the x's and no labels — and find various interesting structures in the data, such as clusters or subspaces, or, in the case of ICA, the voices of the individual speakers. And you'll implement ICA and play with it yourself in the homework problem, where you get to separate out, I think, five overlapping voices. So of the four major topics we cover in this course — we've done supervised learning, kind of advice for machine learning, and unsupervised learning — the fourth and final major topic we'll cover in this class will be reinforcement learning. [00:49:55] So to motivate reinforcement learning, let's say you want to have a computer learn to fly a helicopter, right? I think I showed some of the videos of that in the first lecture, so I'll just skip that here. But it turns out that if you are, at every point in time, given the position
of the helicopter — call that the state of the helicopter — and you also take an action on how to move the control sticks, you know, to make the helicopter fly in a certain trajectory, it turns out that it's very difficult to know what's the one right answer for how to move the control sticks of a helicopter. So you don't have a mapping from x to y — because you can't quite specify the one true way to fly a helicopter — and it's hard to use supervised learning. [00:50:42] And what reinforcement learning does is — it is an algorithm that doesn't ask you to tell it the right answer at every step. It doesn't ask you to tell it exactly what's the one true way to move the controls of a helicopter at any moment in time. Instead, your responsibility as a designer, as a machine learning engineer, is to specify a reward function that just tells the helicopter when it's flying well and when it's flying poorly. So your job as a designer is to write a cost function, or a reward function, that gives the helicopter a high reward whenever it's doing well — flying accurately, flying the trajectory you want it to — and gives the helicopter a large negative reward whenever it crashes or does something bad. [00:51:24] And I think — you know, think of this like training a dog, right? You say "good dog" and you say "bad dog", and the dog figures out when to do more of the "good dog" things. Your job is not to tell the dog — well, you can't actually talk to the dog and tell it what to do, I guess that doesn't work — but you can tell it "good dog" and "bad dog", and hopefully, from these positive and negative signals, it learns how to do more of the good things. [00:51:44] Another example: let's say you want to write a program to play chess — or, I guess most famously, and arguably somewhat slightly
over-hyped — Go: AlphaGo, right? So it's very difficult to know, given a certain chess board position — or checkers or Go board position — what is the one true move, what's the one best move. So it's very difficult to formulate, you know, playing chess as a supervised learning problem. Instead, the mechanisms used to play chess are much more like reinforcement learning, where you let your program play chess or Go or whatever, and whenever it wins you go "oh, good computer", and when it loses you go "oh, bad computer". So that's a reward function, and the learning algorithm's job is to figure out by itself how to get more of the positive rewards, right? And actually, a common reward for learning to play chess or checkers or Go is a reward of +1 for a win, -1 for a loss, and 0 for a tie. If you're writing a chess-playing program, R(s) = +1 for a win, -1 for a loss, 0 for a tie would be a common choice of reward, where R is the reward function and s is the state.
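The game reward just written on the board — +1 for a win, -1 for a loss, 0 for a tie — is trivial to write down as a function of the terminal state. The string labels here are a hypothetical stand-in for however your chess program represents game outcomes.

```python
# The common game reward from the lecture: R(s) = +1 win, -1 loss, 0 tie.
# "win"/"loss"/"tie" are illustrative labels for the terminal states.
def R(s):
    if s == "win":
        return +1
    if s == "loss":
        return -1
    return 0  # tie, and (in this scheme) any non-terminal state
```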
Okay — and I'll go into the notation in a little bit. And so, as you can imagine, giving only this type of information to the chess-playing program places much more burden on the program to figure out what to do. In fact, one of the challenges of reinforcement learning — so that's called the reward, and that's called the state, and the state means, um, the status of the chess board — where are the pieces on the chess board — or the status of the helicopter — where exactly is the helicopter, is it right side up or upside down, and where is it, right? [00:53:49] And it turns out one of the challenges — one of the things that makes reinforcement learning hard — is the credit assignment problem. And that means: if your program is playing a game of chess and, let's say, it loses on move 50 — you know, so it plays a game, and then on move 50, right, it's checkmated and loses to its opponent — so it gets a reward of negative one. But how can the program actually figure out what it did well and what it did poorly? Right, if you lose a game on move 50, it might be that the program made a really bad move — made a blunder — on move 20, and then, you know, it just hobbled along for another 30 moves before its fate was sealed, right? So in the game of chess, if you make a bad mistake early on, there can still be many, many moves before the final outcome of winning or losing is reached. [00:54:38] Or another example: it turns out that if you are trying to build a self-driving car, if ever a car crashes, chances are the thing the car was doing right before it crashed was brake. But it's not braking that causes the crash; it's probably something else it did many, many seconds ago that then led to the bad outcome. So there's a bad outcome — how does the algorithm know, of all the things it did before, which it did well, which it should do more of, and which it did poorly, which it should do less of? And conversely, if there's a good outcome — you know, like it wins a game of chess — well, how do you know what you did well, right? So that's called the credit assignment problem: when your algorithm gets some reward, how do you actually figure out what you did well and what you did poorly, so you know what to do more of and what to do less of, right? So as we develop reinforcement learning algorithms, we'll see that the algorithms we use have to at least indirectly try to solve the credit assignment problem. Okay. [00:55:46] So, um, reinforcement learning problems — like playing chess, flying helicopters, or, you know, building these robots — are modeled using the MDP, or Markov decision process. [00:56:18] And this is the
notation — the formalism — for modeling how the world works, and reinforcement learning algorithms will then solve problems posed in this formalism. So what is an MDP? An MDP is a five-tuple (S, A, {P_sa}, γ, R), and let me explain what each of these is. So S is a set of states — for example, in chess this would be the set of all possible chess positions, or in flying a helicopter this would be the set of all possible positions and orientations and velocities of your helicopter. A is the set of actions — for the helicopter this would be all the positions you could move your control sticks to, or in chess it'd be all the moves you could make, you know, in a game of chess. [00:57:39] P subscript sa — P_sa — is the state transition probabilities, and we'll see later that these state transition probabilities tell you: if you take a certain action a in a certain state s, what's the chance of you ending up at a particular different state s'? [00:58:16] Gamma, γ, is the discount factor, a number between 0 and 1 — don't worry about this for now, we'll come back to it in a minute. And R is that all-important reward function. [00:58:41] So in order to develop a reinforcement learning algorithm, I'm going to use as a running example a simplified MDP that we can draw on the whiteboard, right? Helicopters and chess and Go and so on are really complicated MDPs, so just to illustrate the algorithms I'm going to use a simpler MDP, and this is an example drawn from the textbook by Russell and Norvig. We'll use a simple MDP in which you have a robot navigating a simple maze, and there's an obstacle — so this is a grid world, you see, a robot, you know, navigating this very simple maze, and this is a pillar, or a wall, so you can't walk into that wall. And let me just use indexing on the states as follows.
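The five-tuple (S, A, {P_sa}, γ, R) just defined maps naturally onto a small container type. This is a sketch of one plausible way to hold the pieces in code, not anything official from the course; the field names follow the board notation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set, Tuple

State = Tuple[int, int]   # e.g. a grid position like (3, 1)
Action = str              # e.g. "N", "S", "E", "W"

@dataclass
class MDP:
    states: Set[State]                                  # S: set of states
    actions: Set[Action]                                # A: set of actions
    P: Dict[Tuple[State, Action], Dict[State, float]]   # P_sa(s'): transition probs
    gamma: float                                        # discount factor, in [0, 1]
    R: Callable[[State], float]                         # reward function R(s)

# A one-state toy instance, just to show the shape:
toy = MDP(states={(1, 1)}, actions={"N"}, P={}, gamma=0.99, R=lambda s: 0.0)
```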
as follows. So for this MDP, let's go through the five-tuple and talk about what each of the five things is. This MDP has eleven states, corresponding to the eleven possible positions the robot could be in, right, each of these blank squares, so eleven possible states. And the actions are north, south, east and west, right, you can command your robot to move in any of these directions. And I don't know if you've worked with robots before, but when you command a robot to head straight, it doesn't always go exactly straight: sometimes the wheels slip and it veers off at a slight angle. So in this simplified example we're going to model it as: if you command the robot to go north from a certain state, there's a 0.8 probability it successfully goes where you told it to, a 0.1 probability it accidentally veers off to the left, and a 0.1 probability it veers off to the right.
[01:00:49] Okay, if you've worked on real robots, for a lot of robots it is actually important to model the noisy dynamics, the wheels slipping or your orientation being slightly off. Now, a real robot would have a much bigger state space than these eleven states, right, so this is simplified; this is not a realistic model of how robots actually slip. But because we're using such a small state space, just for illustration purposes, we'll use this. And so, for example, the state transition probabilities so specified say that if you're in state (3, 1), the state 3 comma 1, and you command the robot to go north, then the chance of getting to the state (3, 2) is 0.8, the chance of getting to the state (4, 1) is 0.1, the chance of getting to (2, 1) is 0.1, and the chance of getting to other states, like (3, 3), is equal to 0.
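As a concrete sketch of this noisy transition model (the grid coordinates, the blocked cell, and the helper names below are my own choices for illustration, not from the lecture), the dynamics might look like this:

```python
# Noisy gridworld dynamics for the 11-state MDP: a commanded direction
# succeeds with probability 0.8 and veers to each perpendicular
# direction with probability 0.1. (State layout/names are illustrative.)

MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
PERP = {"N": ("W", "E"), "S": ("E", "W"), "E": ("N", "S"), "W": ("S", "N")}

# 4x3 grid with one blocked cell at (2, 2) -> eleven reachable states
STATES = {(x, y) for x in range(1, 5) for y in range(1, 4)} - {(2, 2)}

def step_to(state, direction):
    """Where the robot ends up if it actually moves in `direction`;
    hitting a wall or the blocked cell bounces it back in place."""
    dx, dy = MOVES[direction]
    nxt = (state[0] + dx, state[1] + dy)
    return nxt if nxt in STATES else state

def transition_probs(state, action):
    """P(s' | s, a): 0.8 intended move, 0.1 for each perpendicular slip."""
    probs = {}
    left, right = PERP[action]
    for direction, p in [(action, 0.8), (left, 0.1), (right, 0.1)]:
        s2 = step_to(state, direction)
        probs[s2] = probs.get(s2, 0.0) + p
    return probs
```

For example, `transition_probs((3, 1), "N")` gives 0.8 for (3, 2) and 0.1 each for (4, 1) and (2, 1), matching the numbers in the lecture.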
[01:02:07] Okay, so the state transition probabilities would capture that: if you're here and you command it to go north, there's a 0.8 chance of getting here, a 0.1 chance of getting here, a 0.1 chance of getting here, and, you know, a 0.0 chance of hopping two steps. Oh, and in this simple MDP example we'll just assume that if the robot hits a wall, it just bounces off the wall and stays where it is. So if you tell it to go west and it slips off, it just bounces off the wall and stays exactly where it was.
[01:02:43] Now let's specify the reward function. We'll come back to the discount factor later, but let's say you want the robot to navigate to this cell in the upper right-hand corner. And so to incentivize the robot to get to this square, you know, that's the prize, so in this case let's put a +1 reward there. And let's say you really don't want the robot to go to this cell, so we could put a -1 reward there, right. So the way you specify the task for a robot to do is in designing the reward function.
[01:03:40] So in our example, just carrying over the +1 and -1 from the diagram, we have that the reward at the cell (4, 3) is +1 and the reward at the cell (4, 2) is -1. And then, you know, if you want the robot to get to the +1 reward cell as quickly as possible, then, again, there are many ways of designing reward functions, but one common choice would be to put a very small negative penalty, such as setting the reward to -0.02 for all other states. And the effect of a small negative reward like this is to charge the robot, right, for every step it is just loitering around, so you charge it a little bit for using electricity and wandering around, because this incentivizes the robot to hurry up and get to the +1 reward.
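A reward function matching this design (cell coordinates as in the lecture's example; the function name is mine) could be sketched as:

```python
def reward(state):
    """Reward design from the gridworld example: +1 goal at (4, 3),
    -1 bad cell at (4, 2), and a small living penalty of -0.02
    everywhere else to discourage loitering."""
    if state == (4, 3):
        return 1.0
    if state == (4, 2):
        return -1.0
    return -0.02
```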
Right, so you give a small penalty for loitering and wasting electricity.
[01:05:11] So this is how an MDP works: your robot wakes up at some state s0 at time zero, because you turned on the robot, and the robot says, oh, I'm at this state. And based on what state it's in, it gets to choose some action a0, so it decides, do I want to go north, south, east or west, and chooses some action. Based on the action, the consequence of that choice is that it gets to some state s1 at the next time step, which is distributed according to the state transition probabilities governed by the previous state and the action it chose; so depending on what action it chose, there are different chances of moving north, south, east or west. Now that it's in s1, it then has to choose a new action a1, and as a consequence of the action a1 it gets to some new state s2, which is governed by the state transition probabilities P_{s1 a1}, and so on. Okay, and then the robot just keeps on running.
[01:06:32] And so the robot will go through a sequence of states s0, s1, s2 and so on, depending on the actions it chooses, and the total payoff is written as follows, with one more detail, which is that term gamma. So think of gamma as a number like 0.99; gamma is usually chosen to be just slightly less than one. So the total payoff is the sum of rewards, or more technically the sum of discounted rewards, R(s0) + gamma R(s1) + gamma^2 R(s2) + ..., and what this does is add up all the rewards the robot receives over time, but the further a reward is into the future, the smaller the gamma^t that that reward is multiplied by. Okay, so whatever reward you get at time zero, you get all of that; the reward at time one is multiplied by 0.99, the reward at time two by 0.99 squared, then cubed, and so on.
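The discounted total payoff just described can be sketched in a couple of lines (the function name is mine):

```python
def total_payoff(rewards, gamma=0.99):
    """Discounted return: R(s0) + gamma*R(s1) + gamma^2*R(s2) + ...
    `rewards` is the sequence of rewards received along a trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))
```

For instance, a robot that pays the -0.02 living penalty twice and then reaches the +1 cell earns `-0.02 - 0.02*0.99 + 1.0*0.99**2`, slightly less than 1 because the goal reward was discounted.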
[01:07:52] And so what the discount factor does is it has the effect of giving a smaller weight to rewards in the distant future, and this means it encourages the robot to get the positive rewards faster and to postpone the negative rewards, right. And so in financial applications the discount factor has a natural interpretation as the time value of money: if you have a dollar today, you know, you're better off having a dollar today than having a dollar a year from now, right, because you can put the dollar in the bank and earn interest for a year on your dollar, and so dollars today are worth strictly more than dollars in the future. And conversely, having to pay a dollar a year from now is also better than having to pay a dollar today, right, because you could, you know, save your money and earn interest and then issue the payment to someone else a year from now rather than now, and then you're actually slightly wealthier. And so the gamma in financial applications has an interpretation as the time value of money, or as the interest rate, I guess.
[01:09:02] But more generally, even for non-financial applications — there are some financial applications of reinforcement learning, but lots of non-financial ones as well — this mechanism of using a discount factor has the effect of encouraging the system to get to the positive rewards as quickly as possible, and conversely to try to push the negative rewards as far into the future as possible, right. Oh, and I think, to be pragmatic, there are two reasons why people use gamma. The story I just told, time value of money, positive rewards sooner, negative ones postponed, that's the story you tend to hear people say in terms of why
we have a discount factor. The other reason for the discount factor is actually a much more pragmatic one, which is that a lot of the reinforcement learning algorithms you'll see converge much faster, or work much better, if you're willing to have a discount factor. Right, so it turns out that if gamma is equal to 1, if gamma is not strictly less than 1, it's much harder: there are many reinforcement learning algorithms that may not converge, or whose proofs of convergence, you know, may not go through. So this is a pragmatic thing that makes the job much easier for your algorithms. Now I see some of you shaking your heads in disapproval, all right. [Student question.]
[01:10:31] Yeah, yes, that's a good point. So one of the things is, if there's no gamma, the reward sum, you know, could increase without bound, whereas by having gamma discount things, the total payoff stays a finite value rather than unbounded; that's one of the facts that goes into some of the proofs, some of the reasons why a lot of these algorithms converge. Yeah.
Okay, so the goal of reinforcement learning is to choose actions over time so as to maximize the expected total payoff.
[01:11:32] And in particular, what most reinforcement learning algorithms will come up with is a policy that maps from states to actions, right. So the output of most reinforcement learning algorithms will be a policy, or a controller; in the RL world we tend to use the term policy, but policy just means controller: it maps from states to actions. So it turns out that, for the MDP that we have, right, it turns out that this is the optimal policy.
[01:12:36] So, for example, I want you to take this cell here: this policy is saying that pi applied to the state (3, 1) is equal to west.
[01:12:59] So I separately worked out what the optimal policy is. To say we "execute this policy" means that whenever you're in a state s, you take the action given by pi of s; that's what it means to execute a certain policy. And it turns out that this policy, which I worked out separately, right, offline, you know, on my laptop, is the optimal policy for this MDP, and that if you execute this policy, meaning whenever you're in a certain state you take the action indicated by the arrow, this is the policy that maximizes the expected total payoff.
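Executing a policy can be sketched as a small simulation loop. The policy below is a made-up placeholder (always go north), not the lecture's optimal policy, and terminal-state handling is omitted; the dynamics repeat the 0.8/0.1/0.1 model from earlier:

```python
import random

MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
PERP = {"N": ("W", "E"), "S": ("E", "W"), "E": ("N", "S"), "W": ("S", "N")}
STATES = {(x, y) for x in range(1, 5) for y in range(1, 4)} - {(2, 2)}

def sample_next(state, action, rng):
    """Sample s' from the noisy dynamics: 0.8 intended, 0.1 each slip."""
    u = rng.random()
    direction = action if u < 0.8 else PERP[action][0 if u < 0.9 else 1]
    dx, dy = MOVES[direction]
    nxt = (state[0] + dx, state[1] + dy)
    return nxt if nxt in STATES else state  # bounce off walls

def execute_policy(pi, s0, steps, seed=0):
    """Executing policy pi: from every state s, take action pi[s].
    Returns the visited trajectory s0, s1, ..., s_steps."""
    rng, s, traj = random.Random(seed), s0, [s0]
    for _ in range(steps):
        s = sample_next(s, pi[s], rng)
        traj.append(s)
    return traj

# Hypothetical placeholder policy (NOT the lecture's optimal arrows):
pi = {s: "N" for s in STATES}
```

A real RL algorithm's job, as described next, is to produce the `pi` dictionary itself.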
[01:14:03] And the problem in reinforcement learning is: given the definition of an MDP, or given a problem, suppose the problem is an MDP, figure out what the set of states is, what the set of actions is, what the state transition probabilities are, specify a discount factor, specify a reward function, and then get a reinforcement learning algorithm to find the policy pi that maximizes the expected payoff. And then when you want your robot to act, or when you want your chess-playing program to act, whenever you're in some state s, take the action given by pi of s, and hopefully this will result in a robot that, you know, efficiently navigates to the +1 state.
[01:14:45] So it turns out that MDPs are quite good at making fine distinctions. One example: it's actually not totally obvious here whether you're better off going north or going west, right. And it turns out there is a trade-off: if you go west here, then, you know, you're going to take a longer route to get to the +1, so you take longer, the +1 is discounted more heavily, and you're taking those small penalties along the way. But on the flip side, if you were to try to go north, you could try to get there
faster, but then there's a 0.1 chance that you accidentally slip off into the -1 state. So what is the optimal action, right? It's actually quite hard to just look at it with your eyes and make a decision, but it turns out that if you solve for the optimal set of actions in this MDP, the answer in this example is to take the longer and safer route. [Student question about whether policies can contain cycles.]
[01:15:46] So if the optimal set of actions is to cycle around, then it should find that; I mean, for example, if there were only penalties everywhere and the best thing were just to go run in a circle, you know, then the algorithm would choose to do that. But in this case you want to get to the +1 as quickly as possible. [Another question.]
[01:16:29] Wait, so, all right, sure, sorry. So chess and checkers and Go and so on, there's a little more complication when you take a move, so actually, to refine the description
of chess: what happens in playing chess is that the state is your board position, right, it's your move, so you see a board, and that's the state. So you make a move, and then the opponent makes a move, and then that's the new state; so the state transitions when you and your opponent have both taken turns and it's back to you, right. And because you don't know exactly what your opponent will do, there's a probability distribution over, if I make a move, what the other person is going to do. Yeah. [Student question: the probabilities, 0.8, 0.1, 0.1 — where do those come from?]
[01:17:18] So we'll talk about that later. In some applications you learn this: if you build a robot, you might not know, is it 0.8, 0.1, 0.1, or, you know, 0.7, 0.15, 0.15. So it's quite common to use data to learn those state transition probabilities as well.
We'll see a specific example of that in a bit. Okay, so, all right, just to summarize where we are: this is how you formulate a problem as an MDP, and then the job of a reinforcement learning algorithm is to go from that MDP to telling you what a good policy is, okay.
[01:17:54] So let's break there. Have a good Thanksgiving, everyone; won't see you for a week and a half, enjoy yourselves, and we'll reconvene after Thanksgiving.
================================================================================ LECTURE 017 ================================================================================ Lecture 17 - MDPs & Value/Policy Iteration | Stanford CS229: Machine Learning Andrew Ng (Autumn 2018) Source: https://www.youtube.com/watch?v=d5gaWTo6kDM --- Transcript
[00:00:03] Welcome back, everyone, hope you had a good Thanksgiving. By the way, in reinforcement learning, which has a lot to do with robotics, right, one of the classic problems a lot of people use reinforcement learning to solve is robotics, and I
think back in May the InSight Mars lander had launched from here in California, and it's about to make an attempt at landing on the planet Mars in the next two and a half hours or so, so I'm excited about that. I think it's actually one of the grandest applications of robotics, because, you know, with the roughly 20-minute light-speed delay from Earth to Mars, once it starts this landing there's nothing anyone on Earth can do, and so I think it's actually one of the most exciting applications of autonomous robotics. You launch this thing, it's now about 20 light-minutes away from planet Earth, so you actually can't control it in real time, and you just have to hope like crazy that your software works well enough to land on this planet, you know. And we'll find out a little bit after noon whether the landing was successful or not. As you know, I, I think, I
just get excited about stuff like this, and I hope you guys do too; being from California, I mean, I take some pride that it launched from my home state of California and is now nearing its landing on Mars. All right.
[00:01:34] So what I want to do today is continue our discussion of reinforcement learning: do a quick recap of the MDP, or Markov decision process, framework, and then we'll start to talk about algorithms for solving MDPs. In particular, we need to define something called the value function, which tells you how good it is to be in different states of the MDP; then we'll define the value function and talk about an algorithm called value iteration for computing the value function, and this will help us figure out how to actually find a good controller, or finally a good policy, for the MDP. And we'll wrap up with learning state transition probabilities and
how to put [00:02:16] transition probabilities and how to put alson together [00:02:17] alson together into an actual reinforcement learning [00:02:19] into an actual reinforcement learning algorithm that you can implement to [00:02:23] algorithm that you can implement to recap our motivating example run the [00:02:26] recap our motivating example run the example from the last time from before [00:02:28] example from the last time from before Thanksgiving was this 11 state MVP and [00:02:31] Thanksgiving was this 11 state MVP and we said that an MDP comprises a five [00:02:35] we said that an MDP comprises a five tuple list of five things with States so [00:02:38] tuple list of five things with States so that example had 11 States actions and [00:02:43] that example had 11 States actions and in this example the actions were the [00:02:45] in this example the actions were the compass direction north south east and [00:02:47] compass direction north south east and west we can try to go in each of the [00:02:48] west we can try to go in each of the four compass directions the state [00:02:50] four compass directions the state transition probabilities and example if [00:02:53] transition probabilities and example if the robot attempts to go north [00:02:55] the robot attempts to go north it has 80% chance of heading north and [00:02:58] it has 80% chance of heading north and 0.1% chance of viewing off to the left [00:03:01] 0.1% chance of viewing off to the left and the point one chance of veering off [00:03:02] and the point one chance of veering off to the right gamma is a number slightly [00:03:07] to the right gamma is a number slightly less than one usually say less than one [00:03:10] less than one usually say less than one there's a discount factor think of the [00:03:12] there's a discount factor think of the 0.99 and R is the reward function that [00:03:16] 0.99 and R is the reward function that helps us specify where we want the robot [00:03:20] helps us 
specify where we want the robot to end up. [00:03:24] And so what we said last time was that the way an MDP works is: you start off in some state s_0, you choose an action a_0, and as a result the MDP transitions to a new state s_1, which is drawn according to P_{s_0 a_0}; then you choose a new action a_1, and as a result the MDP transitions to a new state drawn from P_{s_1 a_1}; and so on. The total payoff is the sum of discounted rewards, and the goal, formally, is to come up with a policy pi, which is a mapping from the states to the actions that tells you how to choose actions from whatever state you're in, such that the policy maximizes the expected value of the total payoff. [00:04:28] And so I think last time I claimed that this is the optimal policy for this MDP, and what this means, for example, is that if you look at this state, this policy is telling you that pi of
(3, 1) equals west. I guess you can write west, or left, or a left arrow, right? From the state (3, 1), the best action to take is to go left, to go west. And so if you're executing this policy, what that means is that on every step, the action you choose is pi of the state that you're in. [00:05:17] Okay, so what I'd like to do now is define the value function. So how did I come up with this policy? What I'd like you to learn is: given an MDP, given this five-tuple, how do you compute the optimal policy? And one of the challenges with finding the optimal policy is that there's an exponentially large number of possible policies. If you have eleven states and four actions per state, the number of possible policies is four to the power of eleven, and that's not that bad only because eleven is a small
MDP. The number of possible policies for an MDP is combinatorially large: it's the number of actions to the power of the number of states. So how do you find the best policy? What you'll learn today is how to compute the optimal policy. [00:06:14] Now, in order to develop an algorithm for computing the optimal policy, we'll need to define three things. Just as a roadmap, what I'm about to do is define V^pi, V*, and pi*, and based on these definitions we'll derive that pi* is the optimal policy. [00:06:44] So let's go through these definitions. First, V^pi: for a policy pi, V^pi is a function mapping from the states to the reals, such that V^pi(s) is the expected total payoff for starting in state s and executing pi. And so sometimes you write this as: V^pi(s) is the expected total payoff given that you execute the
policy pi and the initial state s_0 is equal to s. So that's the definition of V^pi; this is called the value function for the policy pi. [00:08:15] And so what the value function for a policy pi, denoted V^pi(s), tells you is: for any state you might start in, it's a function mapping from states to the reals that says, what's the expected total payoff if you start off your robot in that state and execute the policy pi? And executing the policy pi means taking actions according to the policy pi. [00:08:38] So here's a specific example. Let's consider the following policy pi. [00:08:59] This is not a great policy: from some of these states it looks like it's heading to the minus-one reward. Oh sorry, I should say the reward is plus one if we get here, and technically this is called an absorbing state, meaning
that if you ever get to the plus one or the minus one, then the world ends, and there are no more rewards or penalties after that. [00:09:18] So this is actually not a very good policy. A policy is any function mapping from the states to the actions, and this is one such policy: it tells you, in this state, to go north, which is actually a pretty bad thing to do, right, since it takes you toward the minus-one reward. So this is not a great policy, but it is a policy. And here is V^pi for this policy. [00:10:10] Don't worry too much about the specific numbers, but if you look at this policy, you see that from this set of states it's pretty efficient at getting you to the really bad reward, and from this set of states it's pretty efficient at getting you to the good reward, with some mixing because of the noise, the robot veering off to the side. And so you know
these numbers are all negative and those numbers are at least somewhat positive. So V^pi is just: if you start from, say, this state, the state (1, 1), on expectation your sum of discounted rewards will be negative 0.8. [00:10:54] So that's what V^pi is. [00:10:59] Now, the following equation governs the value function. It's called a Bellman equation, and it says that your expected payoff at a given state is the reward that you receive plus a discount factor times the future rewards. So let me actually explain the intuition behind this. Let's say you start off at some state s_0; and again, let's say s is equal to s_0. So V^pi(s) is equal to, well, just for your robot's waking up in that state (I'm going to add to this in a second), just for the fact that your robot woke up in this state s, you
get the immediate reward: you get the reward R(s_0) right away. This is called the immediate reward, because just for the good fortune or bad fortune of starting off in this state, the robot gets a reward right away. [00:12:50] And then it will take some action and get to some new state s_1, and we'll receive gamma times the reward of s_1, and then some future reward at the next step, and so on. And just to flesh out the definition, the value function V^pi is really this sum, given that you execute the policy pi and s_0 equals s, that is, you start off in the state s. [00:13:27] Now what I'm going to do is rewrite this part of the equation a little bit: I'm going to take the rest of this and factor out one factor of gamma. So let me put
parentheses around this, and just take out gamma there. Okay, so I'm just taking this piece, this was gamma squared times R(s_2), but inside the parentheses here I'm just taking out one factor of gamma that multiplies the rest of the equation. Does that make sense? So gamma R(s_1) plus gamma squared R(s_2) plus ... equals gamma times (R(s_1) plus gamma R(s_2) plus ...). That's what I did down there: just factored out one factor of gamma. [00:14:17] And so this says the value of state s is the immediate reward plus gamma times the expected future rewards; the expected value of this term is really V^pi(s_1), so the second term here is the expected future rewards. [00:14:59] So Bellman's equation says that the value of a state, the expected total payoff you get if your robot wakes up in the state s, is the immediate reward plus gamma times the expected future rewards. Okay?
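The factor-out-gamma step can be checked numerically. Here's a minimal sketch with a made-up finite reward sequence (the sum in the MDP runs forever and these `rewards` values are hypothetical, not from the lecture's gridworld); the identity being checked is R(s_0) + gamma R(s_1) + gamma^2 R(s_2) + ... = R(s_0) + gamma (R(s_1) + gamma R(s_2) + ...).

```python
# Check the factor-out-gamma identity on a hypothetical finite reward
# sequence R(s_0), ..., R(s_3). The lecture's sum is infinite, but the
# algebra is the same for any truncation.
GAMMA = 0.99
rewards = [1.0, -0.5, 2.0, 0.25]  # made-up values, not the gridworld's

# Total discounted payoff: sum_t gamma^t * R(s_t)
total = sum(GAMMA ** t * r for t, r in enumerate(rewards))

# Factor one gamma out of every term after the first:
# R(s_0) + gamma * (R(s_1) + gamma * R(s_2) + ...)
tail = sum(GAMMA ** t * r for t, r in enumerate(rewards[1:]))
assert abs(total - (rewards[0] + GAMMA * tail)) < 1e-12
```

The `tail` sum is exactly the quantity whose expectation the Bellman equation replaces with V^pi(s_1).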
Right, and this thing above the curly braces is really asking: if your robot wakes up at the state s_1 and executes pi, what is the expected total payoff? If your robot wakes up in state s_1, then it will take an action, get to s_2, take an action, get to s_3, and so on, and this is the sum of discounted rewards when it starts off in the state s_1. [00:15:52] So based on this, you can write out Bellman's equations. [00:16:22] The mapping from the equation on top to the equation at the bottom is that s maps to s_0 and s prime maps to s_1. And so we have that V^pi(s) equals, so the value of state s is, R(s) plus gamma times V^pi(s'), where this is s_0 and this is s_1. And in MDP notation, if you want to write a long sequence of states, we tend to use s_0, s_1, s_2, s_3,
and s_4 and so on; but if you want to talk about just the current state and the state you get to after one time step, we tend to use s and s prime. So that's the mapping between these two pieces of notation: s prime is the state you get to after one step. [00:17:43] Well, what is s_1, or s prime, drawn from? The state s prime, or s_1, is the state you get to after one time step, so what is the distribution that s prime is drawn from? s prime is drawn from P of what? [00:18:02] Okay, P_{sa}, because in state s you will take the action a equals pi(s). We're executing the policy pi, so that means that when you're in the state s, you're going to take the action a given by pi(s); pi(s) tells you, please take this action a when you're in state s. And so s prime is drawn from P_{sa} where a is equal to pi(s), because that's the action you took, which is why s
prime, the state you get to after one time step, is drawn from the distribution P_{s, pi(s)}. [00:19:08] So putting all that together, let me just write out again Bellman's equations, which is: V^pi(s) equals R(s) plus the discount factor times the expected value of V^pi(s'). And this term here is just the sum over s prime of P_{s pi(s)}(s') times V^pi(s'); that underlined term, I guess, is this underlined term here. [00:19:43] Now, notice that this gives you a linear system of equations for actually solving for the value function. So let's say I give you a policy; it could be a good policy, could be a bad policy, and you want to solve for V^pi(s). If you think of the V^pi(s) as the unknowns you're trying to solve for, given pi, these equations, [00:20:27] these Bellman's equations, define a linear system of equations in terms of the V^pi(s) as the
values to be solved for. [00:20:40] So maybe here's a specific example: let's take the state (3, 1). [00:20:52] What Bellman's equation tells us is that V^pi of the state (3, 1) is equal to the immediate reward you get at the state (3, 1), plus the discount factor times the sum over s prime of P_{s pi(s)}(s') times V^pi(s'). And let's say that pi of (3, 1) is north, so we try to go north. If you try to go north from that state, then you have a 0.8 chance of getting to (3, 2), plus a 0.1 chance of veering left, plus a 0.1 chance of veering right. [00:21:58] So that's what Bellman's equation says about these values. [00:22:04] And if your goal is to solve for the value function, then these things I'm circling in purple are the unknown variables. And if you have eleven states, like in our MDP, then this gives you a system of eleven linear equations in eleven unknowns, and so using a linear algebra solver you can
solve explicitly for the values of these eleven unknowns. [00:22:37] So let's say I give you a policy pi, any policy pi. The way you can solve for the value function is to create an eleven-dimensional vector with V^pi(1, 1), V^pi(1, 2), and so on, down to the last state, V^pi(4, 3) or whatever, since you have eleven states. So if you want to solve for those eleven numbers I wrote up just in terms of defining V^pi, what you can do is this: given a policy pi, you construct an eleven-dimensional vector of the unknown values that you want to solve for, and Bellman's equations for each of the eleven states, with each of the eleven states plugged in on the left-hand side, give you one equation for how one of the values is determined as a linear
function of a few of the other values in this vector. And so what this does is set up a linear system of equations with eleven equations and eleven unknowns, and using a linear algebra solver you will be able to solve this linear system of equations. Does that make sense? [00:24:09] All right, and so this works; the nice thing is that if you have eleven states, it takes almost no time for a computer to solve this system of eleven equations. So that's how you would actually get those values if you were called on to solve for V^pi. [00:24:30] Actually, let me just say: raise your hand if what I just explained made sense. Cool, awesome. [00:24:45] All right, good. So moving on in our roadmap, we've defined V^pi; let's now define V*. [00:25:13] So V* is the optimal value function, and we'll define it as: V*(s) equals the max over all policies pi of V^pi(s). One of
the slightly confusing things about reinforcement learning terminology is that there are two types of value function: there's the value function for a given policy pi, and there's the optimal value function V*. So both of these are called value functions, but one is the value function for a specific policy, which could be a great policy, could be a terrible policy, could be the optimal policy; the other is V*, which is the optimal value function. So V* is defined as: look across all of the possible policies you could have, all four to the eleven, or whatever combinatorially large number of possible policies there is for the MDP, and take the max; that is, of all the possible policies anyone could implement, take the value of the best possible policy for that
state. So that's V*; that's the optimal value function. [00:26:36] And it turns out that there is a different version of Bellman's equations for this: again, there are Bellman's equations for V^pi, the value of a policy, and then there's a different version of Bellman's equations for the optimal value function. So just as there are two versions of value functions, there are two versions of Bellman's equations. But let me just write this out; hopefully it will make sense. [00:27:18] Actually, let's think this through. Let's say you start off your robot in a state s. What is the best possible expected sum of discounted rewards, what's the best possible payoff it can get? Well, just for the privilege of waking up in state s, the robot will receive an immediate reward R(s), and then it has to take some action, and after taking some action it will get to some other state s prime. And from that other state
s prime, it will receive future expected rewards V*(s'), and we have to discount that by gamma. [00:28:06] Well, the state s prime was arrived at by taking some action a from the initial state, and whatever the action is, if you take an action a in the state s, then your expected total payoff will be the immediate reward plus gamma times the expected value of the future payoff. But what is the action a that we should plug in here? [00:28:48] Well, the optimal action to take in the MDP is whatever action maximizes your expected total payoff, maximizes your expected sum of rewards, which is why the action you want to plug in is just whatever action a maximizes that term. [00:29:05] So this is Bellman's equations for the optimal value function, which says that the best possible expected total payoff you could receive starting from state s is the immediate reward R(s) plus the max
expected total payoff you could receive starting from state s [00:29:17] you could receive starting from state s is the immediate reward R of s plus max [00:29:21] is the immediate reward R of s plus max over all possible actions of whatever [00:29:23] over all possible actions of whatever action allows you to maximize you know [00:29:25] action allows you to maximize you know your expected total payoff expect a [00:29:28] your expected total payoff expect a future payoff [00:29:29] future payoff okay so this is the expected future [00:29:32] okay so this is the expected future payoff expected future reward now based [00:29:51] payoff expected future reward now based on the argument we just went through [00:29:54] on the argument we just went through this allows us to figure out how to [00:29:58] this allows us to figure out how to compute PI star of s as well right which [00:30:04] compute PI star of s as well right which is let's say let's say we have a way of [00:30:08] is let's say let's say we have a way of computing V star of s right we don't yet [00:30:10] computing V star of s right we don't yet but let's say I tell you what does V [00:30:12] but let's say I tell you what does V Sarvis and then I'll see you you know [00:30:15] Sarvis and then I'll see you you know what is the action you should take in a [00:30:17] what is the action you should take in a given stage so remember PI spy star of [00:30:20] given stage so remember PI spy star of PI star is going to auto policy and so [00:30:29] PI star is going to auto policy and so what should PI star vests be right which [00:30:31] what should PI star vests be right which is let's say let's say we're we're [00:30:33] is let's say let's say we're we're computing V Star and I now ask you hey [00:30:37] computing V Star and I now ask you hey my robots in state s what is the best [00:30:39] my robots in state s what is the best action I should take from the state s [00:30:42] action I should take from the state s 
right then how do I decide what action [00:30:46] right then how do I decide what action to take in the state yes [00:30:49] to take in the state yes well what would think is the best action [00:30:51] well what would think is the best action to take from this state and the answer [00:30:55] to take from this state and the answer is almost given in the equation of oh [00:30:57] is almost given in the equation of oh yeah [00:31:01] yeah cool awesome right so the best [00:31:05] yeah cool awesome right so the best action to take and state us and best [00:31:07] action to take and state us and best means maximizing expected total payoff [00:31:10] means maximizing expected total payoff but the option that maximizes your [00:31:12] but the option that maximizes your expenses total payoff is you know well [00:31:13] expenses total payoff is you know well whatever action we were choosing a up [00:31:16] whatever action we were choosing a up here and so it's just long max over a [00:31:27] and because gamma is just a constant [00:31:30] and because gamma is just a constant that doesn't affect the arcmap usually [00:31:32] that doesn't affect the arcmap usually we just we just eliminate that this is [00:31:34] we just we just eliminate that this is just a positive number right so this [00:31:42] just a positive number right so this gives us the strategy we will use for [00:31:46] gives us the strategy we will use for finding the also policy for an MVP which [00:31:50] finding the also policy for an MVP which is we're going to find a way to compute [00:31:54] is we're going to find a way to compute V Star of S which we don't have a way of [00:31:56] V Star of S which we don't have a way of doing yet rightly star was defined as a [00:31:58] doing yet rightly star was defined as a max over combinatorially or [00:32:00] max over combinatorially or exponentially large number policies so [00:32:02] exponentially large number policies so we don't have way of computing piece not 
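As a minimal sketch of the argmax step just described — reading the optimal action off a known V star — here's some hypothetical Python. The state and action names, the `P[(s, a)]` dictionary convention, and the V values below are all made up for illustration; only the formula pi star of s = argmax over a of sum over s prime of P(s, a, s') times V star of s prime comes from the lecture (gamma is dropped since, as a positive constant, it doesn't affect the argmax).

```python
# Hypothetical sketch: extracting the optimal policy pi* from V*.
# pi*(s) = argmax_a sum_{s'} P(s, a, s') * V*(s')
# P[(s, a)] is an invented convention: a dict mapping next state -> prob.

def extract_policy(states, actions, P, V):
    pi = {}
    for s in states:
        pi[s] = max(
            actions,
            key=lambda a: sum(p * V[s2] for s2, p in P[(s, a)].items()),
        )
    return pi

# Tiny made-up example: action "stay" keeps you in s0 (value 0.1),
# action "go" moves you to s1 (value 1.0), so "go" is optimal in s0.
states = ["s0", "s1"]
actions = ["stay", "go"]
P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"): {"s1": 1.0},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"): {"s1": 1.0},
}
V = {"s0": 0.1, "s1": 1.0}
pi = extract_policy(states, actions, P, V)
# pi["s0"] is "go", since 1.0 * V(s1) beats 1.0 * V(s0)
```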
[00:32:03] But if we can find a way to compute V star — sorry, I just scratched that out myself — then this equation gives you a way, for every state s, to pretty efficiently compute this argmax, and therefore figure out the optimal action for every state. [00:32:54] All right, so, just to practice with this confusing notation, let's see if you understand this equation. I'm just claiming this, I'm not proving it: for every state s, V star of s equals V pi star of s, which is greater than or equal to V pi of s, for every policy pi and every state s. Okay, so I hope this equation makes sense — this is what I'm claiming; I didn't prove it. What I'm claiming is that the optimal value for state s — on the left, the optimal value function — equals the value function for pi star, that is, the value function for a specific policy pi, [00:33:51] where the policy pi happens to be pi star. So the optimal value for state s is equal to the value function for pi star applied to the state s, and this is greater than or equal to V pi of s for any other policy pi. [00:34:17] So the strategy you can use for finding the optimal policy is: one, find V star; two, use the argmax equation to find pi star. Okay, and step two we know how to do from the argmax equation, so what we're going to do is develop an algorithm for actually computing V star, because if you can compute V star, then this equation allows you to pretty quickly find the optimal action for every state. [00:35:32] So value iteration is an algorithm you can use to find V star. Let me just write out the algorithm. [00:36:17] In the value iteration algorithm, you initialize the estimated value of every state to zero, and then you update these estimated values using Bellman's equations — the optimal value function, V star, version of Bellman's equations. [00:36:48] And to be concrete about how you implement this: if you were implementing this in Python, what you would do is create an 11-dimensional vector to store all the values V of s. So you create, you know, an 11-dimensional vector that represents V of (1,1), V of (1,2), down to V of (4,3), corresponding to the 11 states. Oh, I'm sorry — wait, did I say 11? Aren't there 10 states in the MDP? Wait — I've been saying 11 all along. Sorry — oh yes, you're right, 11, sorry, yes, okay. [00:37:45] So for the 11-state MDP you'd create an 11-dimensional vector and initialize all of these values to 0, and [00:37:54] then you will repeatedly update the estimated value of every state according to Bellman's equations. And there are actually two ways to interpret this, similar to gradient descent: we've written out, you know, a gradient descent rule for updating the vector of parameters theta, and what you do there is update all of the components of theta simultaneously — that's called a synchronous update in gradient descent. So one way you would apply this update, in what's called a synchronous update, [00:38:47] is that you compute the right-hand side for all 11 states, and then you simultaneously overwrite all 11 values at the same time; then you compute all 11 right-hand sides again, and then again simultaneously update all 11 values. Okay.
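To make the synchronous update concrete, here's a hypothetical Python sketch. Everything specific — the two-state MDP, the `P[(s, a)]` dictionary convention, the reward values — is invented for illustration; it is not the 11-state gridworld from lecture. The key point is that all right-hand sides are computed before any value is overwritten.

```python
# Hypothetical sketch of (synchronous) value iteration:
#   V(s) := R(s) + gamma * max_a sum_{s'} P(s, a, s') * V(s')
# The tiny MDP below is made up for illustration.

def value_iteration(states, actions, P, R, gamma, n_iters=100):
    V = {s: 0.0 for s in states}          # initialize every V(s) to 0
    for _ in range(n_iters):
        # Compute the right-hand side for all states first...
        new_V = {
            s: R[s] + gamma * max(
                sum(p * V[s2] for s2, p in P[(s, a)].items())
                for a in actions
            )
            for s in states
        }
        V = new_V                          # ...then overwrite simultaneously
    return V

# Made-up two-state chain: s0 (reward 0) leads to s1 (reward 1),
# which loops back to itself.
states = ["s0", "s1"]
actions = ["a"]
P = {("s0", "a"): {"s1": 1.0}, ("s1", "a"): {"s1": 1.0}}
R = {"s0": 0.0, "s1": 1.0}
V = value_iteration(states, actions, P, R, gamma=0.9)
# V(s1) converges toward 1 / (1 - 0.9) = 10, and V(s0) toward 0.9 * 10 = 9
```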
The alternative would be an asynchronous update. [00:39:13] In an asynchronous update, what you do is compute V of (1,1) — and the value of V of (1,1) depends on some of the other values on the right-hand side, right — but in an asynchronous update you compute V of (1,1) and overwrite that value first, and then you use that equation to compute V of (1,2), and then you update that, and you update these one at a time. And the difference between synchronous and asynchronous is, you know, if you're using an asynchronous update, by the time you're computing V of (4,3), which depends on some of the earlier values, you'd be using the new, refreshed values of some of the earlier states on your list. Okay. It turns out that value iteration works fine with either synchronous or asynchronous updates, but because the synchronous version vectorizes better — you can use more efficient [00:40:07] matrix operations — most people use the synchronous update. It turns out the algorithm will work whether you use a synchronous or asynchronous update, but unless otherwise stated, you should usually assume that when I talk about value iteration I'm referring to the synchronous update, where you compute all 11 values and then update all 11 values at the same time. Okay, was there a question just now? [00:40:53] Yeah — yes. So, how do you represent the absorbing state, the sink state? Say we go to plus one or minus one and the world ends — in this framework, one way to code that up would be to say that the transition probability from that state to any other state is zero. That's one way; that would work. Another way, which is done less often — maybe mathematically cleaner, but not how people tend to do it — would be to take your
MDP and then create an extra state, and that extra state always goes back to itself with no further reward. So both of these would give you the same result, though it's maybe more convenient to just set, you know, P of s comma a comma s prime equal to 0 for all other states — it's not quite how I defined things earlier, but that will give you the right answer as well. All right, cool. [00:41:48] So, just as a point of notation: if you're using synchronous updates, you can think of this as taking the old value-function estimate and using it to compute the new estimate. So, assuming the synchronous update, you have some previous 11-dimensional vector with your estimate of the value from the previous iteration, and after doing one iteration of this you have a new set of estimates. So one step of this algorithm is sometimes called the Bellman backup operator, and you'd write the update as [00:42:35] V := B of V, right, where V is now an 11-dimensional vector: you take V, the original vector, compute the Bellman backup operator — which is just that equation there — and update V according to B. And so one thing that you'll see in the problem set is showing that this will make V converge to V star. [00:43:26] So it turns out that you can prove — and you'll see more details in the problem set — that by repeatedly enforcing Bellman's equations, this algorithm will cause your vector of eleven values, V, to converge to the optimal value function V star. Okay, and for more details see the homework and the lecture notes. And it turns out this algorithm actually converges quite quickly. So to give you a flavor: if the discount factor is 0.99, it turns out that you can show that the error reduces by a factor of 0.99 on every iteration, and so V actually converges quite quickly — geometrically, exponentially quickly — to the optimal value function V star. So if the discount factor is 0.99, then within a few hundred iterations V would be very close to V star; and if the discount factor is 0.9, then with just, you know, ten or a few dozen iterations V would be very close to V star. So this algorithm converges quite quickly to V star. [00:45:15] So, just to put everything together: if you run value iteration on that MDP, [00:45:37] you end up with this — so this is V star. So these are eleven numbers telling you the optimal expected payoff for starting off in each of the eleven possible states.
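The backup-operator view and the claimed convergence rate can be sketched together on a made-up one-state MDP. Everything concrete here is invented for illustration (the `P[(s, a)]` / `R[s]` conventions, the self-loop example); this illustrates the error-shrinks-by-gamma behavior on a toy case, it does not prove the general contraction result from the problem set.

```python
# Sketch: one step of value iteration as the Bellman backup operator B,
#   (B V)(s) = R(s) + gamma * max_a sum_{s'} P(s, a, s') * V(s'),
# so value iteration is just V := B(V), repeated. On this made-up
# one-state MDP with a self-loop, the fixed point is V* = R / (1 - gamma),
# and the error |V - V*| shrinks by exactly gamma per iteration.

def bellman_backup(V, states, actions, P, R, gamma):
    return {
        s: R[s] + gamma * max(
            sum(p * V[s2] for s2, p in P[(s, a)].items()) for a in actions
        )
        for s in states
    }

gamma = 0.99
states, actions = ["s"], ["a"]
P = {("s", "a"): {"s": 1.0}}           # self-loop with probability 1
R = {"s": 1.0}
V_star = R["s"] / (1.0 - gamma)        # fixed point: 100.0

V, errors = {"s": 0.0}, []
for _ in range(6):
    V = bellman_backup(V, states, actions, P, R, gamma)
    errors.append(abs(V["s"] - V_star))

ratios = [errors[i + 1] / errors[i] for i in range(len(errors) - 1)]
# each ratio equals gamma = 0.99 (up to floating point)
```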
[00:46:17] And I had previously said — I think last week, or the week before Thanksgiving — that this is the optimal policy. So, you know, let's just use as a case study how you compute the optimal action for that state, given this V star. [00:46:42] All right, well, what you do is you just use this equation. And so if you were to go west — if you were to compute, I guess, this term, sum over s prime of P of s comma "west" (or "left," I guess) comma s prime, times V star of s prime — [00:47:29] right, so if you're in this state and you attempt to go left, then there's a 0.8 chance you end up there, with a V star of 0.75; there's a 0.1 chance — if you try to go left, there's a 0.1 chance — you veer off to the north and get 0.69; and then there's a 0.1 chance that you actually go south, bounce off the wall, and end up with 0.71. And so your expected future rewards — your expected future payoff, given this [00:48:03] equation — is that if you attempt to go west, you end up with 0.740 as your expected future reward. Whereas if you were to go north, you do a similar computation — [00:48:18] you know, 0.8 times 0.69, plus 0.1 times 0.75, plus 0.1 times 0.49, weighting the average appropriately — and you find that it's equal to 0.676, which is your expected future reward for going north. So if you go west — er, left — it's 0.740, which is quite a bit higher than if you go north, which is why we can conclude, based on this little calculation, that the optimal policy is to go left at that state. And then, really, technically, you check north, south, east, and west and make sure that going west gives the highest reward — and that's how you can conclude that going west is actually the better action at this state. Okay, so that's value iteration, and based on this, if you're given an MDP, you can implement this and solve for V star.
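The two expectations just computed can be double-checked directly. The V star values 0.75, 0.69, 0.71, and 0.49 are the neighboring-cell values read off the board in lecture, and the 0.8 / 0.1 / 0.1 split is the gridworld's noisy dynamics (intended direction succeeds with probability 0.8, veer to either side with probability 0.1 each):

```python
# Checking the lecture's worked example: expected future payoff
#   sum_{s'} P(s, a, s') * V*(s')
# for attempting to go west vs. attempting to go north.

expected_west = 0.8 * 0.75 + 0.1 * 0.69 + 0.1 * 0.71
expected_north = 0.8 * 0.69 + 0.1 * 0.75 + 0.1 * 0.49

# expected_west  = 0.740
# expected_north = 0.676  ->  west is the better of the two actions
```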
And then you can compute pi star. [00:49:20] A few more things I'll go over, but before I move on, let me check — are there any questions? Oh, sure, yep. [00:49:34] Is the state space always finite? So in what we're discussing so far, yes, but what we'll see on Wednesday is how to generalize this framework. Well, I'll allude to this a little bit later, but it turns out if you have a continuous-state MDP, one of the things that's often done, I guess, is to discretize it into a finite number of states; but then there are also some other versions of, you know, value iteration that apply directly to continuous states as well. [00:50:19] So what I described is an algorithm called value iteration. The other common, sort of textbook algorithm for solving MDPs is called policy iteration, and let me just write out what the algorithm is. So here's the algorithm, which is, um — you know, initialize pi randomly. [00:51:42] Okay, so let's see what this algorithm does. So let's talk about pros and cons of value iteration versus policy iteration a little bit. In policy iteration, instead of solving for the optimal value function — in value iteration the focus of attention was V star, right, where you do a lot of work to try to find the value function, and then once you've solved for V star you figure out the best policy — in policy iteration, the focus of attention is on the policy pi rather than the value function. So initialize pi randomly — that means for each of the 11 states, pick a random action as the random initialization — and then we're going to repeatedly carry out these two steps. The first step is: solve for the value function for the policy pi. [00:52:36] Right — remember, for V pi this was a linear system of equations, with eleven variables, eleven unknowns: it was a linear system of eleven equations with eleven unknowns. And so, [00:52:50] using a sort of linear algebra solver, a linear equation solver, given a fixed policy pi, you can — at the cost of inverting a matrix, roughly, right — solve for all of these eleven values. And so in policy iteration you would, you know, use a linear solver to solve for the value function for this policy pi that we just randomly initialized, and then set V to be the value function for that policy. Okay, and so this is done quite efficiently with a linear solver. And then the second step of policy iteration is: pretend that V is the optimal value function, and update pi of s using the Bellman equations for the optimal value function — where you update it as you saw, right, the same way you updated pi earlier. And then you iterate: given a new policy, you then solve that linear system of equations for your new policy pi [00:54:00] to get a new V pi, and you keep on iterating these two steps until it converges. Okay — yeah, yep, yes, that's right. [00:54:22] So in value iteration, you're waiting until the end to compute pi of s — you solve for V star first and then compute pi of s — whereas in policy iteration, we're coming up with a new policy on every single iteration. Okay. So, pros and cons — and it turns out that this algorithm will also converge to the optimal policy. Pros and cons of policy iteration versus value iteration: policy iteration requires solving this linear system of equations in order to get V pi, and with eleven states that's really easy — you solve a linear system of eleven equations to get V pi.
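A minimal sketch of those two repeated steps — policy evaluation by a linear solve, then greedy improvement — on a made-up two-state MDP. The state/action names, transition matrices, and rewards are invented; only the two-step structure and the use of a linear equation solver for V pi come from the lecture.

```python
# Hypothetical sketch of policy iteration on a made-up two-state MDP.
# Step 1 (policy evaluation): V^pi solves the linear system
#   (I - gamma * P_pi) V = R, handled here by a linear solver.
# Step 2 (improvement): act greedily, pretending V is optimal.
import numpy as np

gamma = 0.9
R = np.array([0.0, 1.0])                  # R(s0), R(s1)
P = {                                     # P[a][s, s'] = transition prob.
    "stay": np.array([[1.0, 0.0], [0.0, 1.0]]),
    "go":   np.array([[0.0, 1.0], [0.0, 1.0]]),
}
actions = ["stay", "go"]

pi = ["stay", "stay"]                     # (random-ish) initial policy
for _ in range(10):
    # Step 1: solve the 2-equation linear system for V^pi.
    P_pi = np.array([P[pi[s]][s] for s in range(2)])
    V = np.linalg.solve(np.eye(2) - gamma * P_pi, R)
    # Step 2: greedy update of pi; R(s) and the positive constant gamma
    # don't affect the argmax over actions, so they're omitted.
    pi = [max(actions, key=lambda a: float(P[a][s] @ V)) for s in range(2)]

# pi settles at ["go", "stay"]: from s0 you should move to the rewarding
# state s1; in s1 both actions are equivalent (ties go to "stay").
```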
[00:55:22] set of states like eleven states are really anything you know like a few [00:55:23] really anything you know like a few hundred States policy raishin we're [00:55:27] hundred States policy raishin we're quite quickly but if you have a [00:55:30] quite quickly but if you have a relatively large set of states you know [00:55:32] relatively large set of states you know like ten thousand stays or a million [00:55:35] like ten thousand stays or a million states then this step would be much [00:55:38] states then this step would be much slower at least if you do it right by [00:55:40] slower at least if you do it right by solving the system of equations and then [00:55:42] solving the system of equations and then I would favor a value iteration over [00:55:44] I would favor a value iteration over policy iterations so for larger problems [00:55:46] policy iterations so for larger problems usually value iteration will usually I [00:55:51] usually value iteration will usually I would use value iteration because [00:55:53] would use value iteration because solving this linear system of equations [00:55:55] solving this linear system of equations you know this is pretty expensive if [00:55:58] you know this is pretty expensive if it's a good million by there's a million [00:56:00] it's a good million by there's a million equations a million unknowns that's [00:56:02] equations a million unknowns that's quite expensive [00:56:03] quite expensive but if in Lebanon stays in Lebanon knows [00:56:04] but if in Lebanon stays in Lebanon knows there's very small system equations and [00:56:07] there's very small system equations and then one one other pros and cons one of [00:56:09] then one one other pros and cons one of the difference that's maybe maybe more [00:56:13] the difference that's maybe maybe more academic than practical but it turns out [00:56:16] academic than practical but it turns out that if you use value iteration V will [00:56:19] that if you use value 
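The two steps just described, evaluating the current policy exactly with one linear solve and then improving it greedily, can be sketched in a few lines of NumPy. This is my own illustrative sketch, not code from the lecture; the (A, S, S) transition array, the state-based reward vector, and all names here are assumptions.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.99, max_iter=1000):
    """Policy iteration sketch for a small discrete MDP.

    Assumed layout (mine, not from the lecture):
      P: shape (A, S, S), P[a, s, s2] = Pr(next state s2 | state s, action a)
      R: shape (S,), reward received in each state
    """
    A, S, _ = P.shape
    pi = np.zeros(S, dtype=int)            # arbitrary initial policy
    for _ in range(max_iter):
        # Step 1: policy evaluation, a single linear solve:
        #   V = R + gamma * P_pi V   =>   (I - gamma * P_pi) V = R
        P_pi = P[pi, np.arange(S), :]      # (S, S): each row follows pi
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R)
        # Step 2: greedy improvement using Bellman's equation
        new_pi = np.argmax(P @ V, axis=0)  # (A, S) values -> best action per state
        if np.array_equal(new_pi, pi):     # policy stopped changing:
            break                          # it is now exactly optimal
        pi = new_pi
    return pi, V
```

Note the termination test: the loop stops after a finite number of iterations, once the greedy policy stops changing.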
[00:56:09] And then one other pro and con, one other difference that's maybe more academic than practical: it turns out that if you use value iteration, V will converge to, what, V star, but it won't ever get to exactly V star, right? Just as if you apply gradient descent to linear regression: gradient descent gets closer and closer and closer to the global optimum, but it never, you know, reaches exactly the global optimum; it just gets really, really close, really fast. Gradient descent, it actually turns out, asymptotically converges geometrically, really quickly, right, but it never quite gets definitively to the one optimal value; whereas, as you saw, the normal equations just jump straight to the optimal value, with no slow converging. And so value iteration converges toward V star, but it doesn't ever end up at exactly the value V star. [00:57:04] This difference may be a bit academic, because in practice it doesn't matter, right? But in policy iteration, if you iterate this algorithm, then after a finite number of iterations the algorithm will stop changing, meaning that after a certain number of iterations pi of s just doesn't change anymore, right? So you find pi of s, update the value function, and then after another iteration, when you take these argmaxes, you end up with exactly the same policy. And so this gets the optimal value and the optimal policy exactly: it doesn't just converge toward the optimal value, it actually reaches the optimal value when it converges, okay? [00:57:51] So I think in practice I actually see value iteration used much more, because solving this linear system of equations gets expensive, you know, if you have a large state space. So it's usually value iteration; I see value iteration used much more. But if you have a small problem, you know, I think you could also use policy iteration, which might converge a little bit faster.
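For contrast, here is the same kind of sketch for value iteration: repeated Bellman backups that approach V star geometrically, so in practice you stop once successive sweeps agree to within a tolerance rather than waiting for exact convergence. As before, the (A, S, S) transition layout and the names are my own assumptions, not the lecture's code.

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-8):
    """Value iteration sketch: repeated Bellman backups.

    Assumed layout (mine): P has shape (A, S, S), R has shape (S,).
    V approaches V* geometrically but never reaches it exactly, so
    we stop once successive sweeps agree to within a tolerance.
    """
    V = np.zeros(P.shape[1])
    while True:
        V_new = R + gamma * np.max(P @ V, axis=0)  # Bellman backup
        done = np.max(np.abs(V_new - V)) < tol     # geometric convergence
        V = V_new
        if done:
            return np.argmax(P @ V, axis=0), V     # extract greedy policy last
```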
[00:58:23] So the last thing is kind of putting it together, right: what if you don't know the model? So it turns out that when you apply this to a practical problem, you know, in robotics, right, one common scenario you run into is that you do not know P of s prime given s, a, if you don't know the state transition probabilities, right? So when we built the MDP we said, well, let's say the robot, when you command it to go off in some direction, has a 0.8 chance of going that way and a 0.1 chance of veering off. This is a very simplified robot, but if you build an actual robot, or build a, you know, helicopter or whatever, or play chess against an opponent, the state transition probabilities are often not known in advance. And so in many MDP implementations you need to estimate this from data.
[00:59:40] And so the workflow of many, many reinforcement learning projects will be that you have some policy and have the robot run around; you know, just have the robot run around the maze, and count up all the times you had it take the action north: how often did it actually go north, and how often did it veer off left or right, right? And you use those statistics as state transition probabilities. So let me just write this out. After, you know, taking maybe a random policy, after executing some policy in the MDP for a while, you would then estimate this from data, and the obvious formula would be to estimate P of (s, a, s prime) as the number of times you took action a in state s and got to s prime, divided by the number of times you took action a in state s. [01:00:51] That's right, and that estimate of P of (s, a, s prime) is actually the maximum likelihood estimate: you look at the number of times you took action a in state s, and take the fraction of those times that you got to the state s prime, right? Or 1 over the number of states: a common, you know, heuristic is, if you've never taken this action in this state before, if the number of times you tried action a in state s is zero, so you've never tried this action in this state and you have no idea what it's going to do, then just assume that the state transition probability is uniform, 1 over 11, right: it randomly takes you anywhere. So these would be rather common heuristics that people use when implementing reinforcement learning algorithms.

[01:01:46] And it turns out that you can use Laplace smoothing for this if you wish, but you don't have to. So Laplace smoothing, right, would add one to the numerator and eleven to the denominator, and that avoids the problem of zero over zero as well. But it turns out that, unlike the naive Bayes algorithm, these MDP solvers are not that sensitive to zero values. So if one of your estimated probabilities is zero, you know, unlike naive Bayes, where having a zero probability was very problematic for the classifications made by naive Bayes, it turns out that MDP solvers, including value iteration and policy iteration, do not give sort of nonsensical or horrible results just because a few of the probabilities are exactly zero. And so in practice, you know, you can use Laplace smoothing if you wish, but because the reinforcement learning algorithms don't perform that badly when a few of these estimates are exactly zero, in practice Laplace smoothing is not commonly used; what I just wrote is more common.

[01:03:27] So, to put it together: if I give you a robot and ask you to implement an MDP solver to find a good policy for this robot, what you would do is the following. Repeat: take actions with respect to some policy pi to get experience in the MDP, so go ahead and let your robot loose and have it execute some policy for a while; then update your estimates of P of (s, a, s prime) based on the observations of where the robot goes when it takes different actions in different states; solve Bellman's equation using value iteration to get V; and then update pi. So this is the value iteration way of putting it together; if you want to plug in policy iteration instead, that's also okay. But so if you actually get a robot, you know, right, a robot where you do not know the state transition probabilities in advance, then this is what you would do: you iterate a few times, I guess, right, repeatedly finding a policy given your current estimate of the state transition probabilities, getting some experience, updating your estimates of P of (s, a, s prime), finding a new policy, and kind of repeating this process until it hopefully converges to a good policy.
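The count-based estimate and the estimate-then-solve loop above can be sketched as follows. Again this is an illustrative sketch, not the lecture's code: the (A, S, S) count array, the uniform fallback for untried pairs, and the hypothetical run_policy experience source are all my own assumptions.

```python
import numpy as np

def estimate_transitions(counts, n_states):
    """Maximum likelihood estimate of the state transition probabilities.

    counts[a, s, s2] = number of times taking action a in state s led to
    state s2. For (s, a) pairs that were never tried, fall back to the
    uniform 1/|S| heuristic from lecture.
    """
    totals = counts.sum(axis=2, keepdims=True)       # times a was tried in s
    return np.where(totals > 0,
                    counts / np.maximum(totals, 1),  # fraction that reached s2
                    1.0 / n_states)                  # never tried: uniform

# The overall estimate-then-solve loop, with run_policy standing in for
# whatever collects (s, a, s2) experience from the robot (hypothetical):
#
#   counts = np.zeros((A, S, S))
#   pi = some initial policy
#   repeat:
#       for (s, a, s2) in run_policy(pi):
#           counts[a, s, s2] += 1
#       P_hat = estimate_transitions(counts, S)
#       pi, V = solve the MDP (value or policy iteration) on P_hat
```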
[01:06:03] Now, just to add more color, more richness to this: we usually think of the reward function as being given, right, as part of the problem specification, but sometimes you see that the reward function may be unknown as well. And so, for example, if you're building a stock trading application and the reward is the return on a certain day, it may not be a deterministic function of the state; it may be a little bit random. Or if your robot is, you know, running around, then depending on where it goes it may hit different bumps in the road, and you want to give it a penalty every time it hits a bump; if you build a self-driving car, right, every time it hits a bump, hits a pothole, you give it a negative reward. So sometimes the rewards are a random function of the environment, and so sometimes you can also estimate the expected value of the reward: in applications where the reward is a random function of the state, this same process allows you to also estimate the expected value of the reward from every state, and then run the same procedure, okay?
[01:07:30] Yeah? Yep, cool, great question. So let me, let me talk about exploration, right. So it turns out that, um, this algorithm will work okay for some problems, but, again to add richness to this, there's one other issue that this is not solving, which is the exploration problem. In reinforcement learning you sometimes hear the term exploration versus exploitation. [01:08:01] Let me use a different MDP example, which is, um: your robot, you know, starts off here, and there is a plus-one reward here, right, and maybe a plus-ten reward over here. If, just by chance, the first time you run the robot it happens to find its way to the plus one, then if you run this algorithm it may figure out that going to the plus one is a good thing, right; remember we're giving a discount factor, plus a fuel surcharge of minus 0.02 on every step. So if, just by chance, your robot happens to find this way to the plus one the first few times you run this algorithm, then this algorithm is itself locally greedy, right? [01:08:56] It may figure out that this is a great way to get to the plus-one reward, and then the world ends and it stops getting these minus 0.02 surcharges for fuel. And so this particular algorithm may converge to a bad, you know, kind of local optimum, where it's always heading to the plus one, and as it heads to the plus one, sometimes veering off randomly, right, it accumulates a little bit more experience in the right half of the state space, and ends up with a pretty good estimate of what happens in the right half of the state space, and it may never find the harder-to-find plus-ten pot of gold over on the lower left, okay? [01:09:38] So this problem is sometimes called, actually, no, it is called the exploration versus exploitation problem, which is: when you're acting in an MDP, you know, how aggressively, how greedily, should you be at just taking actions to maximize your rewards? And so the algorithm as described is relatively greedy, meaning that it's taking your best estimate of the state transition probabilities and rewards and just taking whatever actions follow; this is really saying, you know, pick the policy that maximizes your current estimate of the expected rewards, and it's just acting greedily, meaning on every step it's just executing the policy that it thinks allows it to maximize the expected payoff. [01:10:28] And what this algorithm does not do at all is explore, which is the process of taking actions that may appear less optimal at the outset, such as when the robot hasn't seen this plus-ten reward and doesn't know how to get there.
[01:10:44] Maybe it should, you know, just try going left a couple of times, just for the heck of it, right, to see what happens. Because even if going left seems less optimal from the perspective of the robot's current state of knowledge, maybe if it tries some new things it has never tried before, maybe it'll find a new pot of gold, okay? So this is called the exploration versus exploitation trade-off. [01:11:07] Oh, and this is actually not just an academic problem; it turns out that some of the large online web advertising platforms have the same problem as well. Again, I've mixed feelings about the advertising business; it's very lucrative and it causes other problems as well. But it turns out that for some of the large online platforms, you know, when an advertiser starts running a new ad, posting a new ad on one of the large online ad platforms, the ad platform does not know who is most likely to click on this ad. [01:11:39] And so pure exploitation... boy, "exploitation" is such a horrible word; no, here it's the technical term, not the social term; I mean it in the pure, you know, reinforcement learning sense of an exploitation policy, not, not the other, even more horrible, sense of exploitation. A pure exploitation policy would be to always just show, show users the ads that you know they are most likely to click on, to drive short-term revenue: just show people the ads they're most likely to click on, to maximize short-term revenue. Whereas an exploration policy, for, you know, one of these large online ad platforms, is to show people some ads that may not be what we think they're most likely to click on at this moment in time; but by showing you that ad, or by showing the pool of users an ad that they might be less likely to click on, maybe we'll learn more about your interests, and that increases the effectiveness of these ad platforms at finding more relevant ads. [01:12:35] For example, I don't know, I guess, uh, they probably do know about my appetite for Mars landers by now, but if the large online ad platforms didn't know that I'm actually pretty interested in Mars landers, and one showed me an ad for a Mars lander, which, I don't think such a thing exists for sale, right, and I clicked on it, it would learn that showing me ads for Mars landers is a great thing, right, or ads for some other thing I had no known interest in. So this is actually a real problem, and some of the large online ad platforms actually do explicitly consider exploration versus exploitation, and make sure that sometimes they show ads that may not be the most likely for you to click on, but that, you know, allow them to gather information
[01:13:16] know, allows you to gather information to then be better situated to figure out where the future rewards are, to be better positioned to learn how to maximize them. And it's not just you, but other users like you. Sorry, okay. But so, in order to make sure the reinforcement learning algorithm explores as well as exploits, a common modification to this would be: instead of always taking actions with respect to pi, you may have a 0.9 chance of acting with respect to pi and a 0.1 chance of taking an action randomly. Okay? [01:14:04] And so this particular exploration policy is called epsilon-greedy, where on every time step you toss a biased coin: let's say with 90% chance you execute whatever you think is the current best policy, and with 10% chance you just take a random action. And this type of exploration policy increases the odds that, you know, every now and then, maybe just by chance, it'll find its way to the +10, [01:14:37] learning the state transition probabilities, and then eventually end up exploring the state space more thoroughly. Okay, this is called epsilon-greedy exploration, and it's a little bit of a misnomer, I think. The way we think of epsilon: epsilon, say 0.1, is the chance of taking a random action instead of the greedy action. This algorithm has always been a little bit strangely named, because 0.1 is actually the chance of your acting randomly. So "epsilon-greedy" sounds like you're being greedy epsilon of the time, but you're actually taking actions randomly epsilon of the time; so epsilon-greedy should maybe actually be "one-minus-epsilon greedy". This name has always been a little bit off, but that's how people use the term: epsilon-greedy exploration means that epsilon of the time, [01:15:30] where epsilon is a hyperparameter of the algorithm, you act randomly instead of doing what you think is the best policy. Okay. [01:15:37] And it turns out that if you implement this algorithm with epsilon-greedy exploration, then the algorithm will converge to the optimal policy for any discrete-state MDP. Sometimes it can take a long time, because, you know, if it takes a long time to randomly find the +10, it could take a long time before it randomly stumbles upon the +10. But this algorithm, with an exploration policy, will converge to the optimal policy for any MDP. [01:16:14] Oh yeah, yes? [Student: Do you always keep epsilon constant, or do you decay epsilon?] So yes, there are many heuristics for how to explore. One reasonable thing to do would be to start with a large value of epsilon and slowly shrink it.
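The epsilon-greedy rule described above can be sketched in a few lines. This is a minimal sketch, assuming action values are stored in a Q-table indexed by [state, action]; in the lecture the greedy choice is simply "whatever you think is the current best policy":

```python
import numpy as np

def epsilon_greedy_action(Q, state, epsilon=0.1, rng=None):
    """With probability epsilon take a uniformly random action (explore);
    otherwise take the action that looks best under Q (exploit)."""
    rng = rng or np.random.default_rng()
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore
    return int(np.argmax(Q[state]))           # exploit
```

The decay heuristic from the answer above (start with a large epsilon, slowly shrink it) just means passing a smaller epsilon on later time steps.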
[01:16:40] Another common heuristic would be, um, a different type of exploration called Boltzmann exploration, which you can look up if you want. The idea is: if you think that the value of going one way is, you know, 10, and the value of going the other way is 1, then there's such a huge difference that you should bias your actions toward the bigger reward. And then you could have the probability of an action be e to the value, basically, times a scaling factor. [01:17:10] So that's called Boltzmann exploration, where instead of having a 10% chance of taking an action completely at random, you have a very strong bias toward heading to the higher values, but also some probability of going to the lower values, where the exact probability depends on the different values. So between the two, I think epsilon-greedy, I feel like I see this used the most often for these types of MDPs, and then Boltzmann exploration, which is why I just mentioned it as well. Let's just take two more questions and then wrap up. [01:17:45] [Student: Could you give a reward of one for reaching a state it has never seen before?] Yes, there's a fascinating line of research called intrinsic reinforcement learning. If you Google for "intrinsic motivation" you'll find some research papers on it, and then there was some recent follow-on work, I think by DeepMind or some of those groups. But intrinsic motivation is the term to Google, where you reward the reinforcement learning algorithm for finding new things about the world. Oh, I see, right. [01:18:27] [Student: How many actions should you take before updating pi?] Um, try to do it as frequently as possible. If you're doing this with a real robot, what I've seen is that this sometimes means going to a
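The Boltzmann exploration rule just described ("probability proportional to e to the value, with a scaling factor") can be sketched as follows. The temperature parameter and the per-action value estimates here are illustrative assumptions, not values from the lecture:

```python
import numpy as np

def boltzmann_action(values, temperature=1.0, rng=None):
    """Sample an action with probability proportional to exp(value / temperature):
    strongly biased toward high-value actions, but every action keeps some mass."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(values, dtype=float) / temperature
    logits -= logits.max()                 # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(rng.choice(len(probs), p=probs))
```

With values like [10, 1] and a moderate temperature, almost all samples pick the first action, but the second still has nonzero probability, unlike a purely greedy rule.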
[01:18:42] physical robot. And so, you know, with one of my helicopters, you'd go out to the field for a day, collect all the data, and then go back to the lab in the evening and rerun the algorithms. But if there's no barrier to running this all the time, then it doesn't hurt performance; just run it as frequently as you can. All right, that's it for the basics of MDPs. On Wednesday we'll continue with generalizing all of this to continuous state. Okay, let's break; see you Wednesday.

================================================================================ LECTURE 018 ================================================================================

Lecture 18 - Continuous State MDP & Model Simulation | Stanford CS229: Machine Learning (Autumn 2018)
Source: https://www.youtube.com/watch?v=QFu5nuc-S0s

--- Transcript

[00:00:03] All right, hey everyone, welcome back. Um, so let's continue our discussion today of reinforcement learning and MDPs, and specifically what I hope you learn from today is how to apply reinforcement learning even to continuous-state or
infinite-state MDPs. So I'll talk about discretization, model-based RL, talk about models and simulation, and fitted value iteration, which is the main algorithm I want to lead up to for today. [00:00:38] Just a recap, because we're going to build on what we learned in the last two lectures; I want to make sure that you have the notation fresh in your mind. [00:00:47] An MDP was states, actions, transition probabilities, discount factor, reward; that was an example. V^pi was the value function for a policy pi, which is the expected payoff if you execute that policy starting from a state s, and V* was the optimal value function. And last time we figured out that if you know what V* is, then pi*, the optimal policy, or the optimal action for a given state, can be computed as the argmax of that. And one thing, though, that we'll come back to later is that an equivalent way of writing that formula is that this is the expectation
with respect to s' drawn from P_sa, of V*(s'); that is, pi*(s) = argmax_a E_{s' ~ P_sa}[V*(s')]. [00:01:39] So we have been working with discrete-state MDPs, with the eleven-state MDP, so this is a sum over all the states s'. But when we go to continuous-state MDPs, the generalization of this, what this becomes, is the expected value, with respect to s' drawn from the state transition probabilities indexed by s and a (the current state and current action), of the value that you attain in the future, V*(s'). [00:02:13] And we saw the value iteration algorithm; we also talked about value iteration and policy iteration, but today we'll build on value iteration. The value iteration algorithm uses Bellman's equation, which says: take the left-hand side, set it to the right-hand side. And for V*, if V were equal to V*, the left-hand side is equal to the right-hand side. That was, um, oh, I'm sorry, it's missing a max there, right. So if V were equal to V*, then the left-hand side and the right-hand side would be equal to each other. But what value iteration does is: it's an algorithm that initializes V(s) := 0 and repeatedly carries out this update until V converges to V*, and after that you can then compute pi*, or, for every state, find the optimal action. [00:03:05] Okay, so, because we're going to build on this notation and this set of ideas today, I just want to make sure all this makes sense. Any questions? [00:03:22] Okay, all right. So everything we've done so far was built on the MDP having a finite set of states; the eleven-state MDP was a discrete set of states. Um, last time, on Monday I think, someone asked how you handle continuous states. So we'll work on that today. But let's say you want to
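The value iteration update just recapped, V(s) := R(s) + gamma * max_a sum_{s'} P_sa(s') V(s'), can be sketched for a small discrete MDP as below. This is a minimal sketch with a made-up two-state MDP, not the lecture's eleven-state grid world:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, n_iters=500):
    """P[a][s, s'] = transition probability P_sa(s'); R[s] = reward at state s.
    Repeats the Bellman backup V(s) := R(s) + gamma * max_a sum_s' P_sa(s') V(s')."""
    V = np.zeros(len(R))
    for _ in range(n_iters):
        V = R + gamma * np.max([P[a] @ V for a in range(len(P))], axis=0)
    return V

def optimal_action(P, V, s):
    """pi*(s) = argmax_a E_{s' ~ P_sa}[V(s')], the formula recapped above."""
    return int(np.argmax([P[a][s] @ V for a in range(len(P))]))

# Tiny example: action 0 stays put, action 1 jumps to state 1 (reward 1).
P = [np.eye(2), np.array([[0.0, 1.0], [0.0, 1.0]])]
R = np.array([0.0, 1.0])
V = value_iteration(P, R, gamma=0.5)   # converges to V* = [1.0, 2.0]
```

Here `optimal_action(P, V, 0)` returns action 1: from state 0 it is best to jump toward the rewarding state.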
build a car, right. Let's say you want to build a car, maybe a self-driving car. [00:03:52] The state space of a car is, let's see. Um, instead of taking my artistic side view of the car, if you take a top-down view of a car, all right, so this is from the satellite imagery, you know, top down, here's the car with its wheels drawn this way. How do you model the state of a car? [00:04:16] Well, um, the way to model the state of a car that's driving around the planet Earth is that you need to know the position, and so that can be represented as (x, y), two numbers to represent, you know, roughly your latitude and longitude or something, right. You probably want to know the orientation of the car, theta, maybe measured relative to north: what's the orientation of the car. And then it turns out, if you're driving at very low speeds this is fine, but if you're driving at anything other than very [00:04:53] low speeds, then we'll often include in the state space also the velocities and angular velocity. So x-dot is the velocity in the x direction, so x-dot is dx/dt, right, the velocity in the x direction; y-dot is the velocity in the y direction; and theta-dot is the angular velocity, the rate at which your car is turning. [00:05:14] Okay, and it's sort of, um, up to you how you want to model the car. Is it important to model the current angle of the steering wheel? Is it important to model how worn down your front-left tire is, as opposed to how worn down your rear-right tire is? So depending on the application you are building, it's up to you to decide what is the state space you want to use to model this car. And I guess if you're building a car to race on a racetrack, maybe it is important to model what the temperature of the engine is, and how worn down each of
your four tires is, separately. But for a lot of normal driving, this would be, you know, a sufficient level of detail to model the state space. So this is a six-dimensional state space representation. [00:06:08] Oh, and for those of you that work in robotics, the position-and-orientation part would be called the kinematic model of the car, and it becomes a dynamics model of the car if you model the velocities as well. Um, let's see, how about a helicopter? [00:06:27] All right, how do you model the state of a helicopter? A helicopter flies around in 3D rather than driving around in 2D, and so a common way to model the state of a helicopter would be to model it as having a position (x, y, z), and then also a 3D orientation. The orientation of a helicopter is usually modeled with three numbers, which we sometimes call the roll, pitch, and yaw. Right, so, you know, if you're ever in an airplane: roll is, are you rolling to the left [00:06:56] or right; pitch is, are you pitching up or down; and yaw is, are you facing north, south, east, or west. So this is one way to turn the three-dimensional orientation of an object like an airplane or helicopter into three numbers. [00:07:10] So the details aren't important; if you actually work on a helicopter you can figure this out, but for today's purposes just write, I guess, the roll phi, pitch theta, and yaw psi to represent the orientation: a three-dimensional object flying around is conventionally represented with three numbers such as roll, pitch, and yaw. And then also x-dot, y-dot, z-dot and phi-dot, theta-dot, psi-dot, the linear velocity and the angular velocity. Okay. [00:07:47] Maybe just one last example. So it turns out, in reinforcement learning, maybe in the early, early history of reinforcement learning, one of the problems that a lot of people just happened to work on, and that you therefore see in a lot of
reinforcement learning textbooks, is something called the inverted pendulum problem. [00:08:06] What that is, is a little toy: there's a little cart that's on wheels, it's on a track, and you have a little pole that is attached to this cart, and there's a free swivel there. [00:08:27] And so this pole just flops over; this pole just swings freely, and there's no motor, there's no motor at this little hinge there. And so the inverted pendulum problem is, let's see if I've drawn this right: if you have a free pole, and this is your cart moving left and right, can you, with that swivel, kind of balance the pole? [00:09:02] And so one of the common textbook examples of reinforcement learning is, can you choose actions over time to move this left and right so as to keep the pole oriented upright. And so for a problem like this, if you [00:09:17] have a linear rail, just one-dimensional, you know, like a real railway track that this cart is on, the state space would be: x, which is the position of the cart; theta, which is the orientation of the pole; as well as x-dot and theta-dot. [00:09:38] So this would be a four-dimensional state space for the inverted pendulum, if it's, like, running left and right on a railway track, a one-dimensional railway track, right. Um, [00:09:52] so for all of these problems, if you want to build, you know, a self-driving car and have it do something, or build an autonomous helicopter and have it either hover or fly a trajectory, or keep the pole upright in the inverted pendulum, these are examples of robotics problems where you would model the state space as a continuous state space. So what I want to do today is focus on problems where the state space is R^n, an n-dimensional set of real numbers.
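For concreteness, the three state spaces just described can be written out as plain vectors. A sketch; the component ordering follows the lecture, but the example numbers are made up:

```python
import numpy as np

# Inverted pendulum: 4-D state (x, theta, x_dot, theta_dot)
pendulum_state = np.array([0.0, 0.05, 0.0, -0.1])

# Car: 6-D state (x, y, theta, x_dot, y_dot, theta_dot)
car_state = np.array([12.0, 3.5, 1.6, 4.0, 0.1, 0.0])

# Helicopter: 12-D state
# (x, y, z, phi, theta, psi, x_dot, y_dot, z_dot, phi_dot, theta_dot, psi_dot)
helicopter_state = np.zeros(12)
```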
In these examples, I guess, n would be four, or six, or twelve, right. [00:10:28] Oh, and again, for the mathematicians in this class: technically angles are not real numbers, because they wrap around; we go to 360 and then wrap around to zero. But I think for the purposes of today that's not important, so we'll just treat this as R^n. [00:10:57] So, um, the most straightforward way to work with a continuous state space is discretization, where, you know, you might have in this example a two-dimensional state space, maybe x and theta for the inverted pendulum, and then you just lay down a set of grid values, right, and discretize it back to a discrete-state problem. And so, you know, you can give the states a set of names, one, two, three, four, whatever, and anywhere within that little square you just pretend that your MDP, that your robot, is in state number one. So this takes a continuous-state problem and turns it back into a discrete-state problem. Um, [00:11:48] this is such a simple, straightforward way to do it that it's actually reasonable for small problems, and if you have a relatively small, low-dimensional state-space MDP, like the inverted pendulum problem, where you're four-dimensional, it's actually perfectly fine to discretize the state space and solve it this way. [00:12:05] Let me describe some disadvantages of discretization first, and then I'll say a little bit about when you should just use discretization, because even though it's not the best algorithm, it works fine for smaller problems; but for bigger problems we'll have to go to more sophisticated algorithms like fitted value iteration. Okay. But, um, so what are the problems with discretization?
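The grid discretization just described can be sketched as a function mapping a continuous state to a numbered cell. A minimal sketch, assuming a uniform grid over a bounding box; the bounds and resolution are illustrative:

```python
import numpy as np

def discretize(state, lows, highs, bins_per_dim):
    """Map a continuous state inside the box [lows, highs] to one discrete
    state number, using bins_per_dim equal-width buckets per dimension."""
    state, lows, highs = np.asarray(state), np.asarray(lows), np.asarray(highs)
    # which bucket along each dimension (clipped so boundary states stay in-grid)
    idx = ((state - lows) / (highs - lows) * bins_per_dim).astype(int)
    idx = np.clip(idx, 0, bins_per_dim - 1)
    # flatten the per-dimension bucket indices into a single state name
    return int(np.ravel_multi_index(idx, (bins_per_dim,) * len(idx)))
```

For a 2-D (x, theta) state with 10 buckets per dimension this yields 100 named states; note that the count is bins_per_dim**n, which is exactly the curse of dimensionality the lecture turns to.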
star and PI star right which is you know remember the [00:12:59] right which is you know remember the very first problem we talked about of [00:13:02] very first problem we talked about of predicting housing prices [00:13:05] predicting housing prices imagine if X was the size of a house and [00:13:10] imagine if X was the size of a house and vertical axis was the price of a house [00:13:13] vertical axis was the price of a house and you had a data set that look like [00:13:15] and you had a data set that look like this discritization is that the [00:13:21] this discritization is that the discretization equivalent of trying to [00:13:23] discretization equivalent of trying to for the function of this data would be [00:13:25] for the function of this data would be to look at the input feature and you [00:13:29] to look at the input feature and you know let's discretize it into five [00:13:32] know let's discretize it into five values and for each of these little [00:13:34] values and for each of these little buckets in each of these five intervals [00:13:36] buckets in each of these five intervals let's fit a constant function right [00:13:40] let's fit a constant function right something like that so this staircase [00:13:44] something like that so this staircase would be how you know descritization [00:13:47] would be how you know descritization will represent the price of a house as a [00:13:49] will represent the price of a house as a function of the size and the analogy is [00:13:55] function of the size and the analogy is that what we're doing in reinforcement [00:13:57] that what we're doing in reinforcement learning is you want to approximate the [00:13:59] learning is you want to approximate the value function and if you were to [00:14:01] value function and if you were to discretize it then on the x axis is [00:14:04] discretize it then on the x axis is maybe the state and now I'm down to one [00:14:07] maybe the state and now I'm down to one dimensional 
state right because that's [00:14:08] dimensional state right because that's what I can plot and you're saying that [00:14:10] what I can plot and you're saying that well let's approximate the value [00:14:12] well let's approximate the value function you know as a as a staircase [00:14:16] function you know as a as a staircase function as a function of the set of [00:14:18] function as a function of the set of states right and you know and this is [00:14:20] states right and you know and this is not terrible if you have a lot of data [00:14:21] not terrible if you have a lot of data and very few input features you can get [00:14:23] and very few input features you can get away with this this will work okay but [00:14:25] away with this this will work okay but this doesn't it doesn't seem to allow [00:14:29] this doesn't it doesn't seem to allow you to fit a smooth function right so [00:14:31] you to fit a smooth function right so that's one downside so it's not a very [00:14:34] that's one downside so it's not a very good representation and the second [00:14:37] good representation and the second downside is the [00:14:46] right someone fancifully named curse of [00:14:49] right someone fancifully named curse of dimensionality which is Richard bellman [00:14:53] dimensionality which is Richard bellman had given this name as a cool sounding [00:14:56] had given this name as a cool sounding name but what it means is that if the [00:14:59] name but what it means is that if the state spaces in RN and disparate eyes [00:15:05] you know each dimension into K values [00:15:14] you know each dimension into K values then you get paid to the end discrete [00:15:19] then you get paid to the end discrete states so if this critize position and [00:15:26] states so if this critize position and orientation into ten values which is [00:15:28] orientation into ten values which is quite small then you end up with you [00:15:31] quite small then you end up with you know ten to ten 
[00:14:37] And the second downside is the fancifully named curse of dimensionality. Richard Bellman gave it this name as a cool-sounding name, but what it means is that if the state space is R^n and you discretize each dimension into k values, then you get k^n discrete states. So if you discretize position and orientation into ten values each, which is quite small, then you end up with, you know, 10^10 states, which grows exponentially in the dimension n of the state space. [00:15:37] So discretization works fine if you have relatively low-dimensional problems: two dimensions, no problem; four dimensions, maybe okay. But for very high-dimensional state spaces, this is not a good representation. And it turns out, to take a slight aside from continuous state spaces, that the curse of dimensionality also applies to very large discrete-state MDPs. [00:16:03] So for example, one of the places people have applied reinforcement learning is in factory optimization. If you have a factory with a hundred machines, and every machine in the factory is doing something slightly different, and each machine can be in k different states, then the total number of states of your factory is k to the power of 100, right? So the curse of dimensionality also applies to very large discrete state spaces, such as a factory with a hundred machines, where your total state space becomes k^100. And it turns out that for this type of discrete state space, fitted value iteration can be a much better algorithm as well; we'll get to fitted value iteration in a little bit, okay? [00:17:00] So, some practical guidelines. Now, despite all this criticism of discretization, if you have a small state space it's a simple method to apply, and if your problem is very small, go ahead and discretize; it can be one of the quick things to try to just get something working. So let me share with you some guidelines. This is how I do it, I guess. If you have a two-dimensional or three-dimensional state space, no problem, just discretize; usually, for a lot of problems, it's just fine.
[00:17:47] If you have maybe a four- to six-dimensional state space, I would think about it, but it will still often work. So for the inverted pendulum, which is a four-dimensional state space, it works just fine. I've had some friends work on trying to ride a bicycle, which you can model with a six-dimensional state space, and discretization kind of works; it works if you put some work into it. One of the tricks you want to use as you approach the four- to six-dimensional range is to choose your discretization more carefully. [00:18:23] So for example, if the state s2 is really important, so you think the actions you need to take, or the value, or the performance is really sensitive to state s2 and less sensitive to state s1, then in this range people end up designing an unequal discretization, where you might discretize s2 much more finely than s1, right? And the reason you do that is that the number of discrete states is blowing up exponentially, something to the power of the number of dimensions, and these tricks allow you to reduce a little bit the number of discrete states you end up with. [00:18:57] I think if you have a seven- or eight-dimensional problem, that's pushing it; that's when I would start to be nervous and be increasingly inclined not to use discretization. I personally rarely use discretization for problems that are 8-dimensional, and when your problem is even higher-dimensional than this, like 9, 10 and higher, then I would very seriously consider an algorithm that does not discretize. It's very, very rare to use discretization for problems as high as even 7 or 8 dimensions.
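The unequal-discretization trick can be sketched as follows. This is a hypothetical example (the ranges and bucket counts are invented): s2 is assumed to matter more, so it gets 16 bins while s1 gets only 4, for 64 cells total instead of the 256 that a uniformly fine 16-by-16 grid would cost.

```python
import numpy as np

# Assume (for illustration) both state variables live in [-1, 1], and the
# value is much more sensitive to s2 than to s1: discretize s2 finely.
N1, N2 = 4, 16
s1_edges = np.linspace(-1.0, 1.0, N1 + 1)
s2_edges = np.linspace(-1.0, 1.0, N2 + 1)

def state_to_cell(s1, s2):
    """Map a continuous state (s1, s2) to a flat discrete cell index."""
    i = np.clip(np.digitize(s1, s1_edges) - 1, 0, N1 - 1)
    j = np.clip(np.digitize(s2, s2_edges) - 1, 0, N2 - 1)
    return int(i * N2 + j)   # one of N1 * N2 = 64 cells

cell = state_to_cell(0.3, -0.7)
```

With tabular value iteration, the value table then has 64 entries instead of 256, and the savings compound across every extra dimension you can afford to keep coarse.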
[00:19:31] I've seen it done on rare occasions, but these things get worse exponentially with the number of dimensions. So maybe that's a set of guidelines for when to use discretization and when to seriously consider doing something else. All right. So in the alternative approach that you'll see today, what you'll be able to do is approximate V* directly, [00:20:08] without resorting to discretization. And there's an analogy that we'll make later, alluding to this plot again: an analogy between linear regression, where you're trying to approximate y as a function of x, and value iteration, where you're trying to learn an approximation of V as a function of s. Which is that in linear regression you say, let's approximate y as a linear function of x; or, if you don't want to use the raw features x, what you can do is use, you know, theta transpose... oh, I'm sorry, theta transpose phi of x, right, where phi of x is the features of x. [00:21:19] So this is what linear regression does, where if x is your house, then maybe phi of x is equal to, you know, x1, x2, x1 squared, x1 times x2, and so on, right? So that's how you can use linear regression to approximate the price of a house, either as a function of the raw features or as a function of some slightly more sophisticated, more complex features of the house. [00:21:48] And what you'll see in fitted value iteration is a model where we will approximate V*(s) as a linear function of features of the state. Okay, so that's the algorithm we'll build up to, and, yeah, we're going to try to use linear regression, with a lot of modifications, to approximate the value function.
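As a concrete, made-up instance of the theta-transpose-phi-of-x idea: the feature map below is the [x1, x2, x1 squared, x1 times x2] example from the board plus an intercept term, and the data are synthetic so the fit can be checked.

```python
import numpy as np

def phi(x):
    """Hand-designed features of the raw input x = (x1, x2):
    intercept, x1, x2, x1^2, x1*x2 (the feature map from the board)."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 ** 2, x1 * x2])

rng = np.random.default_rng(0)
X_raw = rng.uniform(-1, 1, size=(50, 2))
true_theta = np.array([3.0, 1.0, -2.0, 0.5, 0.25])

Phi = np.stack([phi(x) for x in X_raw])   # 50 x 5 design matrix
y = Phi @ true_theta                      # noise-free targets

# ordinary least squares: fit theta in  y ~ theta^T phi(x)
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
```

Fitted value iteration will reuse exactly this mechanic, except the inputs are states s, the features are phi(s), and the regression targets come from the Bellman backup estimates of V(s).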
[00:22:25] And again, in reinforcement learning and value iteration, your goal is to find a good approximation to the value function, because once you have that, you can then use, you know, the equation we had earlier to compute the optimal action for every state, right? So we'll just focus on computing the value function. [00:22:43] Now, in order to derive the fitted value iteration algorithm: it turns out that fitted value iteration works best with a model, or a simulator, of the MDP. So let me describe what that means and how you get a model, and then we'll talk about how you can actually implement the fitted value iteration algorithm and have it work on these types of problems, okay? [00:23:18] All right. So what a model, or a simulator, of your robot is, is just a function that takes as input a state, takes as input an action, and outputs the next state s' drawn from the state transition probabilities, okay? [00:23:58] And the way the model is built: the state is just a real-valued vector, okay? Oh, and I think for simplicity, for now let's assume that the action space is discrete. It turns out that for a lot of MDPs the state space can be very high-dimensional and the action space is much lower-dimensional than the state space. So for example, for a car, you know, s is six-dimensional, but the space of actions is just two-dimensional, right: the steering and braking. It turns out for a helicopter, you know, the state space is twelve-dimensional, and (I guess I wouldn't expect most of you to know how a helicopter flies) it turns out that you have four-dimensional actions for the helicopter: the way you fly one of these is with two control sticks, so your left hand and your right hand each have two dimensions of control. [00:25:06] And for the inverted pendulum, the state space is 4-D and the action space is just 1-D, right, you move left or right. So you actually see in a lot of reinforcement learning problems that it's quite common for the state space to be much higher-dimensional than the action space. And so let's say for now that we do not want to discretize the state space, because it's very high-dimensional, but just for the sake of simplicity let's say we discretize the action space for now, right, which is usually much easier to do. But I think as we develop fitted value iteration, you might get hints of when maybe you don't need to discretize the action space either. But let's just say we have a discrete action space. [00:26:10] So, all right, how do you get a model? One way to build a model is to use a physics simulator.
[00:26:31] So, you know, in the case of an inverted pendulum, the action is the acceleration you apply, either positive or negative, to the cart along the x-axis, right? And the state space is four-dimensional, right. And it turns out that if you sort of flip open a physics textbook and use Newtonian mechanics, if you know the mass of the cart (actually, I think that says the mass of the cart and the mass of the pole) and the length of the pole, it turns out you can derive equations for, say, theta double dot, the angular acceleration, right? Don't worry about the details; think of this as a physics derivation rather than something you need to learn, where, you know, l is the length of the pole, m is the mass of the pole, M is the mass of the cart, a is the force exerted, [00:27:33] and so on. And a conventional physics textbook will kind of let you derive these equations. Or, rather than trying to derive these yourself using Newtonian mechanics, or enlisting the help of a physicist friend, there are also a lot of open-source physics simulator software packages: you can download an open-source simulator, plug in the dimensions and masses and so on of your system, and it'll spit out a simulator that tells you how the state evolves from one time step to another, right? [00:28:06] And so in this example, the simulator will say that s' is equal to s plus delta t times s dot, where delta t could be, let's say, 0.1 seconds, right? So if you want to simulate this at 10 Hz, that's 10 updates per second, so that the time difference between the current state and the next state is one tenth of a second.
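That update is just a forward Euler step. Here is a generic sketch; the state layout and the derivative values are placeholders, since the real s-dot would come from the cart-pole equations or a physics engine.

```python
import numpy as np

DT = 0.1  # simulate at 10 Hz: one tenth of a second per step

def simulator_step(s, s_dot):
    """One forward Euler update, s' = s + dt * s_dot.  s_dot is whatever
    derivative the physics (textbook equations or a physics engine)
    reports for the current state and action."""
    return s + DT * s_dot

s = np.array([0.0, 0.0, 0.1, 0.0])       # e.g. [x, x_dot, theta, theta_dot]
s_dot = np.array([0.0, 0.5, 0.0, -0.2])  # placeholder derivative
s_next = simulator_step(s, s_dot)
```

Looping this step while feeding each s_next back in, with s_dot recomputed from the current state and action, is the whole simulator.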
[00:28:40] Then you write a simulator like this, okay? But really, the most common way to do this is not to actually derive the physics update equations; the most common way to do this is to just download one of the open-source physics engines, right? So this will work okay for problems like the inverted pendulum. I once used a physics engine to build a simulator for a four-legged robot, and managed to use reinforcement learning together with it to get the robot to walk around, right? So it works. [00:29:21] The second way to get a model is to learn it from data, right, and I'd say people end up using this much more often. So here's what I mean. Let's say you want to build a controller for an autonomous helicopter, right? So this is a case study, and what I'm describing is real; like, this will actually work. So let's say you have a helicopter and you want to build an autonomous controller for it. What you can do is start your helicopter off in some state s0, right; so with GPS, accelerometers, and a magnetic compass, you can just measure the position and orientation of the helicopter. And then have a human pilot fly the helicopter around. So the human pilot, you know, using the control sticks, will command the helicopter with some action a0, and then a tenth of a second later the helicopter will get to some slightly different position and orientation, s1. And then the human pilot will just keep on moving the control sticks, and so you record down what action they're taking, a1, and based on that the helicopter will get to some new state s2; and then they'll take some action a2 and get to some state s3, and so on, up to some final time, which let me just write as capital T, right? [00:30:59] So in other words, what you do is take the helicopter out to the field and hire a human pilot to fly this thing for a while, and record the position of the helicopter ten times a second, and also record all the actions the human pilot was taking on the control sticks, okay? And then do this not just one time, but do this M times. So let me use a superscript (1) (you get the idea) to denote the first trajectory; so you do this a second time, and so on, and maybe do this M times. [00:31:43] So that's just a lot of math for saying: fly the helicopter around, you know, M times, and record everything that happened. And now your goal is to apply supervised learning, right, to estimate s_{t+1} as a function of s_t and a_t. So the job of the model, the job of the simulator, is to take as input the current state and the current action and tell you where the helicopter is going to go, you know, like, 0.1 seconds later.
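The data collection just described amounts to logging (state, action) pairs at 10 Hz over M flights. A toy sketch follows; the "pilot" and the dynamics here are random stand-ins, purely to show the shape of the data.

```python
import numpy as np

rng = np.random.default_rng(0)

def record_trajectories(M, T, n_s=4, n_a=2):
    """Record M trajectories: at each of T steps, log the state s_t and
    the action a_t, yielding the (s, a) sequences from the lecture."""
    A_true = 0.9 * np.eye(n_s)                 # stand-in dynamics
    B_true = 0.1 * rng.normal(size=(n_s, n_a))
    trajectories = []
    for _ in range(M):
        s = rng.normal(size=n_s)               # initial state s_0
        states, actions = [s], []
        for _ in range(T):
            a = rng.normal(size=n_a)           # stand-in for pilot input
            s = A_true @ s + B_true @ a        # next state
            actions.append(a)
            states.append(s)
        trajectories.append((np.array(states), np.array(actions)))
    return trajectories

trajs = record_trajectories(M=5, T=50)  # 5 flights, 50 steps each
```

Each trajectory i then contributes T training pairs, mapping (s_t, a_t) to s_{t+1}, for the supervised learning step.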
[00:32:44] And so, given all this data, what you can do is apply a supervised learning algorithm to predict, well, what is the next state s' as a function of the current state and action, right? And the mapping between the notations is that when I drew the box for the simulator above, I was using s' to denote s_{t+1} and s to denote s_t, right? [00:33:10] And so if you use the linear regression version of this idea, you would say: let's approximate s_{t+1} as a linear function A s_t of the previous state, plus another linear function B a_t of the previous action. And it turns out this actually works okay: for helicopters flying at slow speeds, this is actually not a terrible model. If your helicopter is moving slowly and not flying upside down, if your helicopter is flying in a relatively level way and kind of at slow speeds, this model is not too bad. If you fly your helicopter in highly dynamic situations, flying very fast, making very fast aggressive turns, this is not a great model, but it's okay at slow speeds. [00:34:06] And so I guess A here will be an n-by-n matrix, because the state space is n-dimensional, so A is a square matrix, and B will usually be a tall skinny matrix, where the dimension of B is the dimension of the state space by the dimension of the action space, right? And so in order to fit the parameters A and B, you would minimize, with respect to A and B, the sum over the trajectories i and the time steps t of the squared norm of s_{t+1}^{(i)} minus (A s_t^{(i)} + B a_t^{(i)}). [00:35:09] So you want to approximate s_{t+1} as a function of s_t and a_t, and so, you know, it's pretty natural to fit the parameters of this linear model in a way that minimizes the squared difference between the left-hand side and the right-hand side. Wait, did I screw up? Okay, oh, sure. [00:35:37] What's the difference between flying the helicopter M times versus flying the helicopter once for a very long time?
[00:35:44] ...in this example it makes no difference — it's fine either way. For the purposes of this class it doesn't matter. For practical purposes: if you fly the helicopter m times, it turns out the fuel burns down slowly, and so the way the helicopter flies changes slowly, and you'd want to average over how much fuel you have, over wind conditions — that's what's actually done. But for the purposes of understanding, flying a single time for a long time is just fine as well. Okay.

[00:36:22] So this is the linear regression version of this, and when we talk about some other models later, called LQR and LQG, you'll see this linear regression version of the model as well — this is just a linear model of the dynamics. We'll come back to linear dynamical models next week. But it turns out that if you want to use a nonlinear model, you can also plug in phi(s) — and maybe phi(a) as well — if you want a nonlinear model, and this will work even better depending on your choice of features. Okay.

[00:37:11] Now, finally, having run this little linear regression thing — and this is not quite linear regression, because A and B are matrices, but you can minimize this objective — it turns out this is equivalent to running linear regression n times. So if s has, say, 12 dimensions, this turns out to be equivalent to running linear regression n times: to predict the first state variable, the second state variable, the third state variable, and so on. That's what this is equivalent to. But having done this, you now have a choice of two possible models. One model would be to just set s_{t+1} = A s_t + B a_t.
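As a rough sketch of the step just described — all names here are mine, not the lecture's — here is one way to fit A and B by least squares from a logged trajectory. Stacking [s_t; a_t] into one design matrix solves the whole matrix problem at once, which is exactly equivalent to running n separate linear regressions, one per state dimension:

```python
import numpy as np

def fit_linear_dynamics(states, actions):
    """Fit s_{t+1} ~ A s_t + B a_t by least squares.

    states:  (T+1, n) array — a trajectory of visited states
    actions: (T, d)   array — the action taken at each step
    """
    X = np.hstack([states[:-1], actions])   # (T, n+d): inputs [s_t, a_t]
    Y = states[1:]                          # (T, n):   targets s_{t+1}
    # Solve min_W ||X W - Y||^2; each column of W is one of the
    # n independent linear regressions mentioned in the lecture.
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    n = states.shape[1]
    A = W[:n].T                             # (n, n)
    B = W[n:].T                             # (n, d)
    return A, B
```

On noise-free data generated by a true linear system, this recovers A and B exactly; on real flight logs it gives the least-squares fit.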
[00:37:55] Or another version would be to set s_{t+1} = A s_t + B a_t + epsilon_t, where epsilon_t is distributed, maybe, from a Gaussian — from a Gaussian density. Okay. And so this first model would be a deterministic model, and this one would be a stochastic model. And if you use a stochastic model, that's saying that when you're running your simulator — when you're running the model — every time you generate s_{t+1}, you'd be sampling this epsilon from a Gaussian and adding it to the prediction of your linear model. And if you use a stochastic model, what that means is that if you simulate a helicopter flying around, your simulator will generate random noise that adds and subtracts a little bit to the state of the helicopter, as if there were little wind gusts blowing the helicopter around. Okay.
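A minimal sketch of the two choices — deterministic versus stochastic simulation. The lecture leaves the noise distribution unspecified beyond "Gaussian", so the isotropic `noise_std` parameter and the function name below are my own illustrative assumptions:

```python
import numpy as np

def simulate_step(A, B, s, a, noise_std=0.0, rng=None):
    """One simulator step under s_{t+1} = A s_t + B a_t (+ epsilon_t).

    noise_std == 0 gives the deterministic model; noise_std > 0 adds
    Gaussian "wind gust" noise epsilon_t ~ N(0, noise_std^2 I).
    """
    if rng is None:
        rng = np.random.default_rng()
    eps = rng.normal(0.0, noise_std, size=s.shape)
    return A @ s + B @ a + eps
```

Rolling the simulator forward just means calling this in a loop, feeding each output state back in as the next input.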
[00:39:38] In most cases when you're building reinforcement learning models — so the approach we're taking here is called model-based reinforcement learning, where you build a model of your robot, then train the reinforcement learning algorithm in the simulator, and then take the policy you learned — take the policy pi you learned in simulation — and apply it back on your real robot. All right, so this approach we're taking is called model-based RL. There is an alternative called model-free RL, which is where you just run your reinforcement learning algorithm on the robot directly — let the robot drive itself around, and so on — and learn that way. I think that, in terms of robotics applications, model-based RL has been taking off faster; a lot of the most promising approaches are model-based RL, because with a physical robot, you know, you just can't afford to have a reinforcement learning algorithm drive your robot around for too long — or how many helicopters do you want to crash before you learn to fly?

[00:40:41] Model-free RL works fine if you want to play video games, because if you're trying to get a computer to play chess or Othello or Go, you have a perfect simulator for the video game — which is the video game itself — and so your RL algorithm can blow up hundreds of millions of times in a video game, and that's fine. So for playing video games, or playing, you know, traditional games, model-free approaches can work fine. But a lot of the successful applications of reinforcement learning to robots have been model-based — although, again, the field is evolving quickly, so there's very interesting work at the intersection of model-based and model-free that gets more complicated. But I would say, if you want to use something tried-and-true for robotics problems, seriously consider using model-based RL, because you can then fly a helicopter in simulation, let it crash a million times, and no one's hurt — there's no physical damage anywhere in the world; it's just fine.

[00:41:46] And just one last tip — one of the things we learned building these reinforcement learning algorithms for a lot of robots. Having built this model, you might ask: well, how do I choose the distribution for this noise? How do you model the distribution of the noise? One thing you could do is estimate it from data. But as a practical matter, what happens is, so long as you remember to inject noise — so, let's see, it turns out that if you use a deterministic simulator, a lot of reinforcement learning algorithms will learn a very brittle policy that works in your simulator but doesn't
actually work when you put it on your real robot. [00:42:29] And so if you look on YouTube or Twitter over the last year or two, there have been a lot of cool-looking videos of people using reinforcement learning to control various weirdly configured robots — a snake robot, or some five-legged thing, or whatever crazy design. I don't know what has five legs, right — but if you build a five-legged robot, how do you control that? It turns out that if you have a deterministic simulator, using these methods it's not that hard to generate a cool-looking video of your reinforcement learning algorithm supposedly controlling a five-legged thing, or some crazy worm with two legs, or these crazy robots you can build in simulation. But it turns out that even though you can generate those types of videos in a deterministic simulator, if you use a deterministic model of the robot and you ever actually try to build a physical robot — if you take that policy from your physics simulator to the real robot — the odds of it working on the real robot are quite low if you used a deterministic simulator. Because the problem with simulators is that your simulator is never 100% accurate, right? It's always just a little bit off.

[00:43:46] And one of the lessons we learned — the whole field learned — applying RL to a lot of robots is that if you want your model-based RL to work, not just in simulation to generate a cool video, but to actually work on a physical robot, like a physical helicopter that you own, then it is really important to add some noise to your simulator. Because if the policy you learn is robust to a slightly stochastic simulator, then the odds of it generalizing, you know, to the real world — to the physical real world — are much higher than if you had a completely deterministic simulator. So whenever I'm building a robot — actually, with one exception, LQR and LQG, which we'll talk about next week — with one very narrow exception, I pretty much never use deterministic simulators when it comes to robotic control problems, assuming I want it to work in the real world as well. And again, you know, tips and tricks: the most important thing is to add some noise. And then, for the exact distribution of the noise — go ahead and try to pick something realistic, but the exact distribution of the noise matters less, I want to say, than just simply remembering to add some noise. Okay.

[00:45:20] By the way — you guys probably don't know
this, but my PhD thesis was on using reinforcement learning to fly helicopters. So you're talking to someone who's crashed a bunch of helicopters — model helicopters — and has lived through the pains and the joys of seeing this stuff work or not work.

[00:45:57] All right. So now that you have built a model — built a simulator — for your helicopter, or your four-legged robot, or your car, how do you approximate the value function? So, in order to apply fitted value iteration, the first step is to choose features phi(s) of the state s, and then we're going to approximate V*(s) using a function V(s) = theta^T phi(s). And so, you know, in the case of an inverted pendulum, phi(s) might contain x, x-dot, maybe x squared, or x times x-dot, or x times the pole orientation, and so on. So take the state s and think of some nonlinear features that you think might be useful for representing the value. And remember what the value is: the value of a state is your expected payoff from that state — your expected sum of discounted rewards. So the value function captures: if your robot starts off in this state, how well is it going to do from here? So when you're designing features, pick a bunch of features that you think help convey how well your robot is doing.

[00:47:48] And so maybe for the inverted pendulum, for example, if the pole is way over to the right, then maybe the pole will fall over — and we give it a reward of minus one when the pole falls over, right. But — sorry, I'm overloading notation a bit: theta is both the angle of the pole as well as the parameters — if the pole is falling way over, that looks like it's doing pretty badly, unless x-dot is very large and positive.
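For instance, a hand-designed feature vector for the inverted pendulum might look like the following sketch. The particular features are illustrative choices of mine, and I use `w` for the value-function weights to avoid the theta-as-angle collision the lecture mentions:

```python
import numpy as np

def phi(s):
    """Features of a cart-pole state s = (x, x_dot, ang, ang_dot),
    where x is cart position and ang is the pole angle."""
    x, x_dot, ang, ang_dot = s
    return np.array([
        1.0,                      # intercept term
        x, x_dot, ang, ang_dot,   # raw state variables
        x * x,                    # quadratic terms
        ang * ang,
        x * x_dot,                # position-velocity interaction
        ang * x_dot,              # pole angle times cart velocity
    ])

# The value is then approximated as V(s) = w @ phi(s) for learned weights w.
```

Any nonlinear combination you believe conveys "how well the robot is doing" is a legitimate candidate feature here.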
Right — and so maybe there's an interaction between theta and x-dot. You might say, well, let me have a new feature which is the angle of the pole multiplied by the velocity, because it seems like these two variables may depend on each other. So, just as when you're trying to predict the price of a house you'd ask, well, what are the most useful features for the price of a house — you do something similar for fitted value iteration.

[00:48:49] And one nice thing about model-based RL — one nice thing about model-based reinforcement learning — is that once you have built a model, you can collect an essentially infinite amount of data from your model, right. And so, with a lot of data, you can usually afford to choose a larger number of features, because you can generate a ton of data with which to fit this linear function. So you're usually not super constrained in terms of needing to be really careful not to choose too many features for fear of overfitting — you can get so much data from a simulator that you can usually make up quite a lot of features. And some of the features not being useful is okay, because you can get enough data from running your simulator for the algorithm to still find a pretty good set of parameters theta, even if you have a lot of features, since you can generate a lot of data to fit this function.

[00:49:50] So let's talk through the fitted value iteration algorithm. All right — you know, this is a long algorithm; let me just use a fresh board for this. [00:50:15] All right, so let me just write down the original value iteration algorithm for discrete states. So what we had previously was: we would update V(s) according
to V(s) := R(s) + gamma max_a sum over s' of P_sa(s') V(s'). So this is what we had last Monday, and I said at the start of today's lecture that you can also write this as V(s) := R(s) + gamma max_a E_{s' ~ P_sa}[ V(s') ].

[00:50:57] So let's take that and generalize it to fitted value iteration. [00:51:30] All right. So first, let's sample a set of states s^(1), ..., s^(m) randomly, and let's initialize theta := 0. And what we're going to do is — so, let's see, in linear regression you learn a mapping from x to y: you have a discrete set of examples of x, and you fit a function mapping from x to y. So what we're going to do here is learn a mapping from s to V(s): we're going to take a discrete set of examples of s, try to figure out what V(s) is for them, and then fit a straight line — you know, try to model that relationship, right. So just as you had a finite set of examples — a finite set of houses, a certain set of values of x in your training set — for predicting housing prices, we're going to see a certain set of states, and then use that finite set of examples, with linear regression, to fit V(s), right. So that's what this initial sample is meant to do.

[00:52:55] And so this is the outermost loop of value iteration — of fitted value iteration: then, for i = 1 through m... [00:54:11] All right, so what we're going to do is go over each of these m states, and for each one of them, and for each of the actions, we're going to take a sample of k next states in order to estimate that expected value, right. And so this expectation is over s' drawn from the state-transition distribution P_{s^(i) a}: it's saying, from this state, if you take this action, where do you get to? And so these two loops — this "for i = 1 through m", and "for each
action a" — this is just looping over every state and every action, and taking k samples — sampling k examples of where you get to if you take action a in a certain state s^(i), right. And so, by taking those k samples and computing this average, q(a), right, is your estimate of that expectation. Okay — so all we've done so far is take k samples, you know, from the distribution that s' is drawn from, and average V(s') over them — oh, I'm sorry: and if I move R(s) inside, sorry, then that's q(a). Sorry — let me just rewrite this to move R(s) inside. [00:56:05] So this is written as q(a) = (1/k) sum from j = 1 to k of [ R(s^(i)) + gamma V(s'_j) ] — yes, so if you move the max and the expectation out, then this — this is q(a). [00:56:58] Next, let's set y^(i) = max_a q(a). And so, by taking the max over a of q(a), that y^(i) is your estimate of the right-hand side of value iteration. [00:57:33] And so y^(i) is your estimate for this
quantity — for the right-hand side of value iteration. [00:57:57] Now, in the original value iteration algorithm — I'm just using "VI" to abbreviate value iteration — what we did was set V(s^(i)) := y^(i), right? That is, in the original value iteration algorithm, we would compute the right-hand side — this purple thing — and then set V(s^(i)) equal to that: we just set the left-hand side equal to the right-hand side. But in fitted value iteration, you know, V(s) is now approximated by a linear function, so you can't just go into a linear function and set its value at individual points. So what we're going to do instead, in fitted VI, is use linear regression to make V(s^(i)) as close as possible to y^(i). But V(s^(i)) is now represented as a linear function of the state — a linear function of the features of the state: V(s^(i)) = theta^T phi(s^(i)) — and you want that to be close to y^(i). And so the final step is: run linear regression to choose the parameters theta that minimize the squared error, theta := argmin_theta (1/2) sum from i = 1 to m of ( theta^T phi(s^(i)) - y^(i) )^2. [01:00:19] Oh yes — just let me make my curly braces match. [01:00:34] Okay. So that's fitted value iteration.

[01:00:50] Oh, this one? Oh — no, the m is used differently. So when we were learning a model, m was just how many times you fly the helicopter in order to build a model — the number of times you fly the helicopter in order to build a physics model of the helicopter dynamics. That has nothing to do with this m, which is the number of states you use in order to, sort of, anchor the regression. So I think the way to think about this is: you want to learn a mapping from states to V(s), and so this sample of m states is — we're going to choose m states on the x-axis.
[01:01:40] Right, so that m is the number of points you choose on the x-axis, and then in each iteration of value iteration [01:01:45] we're going to go through this procedure: you have s_1 up to s_m, and for each of these you're going to compute some value [01:01:58] y_i using this procedure, and then you fit a straight line to this sample of y_i's. [01:02:17] Think of the way you build a model and the way you apply fitted value iteration as two completely separate operations. So you could have one team of 10 engineers fly the helicopter around, you know, a thousand times, build a model, run linear regression, and they have a model; then they could publish the model on the internet, and a totally different team could download their model and do this. The second team doesn't need to talk to the first team at all, other than downloading the model off the internet. [01:02:49] Oh yes, good question. You mean this thing that's sampled K times? [01:03:02] Right, yep, that's a great question. Yes, that was one of my next points, which is: the reason you sample from this distribution is because you are using, well, you should do this if you're using a stochastic simulator, [01:03:16] right. And actually, let me also ask you guys: what should you do, how can you simplify this algorithm, if you use a deterministic simulator as opposed to a stochastic simulator? [01:03:34] Let's see. So if you have a deterministic simulator, then, you know, given a certain state and a certain action, it will always map to the exact same s', right? So how can you simplify this? [01:03:49] Yep. Yes: if you have a deterministic simulator, you can set K equal to one and sample only once, [01:04:05] because this distribution always returns the same value, so all of these K samples would be exactly the same.
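A toy illustration of this point (mine, not the lecture's): the inner average is a Monte Carlo estimate of an expectation, and when the simulator is deterministic every draw is identical, so one sample already gives the exact value.

```python
import random

def estimate_expectation(sample, k):
    """Approximate an expectation by averaging k draws from a simulator."""
    return sum(sample() for _ in range(k)) / k

random.seed(0)

stochastic = lambda: 5.0 + random.gauss(0.0, 1.0)   # noisy simulator, E[X] = 5
deterministic = lambda: 5.0                          # always returns the same s'

# Stochastic case: averaging many draws approaches the true mean.
approx = estimate_expectation(stochastic, 10000)
# Deterministic case: one draw already equals the expectation exactly,
# so sampling K times would be wasted work.
exact = estimate_expectation(deterministic, 1)
```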
So you might as well just do this once rather than K times. [01:04:34] This one? Oh no, this is actually square brackets. The thing is, we're trying to approximate this expectation, and the way [01:04:43] you approximate the mean is, you know, you sample K times and take the average, right? So what we've done here, in order to approximate this expectation, [01:04:52] is draw K samples, sum over them, and divide by K: you average over the K samples. [01:05:20] Let's see. So how do you choose m, and how do you test for overfitting? So, you know, once you have a model, one of the nice things about model-based RL is, let's say that phi(s) is 50 features; so let's [01:05:40] say you chose 50 features to approximate the value function of your inverted pendulum system. Then we know that you're going to be fitting [01:05:49] linear regression to this 50-dimensional state space.
I mean, this step [01:05:53] here, this is really linear regression. And so you can ask: if you want to run [01:06:01] linear regression with 50 parameters, how many examples do you need to fit the linear regression? And I would say, you know, if m were maybe 500, right, maybe that would be okay: you'd have ten examples per parameter to fit 50 parameters. But if, for computational [01:06:17] reasons, it doesn't run too slowly to even set m equal to 1000 or even 5000, then there's no harm in letting m be bigger. So usually m is just set to be as big [01:06:30] as you feel like, subject to the program not taking too long to run. Because, you know, unlike supervised learning, where if you're [01:06:40] fitting data to housing prices you need to go out and collect data, right, off Craigslist or Zillow or Trulia or Redfin or whatever, about prices of houses, and so data is [01:06:56] expensive to collect in the real world; once you have a model, you could set m equal to 5,000 or 10,000 or 100,000, and [01:07:03] then your algorithm will run more slowly, but so long as the algorithm doesn't run too slowly, there's no harm in setting m to be bigger. [01:07:18] Cool. So I know there's a lot going on in this algorithm, but this is fitted [01:07:26] value iteration, and if you do this, you can get reasonable behavior on a lot of robots by choosing sensible features and learning value [01:07:37] functions that approximate the expected payoff of a robot starting off in different states. Okay. Now just a few [01:07:52] details to wrap up, again some practical aspects of how you do this after you've learned all these parameters. [01:08:32] Oh, yes, thank you. Yes: so in this expression, where do you get V(s'_j) from? You would get this [01:08:43] from theta transpose phi(s'_j), using the parameters theta from the last iteration of fitted value iteration. Just as in value iteration, these are the [01:08:57] values from the last iteration that you use to update in the new iteration; so you use the last value of theta as it was updated. [01:09:07] Oh, and one other thing you could do. I talked about the linear regression version of this algorithm, and, you know, this [01:09:25] whole exercise is about generating a sample of s and y so you can apply linear regression to predict the value of y from the values of s, right? But [01:09:35] there's nothing in this algorithm that says you have to use linear regression. Now that you've generated this data set, there's this box up [01:09:43] here, and this is linear regression, right, but you don't have to use linear regression. In modern, you know, deep [01:09:50] reinforcement learning, one of the ways to go from reinforcement learning to deep reinforcement learning is to just use a neural network for [01:09:57] this step instead; then you call that deep reinforcement learning, and hey, it's legit, you [01:10:03] know. But you can also use locally weighted linear regression, or whatever regression algorithm you want, in order [01:10:12] to estimate y as a function of the state s. Yeah, and if you use a neural [01:10:19] network, it relieves the need to choose the features phi as well: you can feed in the raw features, you know, pole angle, pole [01:10:25] orientation, and use a neural network to learn the mapping, as in supervised learning. All right. [01:10:37] So one last important, I guess practical, implementation detail, which is: fitted [01:10:47] VI gives just an approximation to V*, [01:10:58] and this implicitly defines pi*, [01:11:08] right, because the definition of pi* is that pi*(s) is the argmax over actions a of the expected value of V*(s'). [01:11:36] So when you're running a robot, you know, you need to execute a policy: given the state, you pick an action.
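(An aside on the "regression box" just mentioned: the fitting step is pluggable. Below is a hedged sketch with illustrative names; the scikit-learn-style fit/predict interface is an assumption for clarity, and a neural network or locally weighted regression could be dropped into the same slot.)

```python
import numpy as np

class LinearV:
    """Least-squares value regressor; a stand-in for any 'regression box'."""
    def fit(self, X, y):
        X = np.asarray(X, float)
        self.theta, *_ = np.linalg.lstsq(X, np.asarray(y, float), rcond=None)
        return self
    def predict(self, X):
        return np.asarray(X, float) @ self.theta

def fit_value_function(features, ys, regressor):
    # Fitted VI's regression step: anything exposing fit/predict works here,
    # e.g. a neural network or locally weighted regression instead of LinearV.
    return regressor.fit(features, ys)
```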
And having computed V*, it only implicitly [01:11:50] defines the optimal policy pi*. And so if you're running a rover, or if you're running a robot in real time, then, [01:12:02] you know, if you're flying a helicopter you might have to choose control actions at ten hertz, meaning ten [01:12:07] times a second: given the state you have, you choose an action. If you're building a self-driving car, again, a ten-hertz [01:12:14] controller, choosing a new action maybe ten times a second, would be pretty reasonable. But how do you compute [01:12:20] this expectation and this maximization ten times a second? In fitted value iteration we used [01:12:32] K samples to approximate the expectation, but if you're running this in real time on a helicopter, you probably don't want to. At least, I don't know; for my [01:12:58] robotics implementations I have been reluctant to use a random number generator, right, in the inner loop of how we control a helicopter. It might work, [01:13:07] but, you know: to compute this argmax it's an approximate [01:13:12] expectation, and do you really want to be running a random number generator on a helicopter, where if you're really unlucky [01:13:18] the random number generator draws an unlucky value and causes the helicopter to do something bad? Again, just emotionally, I don't [01:13:28] feel very good if a self-driving car has a random number generator in the loop of how it's [01:13:34] choosing to drive. So just as a practical matter, there are a couple of tricks that [01:13:43] people often use. The simulator [01:13:58] is often of this form: [01:14:15] most simulators are of this form, where the next state is equal to some function of the [01:14:21] previous state and action, plus some noise, s_{t+1} = f(s_t, a_t) + epsilon_t. And so one thing that is [01:14:27] often done for your deployment, for the actual [01:14:39] policy you implement on the robot, is to set [01:14:44] epsilon_t equal to zero and set K equal to [01:14:50] one, right. And so this is a reasonable way to make this policy run [01:14:58] on a helicopter, which is: during training [01:15:02] you do want to add noise to the simulator, because it causes the policy you [01:15:07] learn to be much more robust to little errors in the simulator. Your simulator [01:15:11] is always going to be a little bit off; you know, maybe it didn't quite simulate wind [01:15:14] gusts, or when you turn, the helicopter doesn't bank by exactly the right amount. Some [01:15:18] of it, in practice, is always a little bit off, so it's important to have [01:15:23] noise in the simulator in model-based RL. But when you're deploying this on a [01:15:27] physical robot, one thing you could do that's very reasonable is just get rid [01:15:32] of the noise and set K equal to one. And so what you would do is: [01:15:46] whenever you're in the state s, [01:15:58] pick the action a according to the argmax [01:16:05] over a of V(f(s, a)). So this f is the f from here; this is the simulator with the [01:16:25] noise removed, okay? And so what you would do is, actually, you know, [01:16:32] computers are now fast enough that you could do this ten times a second, [01:16:34] right. If you want to control a helicopter or a self-driving car at ten hertz, you can [01:16:37] actually easily do this, you know, ten times a second. Which is: your car or your [01:16:42] helicopter is in some physical state in the world, so you know what s is, and so [01:16:47] you can quickly, for every possible action a that you could take, use a [01:16:53] simulator to simulate where your helicopter will go if you were to take [01:16:58] that action. So go ahead and run your simulator [01:17:00] once for each possible action you could take, right; computers are actually [01:17:04] fast enough to do this in real time. And then for each of the possible next [01:17:09] states you could get to, compute V applied [01:17:12] to that. So this is really, rather than s' [01:17:15] drawn from P_{sa}, using this, the deterministic simulator. [01:17:32] Right, so every tenth of a second you could use the simulator to try out every single [01:17:37] possible action, figure out where you would go under each and every single possible action, and apply your [01:17:45] value function to see, of all of these possible actions, which one gets my [01:17:50] helicopter, you know, in the next one tenth of a second, to the state that [01:17:55] looks best according to the value function you've learned from fitted [01:17:58] value iteration. And it turns out if you do this, then this is how you [01:18:06] actually implement something that runs in real time. And oh, I'll just [01:18:10] mention, you know, the idea of training with a stochastic simulator and [01:18:15] then just setting the noise to zero: it's one of those things that's not very [01:18:19] rigorously justified, but in practice this works well.
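A hedged sketch of the deployment-time loop just described (all names are illustrative, not from the lecture): each control tick, run the noise-free simulator once per candidate action, and pick the action whose predicted next state the learned value function scores highest.

```python
import numpy as np

def act(s, actions, f, v_hat):
    """Deployment-time policy: pi(s) = argmax_a V(f(s, a)).

    f     : deterministic simulator (noise term set to zero),
            f(s, a) -> next state
    v_hat : learned value function, v_hat(s') -> float
    Call once per control tick (e.g., 10 Hz for a helicopter or car).
    """
    scores = [v_hat(f(s, a)) for a in actions]   # one simulation per action
    return actions[int(np.argmax(scores))]
```

For example, with the toy dynamics f(s, a) = s + a and v_hat(s) = -|s|, calling `act(3.0, [-1.0, 0.0, 1.0], ...)` picks the action that steers the state toward the origin.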
Oh yes. So, for the [01:18:28] purposes of this, you can assume you have a discretized action space, and it turns [01:18:33] out that for a self-driving car it's actually okay to discretize the action [01:18:36] space; for a helicopter we tend not to discretize the action space. But it turns [01:18:43] out, if f is a continuous function, then you can use other methods as well, right; [01:18:48] this is about optimizing over the action. I didn't mean to talk about this, so sorry, [01:18:51] this is getting a little bit deeper, but even if a were a continuous value, you can [01:18:57] actually use real-time optimization algorithms to very quickly try to [01:19:01] optimize this function, even as a function of the continuous action a. Actually, [01:19:04] there's a literature on something called model predictive control, where you can [01:19:08] actually do these optimizations in real time. And with that, final thoughts; last question.
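For that continuous-action case, a simple derivative-free search can stand in for the real-time optimizers the lecture alludes to. This coarse-to-fine bracketing is purely an illustrative sketch of my own construction (real model predictive control also plans over a horizon):

```python
import numpy as np

def act_continuous(s, f, v_hat, a_lo, a_hi, iters=30):
    """Pick a continuous action by numerically maximizing V(f(s, a)).

    Assumes the objective is unimodal in a over [a_lo, a_hi]; repeatedly
    evaluates a small grid and shrinks the bracket around the best action.
    """
    for _ in range(iters):
        grid = np.linspace(a_lo, a_hi, 9)
        scores = [v_hat(f(s, a)) for a in grid]
        best = int(np.argmax(scores))
        # shrink the bracket around the current best action
        a_lo = grid[max(best - 1, 0)]
        a_hi = grid[min(best + 1, len(grid) - 1)]
    return 0.5 * (a_lo + a_hi)
```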
you take an action and then your [01:19:39] you take an action and then your helicopter do something there'll be some [01:19:41] helicopter do something there'll be some wind your model may be off and so you [01:19:43] wind your model may be off and so you would then a tenth of a second later [01:19:45] would then a tenth of a second later take another you know GPS reading [01:19:47] take another you know GPS reading accelerometer reading magnetic compass [01:19:49] accelerometer reading magnetic compass reading and use the whole copper sensor [01:19:51] reading and use the whole copper sensor to tell you where you actually are no [01:19:53] to tell you where you actually are no cool okay cool all right I hope yeah [01:19:57] cool okay cool all right I hope yeah hopefully this was helpful I feel like [01:20:00] hopefully this was helpful I feel like you know the I think that's fascinating [01:20:01] you know the I think that's fascinating that the excitement that by myself [01:20:02] that the excitement that by myself driving cars and final hug calls and all [01:20:04] driving cars and final hug calls and all that it gives both down to equations [01:20:06] that it gives both down to equations like these though I think that's not [01:20:07] like these though I think that's not cool okay that's great thanks I'll see [01:20:09] cool okay that's great thanks I'll see you guys next week ================================================================================ LECTURE 019 ================================================================================ Lecture 19 - Reward Model & Linear Dynamical System | Stanford CS229: Machine Learning (Autumn 2018) Source: https://www.youtube.com/watch?v=0rt2CsEQv6U --- Transcript [00:00:04] okay hey everyone so welcome to the [00:00:08] okay hey everyone so welcome to the final week of the class um what I want [00:00:13] final week of the class um what I want to do today is share with you a few [00:00:15] to do today is share 
with you a few generalizations of reinforcement [00:00:18] generalizations of reinforcement learning and of mdps so you've learned [00:00:22] learning and of mdps so you've learned about the basic MVP formula zone of [00:00:24] about the basic MVP formula zone of states action stations info releases [00:00:26] states action stations info releases compactor and rewards the first thing [00:00:30] compactor and rewards the first thing you see today is to you know slight [00:00:33] you see today is to you know slight generalizations of this framework to [00:00:35] generalizations of this framework to state action rewards and to find the [00:00:36] state action rewards and to find the horizon MVPs that make it a little bit [00:00:39] horizon MVPs that make it a little bit easier for you to model certain types of [00:00:41] easier for you to model certain types of problems certain types of robots or [00:00:43] problems certain types of robots or certain types of factory automation [00:00:44] certain types of factory automation problems will be easier to model with [00:00:46] problems will be easier to model with these two small generalizations so talk [00:00:50] these two small generalizations so talk about those first and then second we'll [00:00:52] about those first and then second we'll talk about linear dynamical systems last [00:00:56] talk about linear dynamical systems last Wednesday you saw a fitted value [00:00:58] Wednesday you saw a fitted value iteration which was a way to solve for [00:01:03] iteration which was a way to solve for an MDP even when the state space may be [00:01:05] an MDP even when the state space may be infinite even when the state space is [00:01:07] infinite even when the state space is several numbers was RN so it's an [00:01:10] several numbers was RN so it's an infinite list of states or contingency [00:01:12] infinite list of states or contingency other states we use fitted value [00:01:14] other states we use fitted value iteration 
in which we're to use a [00:01:15] iteration in which we're to use a functional approximator right like [00:01:17] functional approximator right like linear regression to try to approximate [00:01:19] linear regression to try to approximate the value function there's one very [00:01:21] the value function there's one very important special case of an MDP where [00:01:24] important special case of an MDP where even if the state space is infinite of [00:01:27] even if the state space is infinite of continuous real numbers does that well [00:01:31] continuous real numbers does that well there's one important special case we [00:01:32] there's one important special case we can still compute the value function [00:01:35] can still compute the value function exactly without needing to use you know [00:01:38] exactly without needing to use you know like a linear function approximate or to [00:01:40] like a linear function approximate or to use something like linear regression in [00:01:41] use something like linear regression in the inner loop a fitted value iteration [00:01:43] the inner loop a fitted value iteration and so you also see that today and when [00:01:47] and so you also see that today and when you can take a robot or some factory [00:01:50] you can take a robot or some factory automation tools or whatever problem and [00:01:52] automation tools or whatever problem and model within this framework it turns out [00:01:54] model within this framework it turns out to be incredibly efficient because you [00:01:56] to be incredibly efficient because you can fit a continuous for the value [00:01:58] can fit a continuous for the value function as a function of the states [00:02:00] function as a function of the states without needing to approximate you can [00:02:03] without needing to approximate you can just compute the exact value function [00:02:04] just compute the exact value function even though the state space is [00:02:06] even though the state space is 
continuous so this is a framework that [00:02:09] continuous so this is a framework that doesn't apply to all problems but when [00:02:11] doesn't apply to all problems but when it does apply is incredibly convenient [00:02:13] it does apply is incredibly convenient gruffly efficient so you see that in a [00:02:16] gruffly efficient so you see that in a second half of today oh yes a 1:1 [00:02:21] second half of today oh yes a 1:1 tactical oh two two tactical things um [00:02:23] tactical oh two two tactical things um let's see from the questions that we're [00:02:26] let's see from the questions that we're getting from students um since they're [00:02:27] getting from students um since they're asking us oh how is grading and CSU's [00:02:29] asking us oh how is grading and CSU's you know and whatever I did well and [00:02:30] you know and whatever I did well and does you know didn't do so on that um [00:02:32] does you know didn't do so on that um for people taking a class pass/fail c- [00:02:36] for people taking a class pass/fail c- or better as a passing great this is [00:02:37] or better as a passing great this is quite I think there's a standard at [00:02:39] quite I think there's a standard at Stanford and I think sisters mignon has [00:02:43] Stanford and I think sisters mignon has historically been one of the heavy [00:02:44] historically been one of the heavy workload classes we know that people [00:02:46] workload classes we know that people taking sis you know I yeah I see a few [00:02:48] taking sis you know I yeah I see a few has nothing people King sisters end up [00:02:55] has nothing people King sisters end up you know putting a lot of work on this [00:02:56] you know putting a lot of work on this class maybe frankly more than average [00:02:58] class maybe frankly more than average for even Stanford courses and so we've [00:03:01] for even Stanford courses and so we've usually been quite nice with respect to [00:03:04] usually been quite nice with 
[00:03:06] So, just so you know, don't sweat it too much; do work hard, especially on the final projects, but don't sweat it too much. [00:03:17] Oh, and on Wednesday after class I had a funny question. After I talked about the fitted value iteration algorithm, a student came up to me and said, hey Andrew, you know, this algorithm you just told us about, does it actually work? Does it actually work on the autonomous helicopter? And the answer is yes: the algorithms I'm teaching, you know, if you do fitted value iteration as you learned last week, it will work for flying an autonomous helicopter at low speed. To fly very high speeds, very dynamic maneuvers, crazy things like flipping upside down, you need a bit more than that. But for flying a helicopter at low speeds, the exact algorithm that you learned last Wednesday, as well as the algorithms you'll learn today, including LQR, you know, if you ever actually need to fly an autonomous helicopter, all of these algorithms work decently well, work quite well, for flying a helicopter at low speeds. Maybe not at very, very high speeds and crazy dynamic maneuvers; at those speeds these algorithms, pretty much as I'm presenting them, won't work. So, okay.

[00:04:16] So the first generalization of the MDP framework that I want to describe is state-action rewards. So far we've had the reward be a function mapping from the states to the set of real numbers. State-action rewards, this is a slight modification to the MDP formalism, where now the reward function R is a function mapping from states and actions to the real numbers, R : S × A → ℝ. And so in an MDP you start off in state s0, take an action a0, then based on that you get to s1, take an action a1, get to state s2, take an action a2, and so on.
[00:05:15] And with state-action rewards, the total payoff is now written like this: R(s0, a0) + γ R(s1, a1) + γ² R(s2, a2) + ⋯. And this allows you to model that different actions may have different costs. For example, in the little robot wandering around the maze example, maybe it's more costly for the robot to move than to stay still. And so if you have an action for the robot to stay still, the reward can be, you know, zero for staying still, and a slight negative reward for moving, because you're burning fuel, because you're using electricity. [00:06:07] And so in that case Bellman's equation becomes this: V*(s) = max over a of [ R(s, a) + γ Σ_{s'} P_sa(s') V*(s') ], where you still break down the value of a state as a sum of the immediate reward plus, you know, the expected future rewards, but now the immediate reward you get depends on the action that you take in the current state, right? So this is Bellman's equation. [00:06:56] And notice that previously, you know, we had the max kind of over here, inside, but now you need to choose the action a that maximizes your immediate reward plus your discounted future rewards, which is why the max kind of moved, right? If you look at this equation, the max had to move outside, because now the immediate reward you get depends on the action you choose at this step in time as well. This models that different actions may have different costs. Yeah? [00:07:31] Oh, yes, yes, yes, the max applies to the entire expression, right, yeah. [00:07:53] Let's see, so in this formulation the reward is deterministic based on the state and action? Yes, that is correct. So in this formulation the reward depends on the current state and the current action, but not on the next state you get to, okay. Oh, and by the way, there are multiple variations of formulations of MDPs, but this is one convenient one.
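The Bellman equation for state-action rewards discussed above can be sketched as a single backup in code. Everything here, the state names, the probabilities, and the current value estimates, is a made-up toy, not from the lecture; the reward table just encodes the "staying still is free, moving costs a little" idea:

```python
gamma = 0.9                                   # discount factor
R_sa = {"stay": 0.0, "move": -0.1}            # R(s, a): moving costs electricity
P = {"stay": {"here": 1.0, "there": 0.0},     # P_sa(s'): transition probabilities
     "move": {"here": 0.2, "there": 0.8}}     # from the current state s
V = {"here": 0.0, "there": 5.0}               # current value estimates V(s')

# V*(s) = max_a [ R(s, a) + gamma * sum_{s'} P_sa(s') V(s') ]
# note the max is over the whole bracket, not just the future-reward sum
backup = max(R_sa[a] + gamma * sum(p * V[sp] for sp, p in P[a].items())
             for a in R_sa)
print(backup)  # "move" wins: -0.1 + 0.9 * (0.2*0.0 + 0.8*5.0) = 3.5
```

Because R depends on the action, the max cannot be pulled inside past the immediate-reward term, which is exactly why it sits outside the whole bracket.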
[00:08:17] And I guess it lets you model different costs per action. In flying a helicopter, a common formulation of this would be to say that yanking aggressively on the control stick should be assigned a higher cost, because yanking the control stick aggressively causes the helicopter to jerk around more, and so maybe you want to penalize that by setting a reward function that, you know, penalizes very aggressive maneuvers. So these are ways that this gives you, as the problem designer, sort of more flexibility, right? [00:08:55] And then finally, so I'm going to just write this on top: in this formulation, the optimal action, right, so in order to compute the value function you can still use value iteration, which is now, you know, V(s) is just updated as basically the right-hand side of Bellman's equation, V(s) := max_a [ R(s, a) + γ Σ_{s'} P_sa(s') V(s') ]. So the iteration works just fine for the state-action reward formulation as well. And if you apply value iteration until convergence to V*, then the optimal action is just the argmax: [00:09:49] right, so π*(s) is just the argmax over a of this same expression up on top. When you're in a given state, you want to choose the action that maximizes your immediate reward plus your expected future rewards, okay. [00:10:06] So, I think, just maybe another example: if you want to use an MDP to plan the shortest route for a robot, say to drive from here at Stanford up to San Francisco, right, then if it costs different amounts to drive on different road segments, because of traffic or because of the speed limits on different roads, then this allows you to say, well, driving this distance on this road costs this much in terms of fuel consumption, or in terms of time, and so on. [00:10:43] Or in factory maintenance: if you send in a team to maintain a machine, that has a certain cost.
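The value iteration loop and the argmax policy extraction described above can be sketched in NumPy. The two-state, two-action MDP below is invented purely for illustration (action 0 is "stay", action 1 is "move", with a small movement cost in state 0, echoing the maze-robot example):

```python
import numpy as np

n_states, n_actions = 2, 2
gamma = 0.9

# R[s, a]: state-action rewards (made-up numbers)
R = np.array([[0.0, -0.1],
              [1.0,  0.9]])
# P[s, a, s']: transition probabilities P_sa(s')
P = np.array([[[1.0, 0.0], [0.2, 0.8]],
              [[0.0, 1.0], [0.9, 0.1]]])

V = np.zeros(n_states)
for _ in range(500):                 # repeat the Bellman backup to convergence
    # Q[s, a] = R(s, a) + gamma * sum_{s'} P_sa(s') V(s')
    Q = R + gamma * P @ V
    V_new = Q.max(axis=1)            # max over actions sits outside the sum
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new

pi = (R + gamma * P @ V).argmax(axis=1)   # greedy policy pi*(s) = argmax_a
print(V, pi)
```

Here state 0 prefers to move (toward the rewarding state despite the -0.1 cost) and state 1 prefers to stay, so the extracted policy is `[1, 0]`; the discount factor makes the backup a contraction, so the loop converges regardless of initialization.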
[00:10:51] Versus if you do nothing, that has a different cost; but then if the machine breaks down, that has yet another cost. Okay, so that's the first generalization. [00:10:59] The second generalization is the finite-horizon MDP. [00:11:09] And in a finite-horizon MDP we're going to replace the discount factor γ with a horizon time T, and we'll just forget about the discount factor. And in the finite-horizon MDP, the MDP will run for a finite number of T steps. So you start in state s0, take an action a0, get to s1, take action a1, and so on, until you get to state sT, take an action aT at time step T, and then the world ends and we're done, right? And so the payoff is this finite sum, R(s0, a0) + R(s1, a1) + ⋯ + R(sT, aT), and, kind of, there's just a full stop at the end of that. Um, you can also apply discounting, but usually when you have a finite-horizon MDP maybe there's no need to apply discounting. And so this models a problem where there are, you know, T time steps, and then the world ends after that.
[00:12:25] Right, or, well, "the world ends" sounds a bit dire, but, you know, if you fly an airplane, or if you fly a helicopter, and you only have fuel for, you know, 30 minutes, right, an RC helicopter or whatever has 20, 30 minutes of fuel, then you know that you're going to run this thing for 30 minutes and then you're done. And so the goal is to accumulate as many rewards as possible up until you, you know, run out of fuel, and then you have to land, right? So that would be an example of a finite-horizon MDP. [00:12:59] And the goal is to maximize this payoff, or the expected payoff, over these T time steps, okay. [00:13:10] Now, one interesting property of a finite-horizon MDP is that the action you take may depend on what time it is on the clock, right? So there's a clock marching from, you know, t = 0 up to t = T, whereupon, right, the world ends; whereupon that's all the rewards the MDP is going to collect. And one interesting effect of this is that the optimal action may depend on what the time is on the clock. [00:13:51] So let's say your robot is running around this maze, and there's a small +1 reward here and a much larger +10 reward there, and let's say your robot is here, in between, right? Then the optimal action, whether you go left or go right, will depend on how much time you have left on the clock. If you have only, you know, two or three ticks left on the clock, it's better to just rush over and get the +1; but if you still have, you know, 10, 20 ticks left on the clock, then you should just go and get the +10 one. And so in this example π*(s) is not well-defined, because, well, the optimal action to take when your robot is here, in this position, whether you go left or whether you go right, actually depends on what time it is on the clock.
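The left-versus-right intuition can be checked numerically with backward induction over the remaining time. The one-dimensional corridor below is invented to stand in for the maze: a +1 reward sits in cell 0, a +10 reward in cell 6, the robot starts in cell 2, and (for simplicity) the reward depends only on the current cell:

```python
# Invented corridor: cell 0 holds a +1 reward, cell 6 a +10 reward.
r = [1, 0, 0, 0, 0, 0, 10]
n_s, T = 7, 10                      # number of cells, horizon
LEFT, RIGHT = 0, 1

def step(s, a):                     # deterministic move, clamped at the walls
    return max(0, min(n_s - 1, s - 1 if a == LEFT else s + 1))

# Backward induction: V[t][s] = best total payoff from time t onward.
V = [[0.0] * n_s for _ in range(T + 2)]
pi = [[LEFT] * n_s for _ in range(T + 1)]
for t in range(T, -1, -1):
    for s in range(n_s):
        qs = [r[s] + V[t + 1][step(s, a)] for a in (LEFT, RIGHT)]
        V[t][s] = max(qs)
        pi[t][s] = qs.index(max(qs))

# From cell 2, the optimal direction depends on the time on the clock:
print(pi[0][2], pi[8][2])   # plenty of time -> RIGHT; near the deadline -> LEFT
```

With ten ticks remaining the robot heads right for the +10; with only two ticks left the +10 is out of reach and it heads left for the +1, so the policy really is indexed by time.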
[00:14:45] And so π* in this example should be written instead as π*_t(s), with a subscript t, because the optimal action depends on what time t it is. [00:15:00] The technical term for this is that this is a non-stationary policy, and non-stationary means it depends on the time; it actually changes over time, right? Whereas, in contrast, up until now we've been saying, you know, π*(s) is the optimal policy; before this formalism, right, it was π*(s), and that was a stationary policy, and stationary means it does not change over time, okay. [00:15:45] So one thing that, um, I didn't quite prove, but that was implicit, was that the optimal action you take in the original formulation is the same action, right, no matter what time it is in the MDP. So in the original formulation that you saw last week, the optimal policy was stationary, meaning that the optimal policy is the same policy no matter what time it is; it doesn't change over time. Whereas in the finite-horizon MDP setting, the optimal policy, you know, the optimal action, changes over time, and so this is a non-stationary policy. So stationary versus non-stationary just means: does it change over time, or does it not change over time, okay? [00:16:25] And so if you're using a non-stationary policy anyway, you can also build an MDP with non-stationary transition probabilities and non-stationary rewards. [00:16:52] Actually, so maybe here's an example. Um, let's say you're driving from campus, from Palo Alto, to San Francisco, and we know that rush hour is at, what, like 5 p.m. or 6 p.m. or something, right? And maybe the weather forecast even says it's going to rain at 6 p.m. or something, right? So you know that the dynamics of how you drive your car from here to San Francisco will change over time, as in, the time it takes, you know, to drive on a certain segment of the road is a function of time. And if you want to build an MDP to solve for the best way to drive from here to San Francisco, say, then the state transitions, so s_{t+1}, is drawn from state transition probabilities indexed by the state at time t and the action at time t: s_{t+1} ~ P^(t)_{s_t, a_t}. And if these state transition probabilities change over time, then when you index them by the time t, this would be an example of non-stationary state transition probabilities, okay. [00:17:55] Or, alternatively, if you want non-stationary rewards, then you can have a superscript t, R^(t)(s_t, a_t), which is the reward you get for taking a certain action, for being in a certain state, at a certain time, okay. So all of these are different variations of MDPs.
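To make the time indexing concrete, here is a small sketch of simulating one trajectory under non-stationary dynamics; the array `P[t, s, a, s']`, its sizes, and the random numbers in it are all invented for illustration, and the random action stands in for whatever policy is being followed:

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_s, n_a = 4, 3, 2

# P[t, s, a, s']: transition probabilities that change with the time t
P = rng.random((T, n_s, n_a, n_s))
P /= P.sum(axis=3, keepdims=True)        # normalize so each row sums to 1

s, traj = 0, [0]
for t in range(T):
    a = int(rng.integers(n_a))           # stand-in for some policy's action
    s = int(rng.choice(n_s, p=P[t, s, a]))   # s_{t+1} ~ P^(t)_{s_t, a_t}
    traj.append(s)
print(traj)
```

The only difference from the stationary case is the extra leading `t` index when looking up the transition distribution; a non-stationary reward would get the same treatment, `R[t, s, a]` instead of `R[s, a]`.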
[00:18:17] And so maybe just a few examples of when you would want a finite-horizon MDP, or want to use non-stationary state transitions. So let's see: if you are flying an airplane, right, for some airplanes, for very large commercial airplanes, sometimes over a third of the weight of the airplane comes from the fuel, right. So actually, if you take a large commercial airplane, you know, when you take off from SFO and you fly to, oh, wherever you guys prefer to fly, say to London or something, by the time the plane lands it's a much lighter airplane than when you took off, because maybe, sometimes, like a third of the weight has disappeared, you know, because of burning fuel. And so the dynamics there, how an airplane flies between takeoff and landing, is actually different, because the weight is dramatically different. And so this would be one example of where the state transition probabilities change, in a pretty predictable way, right? [00:19:23] Or, right, I already mentioned weather forecasts, right, weather forecasts or forecasts of traffic for cars that would be driving here; or, yeah, if you're driving over different types of terrain over time, you know it's going to rain tomorrow, you know it's going to rain tonight and the ground will turn muddy, you know, and then all the traffic will turn bad. [00:19:54] And then on industrial automation, um, I have friends who work on industrial automation, and I think that maybe one example is: if you run a factory 24 hours a day, then the cost of labor, you know, getting people to come into the factory to do some work at noon is actually easier, right, and less costly, than getting someone to show up at the factory to do some work at 3:00 a.m., right? And so depending on labor availability over time, the cost of taking different actions, and the likelihood of transitioning to different states, the transition probabilities, can vary over the 24-hour clock as well, right? So these are other examples of when you can have a non-stationary policy and non-stationary state transitions, okay. [00:20:43] Now, um, let's talk about how you would actually solve a finite-horizon MDP. And I think, for the sake of simplicity, for the most part I'm going to not bother with non-stationary transitions and rewards; for the most part I'm going to forget about, you know, the fact that these could be varying. I mentioned it briefly, but I want to focus on the finite-horizon aspect. [00:21:11] So let me define the optimal value function. [00:22:03] So this, V*_t(s), is the optimal value function for time t, for starting in state s; so this is the expected total payoff starting in state s at time t, if you execute, you know, the best possible policy: V*_t(s) = E[ R(s_t, a_t) + R(s_{t+1}, a_{t+1}) + ⋯ + R(s_T, a_T) | s_t = s, π* ].
[00:22:27] So now the optimal value function depends on what time it is, because if you look at that example with the +1 reward on the left and the +10 reward on the right, depending on how much time you have left on the clock, the amount of reward you can accumulate can be quite different: if you have more time, then, you know, you have more time to get to the +10 reward, in the +1 and +10 reward example that I drew just now. [00:23:03] And so, um, in this setting, value iteration becomes the following; it actually becomes a dynamic programming algorithm, as you'll see in a second, okay. Which is that: [00:23:47] V*_t(s) is equal to the max over a of R(s, a) plus, and actually this is a question for you, so there's one missing thing here, right? So what this is saying is that the optimal value you can get when you start off in state s at time t is the max over all actions of the immediate reward, plus the sum over s' of the state transition probability P_sa(s') times V* of s', and then what should go in that box, the time subscript on V*? Okay, cool, awesome, great, right: it's t + 1, so V*_t(s) = max_a [ R(s, a) + Σ_{s'} P_sa(s') V*_{t+1}(s') ]. [00:24:38] And then π*_t(s) is just, you know, the argmax over a, right, of the same thing, of this whole expression up on top. [00:25:01] And so this formula defines V*_t as a function of V*_{t+1}, so this is like the iterative step, right: given V*_{t+1}, compute V*_t. [00:25:15] And so to start this off, there's just one last thing we need to define, which is the final time step, capital T.
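Putting the recursion together with its final-step base case V*_T(s) = max_a R(s, a), the whole dynamic program fits in a few lines. The MDP below (its sizes and its random rewards and transitions) is made up for illustration; note there is no γ, since the finite-horizon formulation drops discounting:

```python
import numpy as np

rng = np.random.default_rng(1)
n_s, n_a, T = 3, 2, 5
R = rng.standard_normal((n_s, n_a))          # R[s, a], no discounting
P = rng.random((n_s, n_a, n_s))
P /= P.sum(axis=2, keepdims=True)            # each P[s, a, :] sums to 1

V = np.zeros((T + 1, n_s))
pi = np.zeros((T + 1, n_s), dtype=int)

V[T] = R.max(axis=1)                         # base case: V*_T(s) = max_a R(s, a)
pi[T] = R.argmax(axis=1)

for t in range(T - 1, -1, -1):               # backwards from T-1 down to 0
    Q = R + P @ V[t + 1]                     # Q[s, a] = R(s,a) + sum_{s'} P_sa(s') V*_{t+1}(s')
    V[t] = Q.max(axis=1)
    pi[t] = Q.argmax(axis=1)                 # policy is indexed by t: non-stationary

print(V[0], pi[0])
```

One backward sweep computes every V*_t and every π*_t exactly, in O(T · |S|² · |A|) time; there is no "iterate until convergence" as in the infinite-horizon case.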
if you start off at [00:25:45] after that right so if you start off at state as at the final time step T then [00:25:48] state as at the final time step T then you get to take an action and you get [00:25:51] you get to take an action and you get immediate reward and then there is no [00:25:53] immediate reward and then there is no next day because the world just ends [00:25:54] next day because the world just ends right after that step which is why the [00:25:58] right after that step which is why the auto value at time T is just max over a [00:26:01] auto value at time T is just max over a at the immediate reward because what [00:26:03] at the immediate reward because what happens after that doesn't matter okay [00:26:05] happens after that doesn't matter okay so this is a dynamic programming [00:26:09] so this is a dynamic programming algorithm in which this algorithm does [00:26:13] algorithm in which this algorithm does step on top defines you allows you to [00:26:16] step on top defines you allows you to compute V saw of T and then the [00:26:19] compute V saw of T and then the inductive step or the n plus 1 step I [00:26:21] inductive step or the n plus 1 step I guess is if you then having computed V [00:26:24] guess is if you then having computed V Star of T for every state s right so you [00:26:27] Star of T for every state s right so you know so you compute this for every state [00:26:28] know so you compute this for every state that's having done this you can then [00:26:30] that's having done this you can then compute V star t minus 1 using this [00:26:34] compute V star t minus 1 using this inductive step then it's not t minus 2 [00:26:37] inductive step then it's not t minus 2 and so on down to V star of 0 so you [00:26:41] and so on down to V star of 0 so you compute this for every state and then [00:26:43] compute this for every state and then based on these you can compute no sorry [00:26:46] based on these you can compute no sorry it's PI star of 
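As a concrete sketch of the dynamic-programming recursion just described: the toy MDP below runs the base case V*_T(s) = max_a R(s, a) and then the backward inductive step down to V*_0. The particular states, actions, transition probabilities, and rewards are all made up for illustration, not from the lecture's board.

```python
import numpy as np

# Hypothetical toy MDP: 3 states, 2 actions, horizon T = 5.
# P[a][s][s'] = state transition probability P_sa(s'); R[s][a] = immediate reward.
n_states, n_actions, T = 3, 2, 5
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],   # action 0
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]],   # action 1
])
R = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 0.5]])

V = np.zeros((T + 1, n_states))          # V[t][s] = V*_t(s)
pi = np.zeros((T + 1, n_states), dtype=int)

# Base case: V*_T(s) = max_a R(s, a).
V[T] = R.max(axis=1)
pi[T] = R.argmax(axis=1)

# Inductive step: V*_t(s) = max_a [ R(s,a) + sum_s' P_sa(s') V*_{t+1}(s') ].
for t in range(T - 1, -1, -1):
    Q = R + np.einsum('aij,j->ia', P, V[t + 1])   # Q[s][a]
    V[t] = Q.max(axis=1)
    pi[t] = Q.argmax(axis=1)                      # pi*_t(s) = argmax_a Q[s][a]

print(V[0])   # optimal values with all T steps remaining
```

Note that the policy pi*_t is stored per time step, since the optimal policy in the finite-horizon setting is non-stationary.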
[00:26:46] That is, you compute pi*_t, the optimal policy, the non-stationary policy, for every state, as a function of both the state and the time. Okay. [00:27:04] And again, I don't want to dwell on this, but if you want to work with non-stationary state transition probabilities or non-stationary rewards, then this algorithm hardly changes: if your rewards and state transition probabilities are indexed by time as well, then this is just a very small modification to the algorithm. And it turns out that once you're using a finite-horizon MDP, making the rewards and state transition probabilities non-stationary is just a small tweak, right? [00:27:48] Okay, a question? Oh, non-stationary; so in the end you get a policy pi*, subscript t, of s. Oh, I see, sure, yes: is this a non-stationary policy? Yes, so the optimal policy will be a non-stationary policy, yes. I think I was using the subscript t on pi* to denote that it can be a function of time. Yes, awesome, thank you. [00:28:39] Right, and if you take capital T to infinity, this just becomes the usual value iteration, so the two frameworks are closely related; you can see the relationship between this and value iteration. One problem with taking capital T to infinity is that the values become unbounded, right? And that's actually one of the reasons why we use a discount factor: when you have an infinite-horizon MDP, when the reward goes on forever, one of the things the discount factor does is make sure that the value function doesn't grow without bound. [00:29:23] Right, and in fact, you know, if the rewards are bounded by some R_max, then when you
use discounting, then V, you know, is bounded by, I guess, R_max over (1 - gamma); it's the sum of a geometric series. But with a finite horizon T, because you only add up T rewards, it can't get bigger than T times R_max. [00:30:19] Hmm, let me think. So I think, you know, what you find is, let's see: [00:30:31] actually, let me just draw a 1D example, just to make life simpler, right? So let's say there's a +10 reward there and a +1 reward there. If you look at the optimal value function, it depends on what time it is. And let's say the dynamics are deterministic, right, so there's no noise. Then if you have two time steps left, I guess V* would be, you know, 10, 10, 10, 1, 1, 1, 0, 0, right? And so it depends on where you are. In fact, I guess if you're here, there's nothing you can do, right, this cell can't get to either reward in time; but depending on whether you're here, or here, or here, the optimal action will change, and that's what you compute with this pi*. Does this make sense? Okay, yeah, maybe the interesting cases are here and there. [00:31:29] If you actually built a little, you know, grid simulator and used these equations to compute pi* and V*, you would see that the optimal policy, when you have lots of time, will be this: wherever you are, go for the +10 reward. But when the clock runs down, the optimal policy will end up being a mix of go left and go right. All right, cool. All right.
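If you want to try the little grid simulator just mentioned, here is a hypothetical 1D version: a strip of 8 cells with a +1 reward for arriving at the left end and +10 for arriving at the right end, deterministic left/right moves, and a finite horizon. All of those specifics are assumptions for illustration, not the exact board example.

```python
# Hypothetical 1D strip: 8 cells, reward +1 for arriving at cell 0,
# +10 for arriving at cell 7; deterministic moves, clamped at the edges.
n, T = 8, 12
r = [0.0] * n
r[0], r[7] = 1.0, 10.0

def step(s, a):            # a = -1 (left) or +1 (right)
    return min(n - 1, max(0, s + a))

# Backward recursion (deterministic case):
#   V[t][s] = max_a [ r(s') + V[t+1][s'] ]  with  s' = step(s, a).
V = [[0.0] * n for _ in range(T + 1)]
pi = [[0] * n for _ in range(T + 1)]
for t in range(T - 1, -1, -1):
    for s in range(n):
        q = {a: r[step(s, a)] + V[t + 1][step(s, a)] for a in (-1, +1)}
        pi[t][s] = max(q, key=q.get)
        V[t][s] = q[pi[t][s]]

print(pi[0][1], pi[T - 1][1])   # lots of time vs. one step left, from cell 1
```

From cell 1, the computed policy goes right (toward the +10) when there is plenty of time, but goes left (toward the +1) when only one step remains, which is exactly the non-stationary behavior described above.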
[00:32:22] So the last thing I want to show you today is the linear quadratic regulator, LQR. [00:32:38] As I was saying at the start, LQR applies only in a relatively small set of problems, but whenever it applies, this is a great algorithm, and I'd, you know, use it whenever it seems feasible to apply, because it is very efficient and sometimes gives very good control policies. So let's see. [00:33:00] LQR applies in the following setting. In order to specify an MDP, we need to specify the states, the actions, the state transition probabilities, the horizon, and the rewards. I'm going to use the finite-horizon formulation, so capital T; this also works with the discounted MDP formalism, but it will be a little bit easier, a little bit more convenient, to develop in the finite-horizon setting, so let me just use that today. And LQR applies under a specific set of circumstances, which is that the set of states is R^n and the set of actions is R^d. So to specify the state transition probabilities, we need to tell you the distribution of the next state given the previous state and action, and I'm going to say that the way s_{t+1} evolves is as a linear function: some matrix A times s_t, plus some matrix B times a_t, plus some noise, so s_{t+1} = A s_t + B a_t + w_t. [00:34:16] And sorry, there's a little bit of notation overloading here, and sorry about that: A is both the set of actions and this matrix A, right? So there are two separate things with the same symbol. I think a lot of the ideas of LQR came from traditional control, from, I guess, EE and mechanical engineering, while a lot of the ideas of reinforcement learning came from computer science. So these two literatures kind of evolved separately, and when the literatures merged, you end up with clashing notation: CS people use A to denote the set of actions, and, you know, the mechanical engineering and EE people use A to denote this matrix, and when we merged these two literatures the notation ended up being overloaded. Okay. [00:35:09] Oh, and then, it turns out, one thing we'll see later is that this noise term is actually not super important. But for now, let's just assume that the noise term is distributed Gaussian, with mean zero and some covariance Sigma subscript w, so w_t ~ N(0, Sigma_w). Okay, but we'll see later that the noise will be less important than you think. [00:35:35] And so this matrix A is going to be in R^{n x n}, and this matrix B is going to be in R^{n x d}, where n and d are, respectively, the dimension of the state space and the dimension of the action space. So for driving a car, for example, we saw last time that maybe the state space is six-dimensional: if you're driving a car, the state is (x, y, theta, x-dot, y-dot, theta-dot), and the action space is, you know, steering control, so maybe a is two-dimensional, right: acceleration and steering. [00:36:18] Okay, so let's see: to specify an MDP we need to specify this five-tuple, and we've now specified three of the elements.
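The linear-Gaussian dynamics s_{t+1} = A s_t + B a_t + w_t can be sketched in a few lines. The particular A, B, Sigma_w, and dimensions below are made up for illustration.

```python
import numpy as np

# Hedged sketch of the LQR dynamics s_{t+1} = A s_t + B a_t + w_t,
# with w_t ~ N(0, Sigma_w). Dimensions n = 2 (state), d = 1 (action)
# and the matrix entries are assumptions, not from the lecture.
rng = np.random.default_rng(0)
n, d = 2, 1
A = np.array([[1.0, 0.1],        # A is n x n
              [0.0, 1.0]])
B = np.array([[0.0],             # B is n x d
              [0.1]])
Sigma_w = 0.01 * np.eye(n)       # noise covariance

def dynamics(s, a):
    """One step of the linear dynamics with Gaussian noise."""
    w = rng.multivariate_normal(np.zeros(n), Sigma_w)
    return A @ s + B @ a + w

s = np.zeros(n)
a = np.array([1.0])
s_next = dynamics(s, a)
print(s_next.shape)   # (2,)
```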
[00:36:27] The fourth one, T, is just some number, right, so that's easy. And then the final assumption we need for LQR to apply is that the reward function has the following form: the reward is R(s, a) = -(s^T U s + a^T V a), where U is n x n, V is d x d, and U and V are positive semi-definite. Okay, so these are matrices, and "greater than or equal to zero" here means positive semi-definite. [00:37:23] So the fact that U and V are positive semi-definite implies that s^T U s >= 0 and, sorry, a^T V a is also greater than or equal to zero, so the reward is always nonpositive. [00:37:43] So here's one example. Say you want to fly an autonomous helicopter, and you want, you know, the state vector to be close to zero, where the state vector captures position, orientation, velocity, and angular velocity; so if you want the helicopter to just hover in place, then maybe you want the state to be, you know,
regulated, or controlled, near some zero position. And so if you choose U equal to the identity matrix and V also equal to the identity matrix (these will be different dimensions, right: this would be an n x n identity matrix and that a d x d identity matrix), then R(s, a) ends up equal to -(||s||^2 + ||a||^2). [00:38:44] And so this allows you to specify a reward function that penalizes, you know, with a quadratic cost function, the state deviating from zero, and also the actions deviating from zero, which penalizes very large, jerky motions on the control sticks. Or if we set V equal to zero, then the second term goes away. Okay, so these are some of the cost functions you can specify in terms of a quadratic cost function.
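The quadratic reward above is a one-liner to implement. This sketch uses made-up dimensions and the hover example's choice U = I, V = I, so that R(s, a) = -(||s||^2 + ||a||^2).

```python
import numpy as np

# Hedged sketch of the quadratic reward R(s, a) = -(s^T U s + a^T V a).
# The dimensions n = 3, d = 2 are assumptions for illustration.
n, d = 3, 2
U = np.eye(n)                    # n x n, positive semi-definite
V = np.eye(d)                    # d x d, positive semi-definite

def reward(s, a):
    return -(s @ U @ s + a @ V @ a)

s = np.array([1.0, -2.0, 0.0])
a = np.array([0.5, 0.0])
print(reward(s, a))              # -(1 + 4 + 0.25) = -5.25
```

Since U and V are positive semi-definite, this reward is always at most zero, and it is maximized (zero) exactly at the hover state s = 0, a = 0.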
[00:39:17] Now again, you know, just so that you can see the generalization: if you want non-stationary dynamics, this model is quite simple to change, where you can say the matrices A and B depend on the time t, and you can also say the matrices U and V depend on the time t. So if you have non-stationary state transition probabilities or a non-stationary cost function, that's how you would modify this, but I won't use this generalization for today. [00:40:20] Now, the two key assumptions of the LQR framework are, first, that the state transition dynamics, the way your state changes, is a linear function of the previous state and action, plus some noise; and second, that the reward function is a quadratic cost function. Right, so these are the two key assumptions. And so, first, you know: where do you get the matrices A and B? One thing that we talked about on Wednesday already (and again, this will actually work if you're trying to apply LQR to fly an autonomous helicopter; this will work for a helicopter flying at low speeds) is to fly the helicopter around: you know, start at some state s_0, take an action a_0, get to state s_1, and do this until you get to s_T, right? And that was the first trial, and then you do this M times. So, as we talked about on Wednesday, fly the helicopter through M trajectories of T time steps each, and then we know that we want s_{t+1} to be approximately A s_t + B a_t, and so you can minimize: [00:42:08] right, we want the left and right hand sides to be close to each other, so you could, you know, minimize the squared difference between the left hand side and the right hand side, in a procedure a lot like linear regression, in order to fit the matrices A and B. So if you actually fly a helicopter around, collect this type of data, and fit this model to it, this will work.
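The fitting procedure described above (minimize the squared difference between s_{t+1} and A s_t + B a_t over all recorded steps, a procedure a lot like linear regression) can be sketched as follows. The "true" A and B, the noise level, and the trajectory counts here are made up so the sketch can generate its own data.

```python
import numpy as np

# Hedged sketch: fit A and B by least squares from simulated trajectories.
rng = np.random.default_rng(1)
n, d, T, M = 2, 1, 50, 10
A_true = np.array([[1.0, 0.1], [0.0, 0.95]])   # assumed, for data generation
B_true = np.array([[0.0], [0.1]])

X, Y = [], []                    # regressors [s_t; a_t] and targets s_{t+1}
for _ in range(M):               # M trajectories of T steps each
    s = rng.normal(size=n)
    for _ in range(T):
        a = rng.normal(size=d)
        s_next = A_true @ s + B_true @ a + 0.01 * rng.normal(size=n)
        X.append(np.concatenate([s, a]))
        Y.append(s_next)
        s = s_next

# Solve min || X theta - Y ||^2; theta.T stacks [A B] side by side.
theta, *_ = np.linalg.lstsq(np.array(X), np.array(Y), rcond=None)
A_hat, B_hat = theta.T[:, :n], theta.T[:, n:]
print(np.round(A_hat, 2))
```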
model for the dynamics of a helicopter at those speeds. Okay, so that's one way to do it. [00:42:57] So, let's see, method one is to learn it, right? A second method is to linearize a nonlinear model. So, um, let me just describe the idea at a high level, and I think for this it might be useful to think of the inverted pendulum, right? So that was, you know: imagine you have an inverted pendulum, a cart with a long vertical pole that you're trying to keep balanced. So for an inverted pendulum like this, if you download an open-source physics simulator, or if you have a friend with, you know, a physics degree help you derive the Newtonian mechanics equations for this (let's see, I actually tried to work through the physics equations for the inverted pendulum once; it's pretty complicated), then you might have a function that tells you: if the state is a certain position and orientation of the pole plus the angular velocity, and you apply a certain acceleration (the actions are accelerate left or accelerate right), then, you know, one tenth of a second later the state will be this, right? So your physics friend can help you derive this equation, and then maybe plus noise; well, no, let's just ignore the noise for now. And so what you have is a function [00:45:06] that maps from the state (x, x-dot, theta, theta-dot), that's the position of the cart and the angle of the pole and the velocities and angular velocities; it maps from the current state at time t, excuse me, comma a_t, right, it maps from, I guess, the current state vector and the current action to the next state vector: s_{t+1} = f(s_t, a_t). Okay. So, um, let's see what linearization means, and I'm going to use a 1D example, because I can only draw on a flat board, right; because of the two-dimensional nature of the whiteboard, I'm just going to suppose that you have s_{t+1} = f(s_t), and let me just ignore the action for now, so I have one input and one output and I can draw this more easily on the whiteboard. [00:46:06] So if you have some function like this, where the x-axis is s_t and the y-axis is s_{t+1}, and this is the function f (we'll plug the action back in later), what the linearization process does is you pick a point, and I'm going to call this point s-bar_t, and we're going to, [00:46:34] you know, take the derivative of f and fit a straight line (I draw straight lines really not very well): take the tangent straight line at this point s-bar_t, and we're going to use this green straight line to approximate the function. Okay. And so if you look at the equation for the green straight line: the green straight line is a function mapping from s_t to s_{t+1}, and s-bar is the point around which you're linearizing the function, so s-bar is a constant, and this function is actually defined by s_{t+1} is approximately f'(s-bar_t) times (s_t - s-bar_t), plus f(s-bar_t). Okay. And so s-bar_t is a constant, and this equation expresses s_{t+1} as a linear function of s_t; think of s-bar_t as a fixed number, right, it doesn't vary. So given some fixed s-bar, this equation here is actually the equation of the green straight line, which says, you know: if you use the green straight line to approximate the function f, this tells you what s_{t+1} is as a function of s_t, and this is a linear, affine, relationship between s_{t+1} and s_t. Okay, so that's how you linearize a function.
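The 1D linearization s_{t+1} ≈ f'(s-bar)(s_t - s-bar) + f(s-bar) can be checked numerically. The dynamics function f below is made up for illustration, and the derivative is taken by finite differences rather than by hand.

```python
import math

# Hedged 1D sketch: approximate s_{t+1} = f(s_t) by the tangent line at a
# linearization point s_bar:  f(s) ~= f'(s_bar) * (s - s_bar) + f(s_bar).
def f(s):
    # Made-up nonlinear 1D "dynamics" for illustration.
    return math.sin(s) + 0.9 * s

def linearize(f, s_bar, eps=1e-6):
    """Return (slope, intercept) of the tangent-line approximation at s_bar."""
    slope = (f(s_bar + eps) - f(s_bar - eps)) / (2 * eps)   # numerical f'
    return slope, f(s_bar) - slope * s_bar

slope, intercept = linearize(f, s_bar=0.0)

def approx(s):
    return slope * s + intercept

print(approx(0.1), f(0.1))   # close near s_bar, diverging farther away
```

Near the linearization point the green-line approximation tracks f closely; far away the gap grows, which is exactly the picture on the whiteboard.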
[00:48:41] In the more general case, s_{t+1} is actually a function of both s_t and a_t; I'll write out that form in a second. In this example, s̄_t is usually chosen to be a typical value of the state. [00:49:20] In particular, if you expect your helicopter to be doing a pretty good job hovering near the state 0, then it would be pretty reasonable to choose s̄_t to be the vector of all zeros, because if you look at how good the green line is as an approximation of the blue line in a small region like this, the green line is actually pretty close to the blue line. And so if you choose s̄ to be the place where you expect your helicopter to spend most of its time, then the green line is not too bad an approximation of the true function, the true physics. [00:49:52] Or for the inverted pendulum: if you expect that your inverted pendulum will spend most of its time with the pole upright and the velocity not too large, then you'd choose s̄ to be maybe the zero vector, and so long as your inverted pendulum spends most of its time close to the zero state, the green line is not too bad an approximation of the blue line. So this is an approximation, but you try to choose the linearization point so that in this little region it's actually not that bad an approximation; it's only when you go really far away that there's a huge gap between the linear approximation and the true function. [00:50:32] Okay. And so in the more general case, where f is a function of both the state and the action, the input now becomes (s_t, a_t), because f maps from (s_t, a_t) to s_{t+1}.
[00:50:57] Then, instead of choosing just s̄, you choose a pair (s̄_t, ā_t), a typical state and action, around which you linearize the function. Let me just write down the formula for that. [00:51:29] If you linearize around the point given by s̄_t and ā_t, the typical values, then the form that you have is

s_{t+1} ≈ f(s̄_t, ā_t) + ∇_s f(s̄_t, ā_t) (s_t - s̄_t) + ∇_a f(s̄_t, ā_t) (a_t - ā_t).

[00:52:17] This is the generalization of the 1-D formula we wrote down just now. It says that the next state is approximately the value at the point around which you linearize, plus the gradient with respect to s times how much the state differs from the linearization point, plus the gradient with respect to the action times how much the action varies from ā. And this generalizes the equation we wrote earlier. [00:52:52] So this equation expresses s_{t+1} as a linear function, or technically an affine function, of the previous state and the previous action, with some matrices in between.
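As a sketch of this general case, the Jacobians ∇_s f and ∇_a f can be estimated column by column with finite differences; the pendulum-like dynamics here are illustrative values, not the lecture's model:

```python
import numpy as np

# Sketch: linearize a nonlinear simulator f(s, a) -> s_next around (s_bar, a_bar):
#   s_{t+1} ≈ f(s_bar, a_bar) + A (s_t - s_bar) + B (a_t - a_bar),
# where A = ∇_s f and B = ∇_a f, estimated by central finite differences.
def jacobians(f, s_bar, a_bar, eps=1e-5):
    n, d = len(s_bar), len(a_bar)
    A = np.zeros((n, n))
    B = np.zeros((n, d))
    for i in range(n):
        ds = np.zeros(n); ds[i] = eps
        A[:, i] = (f(s_bar + ds, a_bar) - f(s_bar - ds, a_bar)) / (2 * eps)
    for j in range(d):
        da = np.zeros(d); da[j] = eps
        B[:, j] = (f(s_bar, a_bar + da) - f(s_bar, a_bar - da)) / (2 * eps)
    return A, B

# Toy pendulum-like dynamics, made up for illustration.
def f(s, a):
    theta, omega = s
    return np.array([theta + 0.1 * omega,
                     omega + 0.1 * (np.sin(theta) + a[0])])

s_bar, a_bar = np.zeros(2), np.zeros(1)   # linearize about the upright state
A, B = jacobians(f, s_bar, a_bar)
```

Choosing (s̄, ā) where the system spends most of its time, here the upright zero state, is exactly the point the lecture makes about when this approximation is good.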
[00:53:08] And from this, after some algebraic manipulation, you can re-express this as s_{t+1} = A s_t + B a_t. There is just one other little detail, which is that you might need to redefine s_t to add an intercept term, because this is an affine function, with an intercept term, rather than a purely linear function. So from this formula, with a little bit of algebraic manipulation, you should really be able to figure out what the matrices A and B are; you might need to add an intercept term to the state, but since this is just an affine function, you can rewrite it in terms of matrices all the same. [00:54:00] All right, so I hope that makes sense: this linearization expresses s_{t+1} as a linear function of s_t and a_t. This is just a linear system; the way s_{t+1} varies is some matrix times s_t plus some matrix times a_t, and that's why, with some massaging, you can get it into this form for some matrices A and B. But because there are some constants floating around as well, you might need an extra intercept term multiplied into A to give you that extra constant. [00:54:39] So where we are: we now have that for these MDPs, either by learning a linear model with the matrices A and B, or by taking a nonlinear model and linearizing it like you just saw, you can hopefully model your MDP as a linear dynamical system, meaning that s_{t+1} is this linear function of the previous state and action, hopefully with a quadratic reward function, exactly in the form we saw just now. [00:55:18] So let me just summarize the problem we want to solve: s_{t+1} = A s_t + B a_t + w_t, where w_t is a noise term, and the reward is R(s_t, a_t) = -s_tᵀ U s_t - a_tᵀ V a_t, with U and V positive semi-definite.
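The intercept trick mentioned above, absorbing the affine constant by appending a 1 to the state, can be sketched like this (all matrices are made-up examples):

```python
import numpy as np

# The affine update s_{t+1} = A s_t + B a_t + c becomes purely linear if we
# append a constant 1 to the state (illustrative sketch):
#   [s_{t+1}; 1] = [[A, c], [0, 1]] [s_t; 1] + [B; 0] a_t
def augment(A, B, c):
    n, d = A.shape[0], B.shape[1]
    A_aug = np.block([[A, c.reshape(-1, 1)],
                      [np.zeros((1, n)), np.ones((1, 1))]])
    B_aug = np.vstack([B, np.zeros((1, d))])
    return A_aug, B_aug

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
c = np.array([0.0, -0.05])          # constant left over from the linearization point
A_aug, B_aug = augment(A, B, c)

s, a = np.array([0.2, -0.1]), np.array([0.3])
s_aug = np.append(s, 1.0)
assert np.allclose(A_aug @ s_aug + B_aug @ a, np.append(A @ s + B @ a + c, 1.0))
```

The last row of the augmented system just copies the constant 1 forward, so the augmented dynamics stay in the s_{t+1} = A s_t + B a_t form the rest of the derivation assumes.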
[00:55:45] And this is a finite-horizon MDP, so the total payoff is R(s_0, a_0) + R(s_1, a_1) + ... + R(s_T, a_T). [00:56:16] So let's derive a dynamic programming algorithm for this. The remarkable property of LQR, and what makes it so useful, is that if you are willing to model your MDP using this set of equations, then the value function is a quadratic function. Let me show you what I mean: if your MDP can be modeled as this type of linear dynamical system with a quadratic cost function, then it turns out that V* is a quadratic function, and so you can compute V* exactly. [00:56:56] We're going to develop a dynamic programming algorithm to compute the optimal value function V*, similar to what we did earlier today for the finite-horizon MDP with a finite set of states. It starts with the final time step and works backwards. [00:57:19] So V*_T(s_T) = max over a_T of R(s_T, a_T), which is max over a_T of -s_Tᵀ U s_T - a_Tᵀ V a_T. [00:57:45] Now a_Tᵀ V a_T is always greater than or equal to 0, because V is positive semi-definite, so the term -a_Tᵀ V a_T is never positive, and the optimal action is actually just to choose the action zero; the max over a_T is equal to -s_Tᵀ U s_T. And so this also tells us that π*_T, the final action, is the argmax: the optimal action is to choose the vector of zero actions at the last time step. [00:58:25] So this is the base case for the dynamic programming step of value iteration, where the optimal value at the last time step is to choose the action that maximizes the immediate reward.
[00:58:42] And that is maximized by choosing the action zero at the last time step. Okay. [00:59:09] Now, the key step of the dynamic programming implementation is the following: suppose that V*_{t+1}(s_{t+1}) is equal to a quadratic function. [01:00:03] (To the question: yes, without the minus sign that term is nonnegative, but you only get to maximize with respect to a_T, so the best you can do for that term is to drive it to zero. Thank you.) All right, now for the inductive case. We want to go from V*_{t+1} to computing V*_t, and the key observation that makes LQR work is this: let's suppose that V*_{t+1}, the optimal value function at the next time step, is a quadratic function. In particular, suppose V*_{t+1}(s_{t+1}) = s_{t+1}ᵀ Φ_{t+1} s_{t+1} + Ψ_{t+1}, parameterized by some matrix Φ_{t+1}, which is an n-by-n matrix, and some constant offset Ψ_{t+1}, which is a real number. [01:01:08] What we'll be able to show is that if this is true for V*_{t+1}, then after one step of dynamic programming, as you go from V*_{t+1} down to V*_t, the optimal value function V*_t is also going to be a quadratic function with the very same form, just with t+1 replaced by t. [01:01:36] And so in the dynamic programming step, we update V*_t(s_t) = max over a_t of R(s_t, a_t) plus, well, previously, when we had a discrete state space, we had a sum over s', or really over s_{t+1}, of P(s_{t+1} | s_t, a_t) times V*_{t+1}(s_{t+1}), and we were summing over the states.
[01:02:24] But now that we have a continuous state space, that formula becomes an expected value, with respect to s_{t+1} drawn from the state transition probabilities P(s_{t+1} | s_t, a_t), of V*_{t+1}(s_{t+1}). So the optimal value when the clock is at time t is: choose the action a_t that maximizes the immediate reward plus the expected value of your future rewards once the clock has ticked from time t to time t+1 and you are in state s_{t+1} at time t+1. [01:03:31] So let's see. This is a pretty beefy piece of algebra to do; I feel like showing the full result is at the level of complexity of a typical CS229 homework problem, which is quite hard. But let me just show the outline of the derivation and why the inductive step works, and if you want, you can work through the algebra details yourself at home. [01:04:24] So V*_t(s_t) is equal to the max over a_t of the immediate reward, -s_tᵀ U s_t - a_tᵀ V a_t, plus the expected value, with s_{t+1} drawn from a Gaussian with mean A s_t + B a_t and covariance Σ_w, of the quadratic term s_{t+1}ᵀ Φ_{t+1} s_{t+1} + Ψ_{t+1}. Remember, s_{t+1} = A s_t + B a_t + w_t, where w_t is Gaussian with mean 0 and covariance Σ_w; so if you choose an action a_t, that Gaussian is the distribution of the next state at time t+1. And the quadratic term inside the expectation is what, in the inductive hypothesis, we assumed V* to be at the next time step. [01:05:55] So this is a quadratic function, and the expectation is the expected value of a quadratic function with respect to s drawn from a Gaussian with a certain mean and a certain covariance.
[01:06:18] It turns out that this whole expression that I just circled simplifies into one big quadratic function of the action a_t. [01:06:50] And so in order to derive the optimal action, to derive π*, you take this big quadratic function, take derivatives with respect to a_t, set them to 0, and solve for a_t. If you go through all that algebra, you end up with the following formula: a_t = L_t s_t, where L_t = (V - Bᵀ Φ_{t+1} B)⁻¹ Bᵀ Φ_{t+1} A; I'm going to take that big matrix and denote it L_t. [01:07:54] And so this also shows that π*_t(s_t) = L_t s_t. So one takeaway from this is that under the assumptions we've made, a linear dynamical system with a quadratic cost function, the optimal action is a linear function of the state s_t. [01:08:44] And this is not a claim made through function approximation. I'm not saying that you can fit a straight line to the optimal action and that, if you fit a straight line, you get this linear function; that's not what we're saying. We're saying that of all the functions anyone could possibly come up with in the world, linear or nonlinear, the best possible action is linear. There is no approximation here. It's just a fact that if you have a linear dynamical system with a quadratic reward, the best possible action at any state is going to be a linear function of that state; notice that we haven't approximated anything. [01:09:42] Let me write this here. The other step is that if you take the optimal action and plug it into the definition of V*, then by simplifying, which again is quite a lot of algebra, you end up with these equations, where again I'll just write out the formulas: Φ_t = Aᵀ(Φ_{t+1} - Φ_{t+1} B (Bᵀ Φ_{t+1} B - V)⁻¹ Bᵀ Φ_{t+1}) A - U, and Ψ_t = tr(Σ_w Φ_{t+1}) + Ψ_{t+1}. [01:11:03] Okay. [01:11:20] All right.
[01:11:20] So to summarize the whole algorithm, let's put everything together. What these two equations do is allow you to go from V*_{t+1}, which is defined in terms of Φ_{t+1} and Ψ_{t+1}, and recursively work backwards to figure out what V*_t is. So Φ_t depends on Φ_{t+1}, and Ψ_t depends on Φ_{t+1} and Ψ_{t+1}. And this Σ_w is the covariance of w_t: that's a Sigma with a subscript w, not a summation over w; it's the covariance matrix of the noise terms we were adding at every step of the linear dynamical system. And there's a trace operator, the sum of the diagonal entries. [01:12:12] So just to summarize, here's the algorithm. You initialize Φ_T = -U and Ψ_T = 0 (sorry, that should be a capital T, the final time step); that's just taking the equation V*_T(s_T) = -s_Tᵀ U s_T and mapping it over, so those two values of Φ and Ψ define V* at time capital T. [01:13:06] Then you recursively calculate Φ_t and Ψ_t using Φ_{t+1} and Ψ_{t+1}, going from t = T-1, T-2, and so on, counting backwards down to 0. You calculate L_t as above, the formula we had over there saying how the optimal action is a function of the current state, depending on A, B, and Φ; and then finally π*_t(s_t) = L_t s_t. [01:14:13] And one really cool thing about this algorithm, about LQR, is that there is no approximation anywhere. You might need to make some approximation steps in order to approximate a helicopter as a linear dynamical system.
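The backward recursion summarized above can be sketched in code. The updates follow the standard finite-horizon LQR equations for the reward -sᵀUs - aᵀVa; the double-integrator matrices are illustrative values, not from the lecture:

```python
import numpy as np

# Sketch of the finite-horizon LQR backward recursion:
#   s_{t+1} = A s_t + B a_t + w_t,  w_t ~ N(0, Sigma_w)
#   R(s, a) = -s^T U s - a^T V a,   U, V positive semi-definite
# V*_t(s) = s^T Phi_t s + Psi_t, with Phi_T = -U and Psi_T = 0.
def lqr_backward(A, B, U, V, Sigma_w, T):
    Phi, Psi = -U, 0.0
    Ls = [None] * T                          # L_t for t = 0 .. T-1
    for t in reversed(range(T)):
        M = B.T @ Phi @ B - V                # negative definite when V > 0
        Ls[t] = -np.linalg.solve(M, B.T @ Phi @ A)   # a_t = L_t s_t
        Psi = np.trace(Sigma_w @ Phi) + Psi          # uses Phi_{t+1}
        Phi = A.T @ (Phi - Phi @ B @ np.linalg.solve(M, B.T @ Phi)) @ A - U
    return Ls, Phi, Psi                      # Phi, Psi parameterize V*_0

# Toy double-integrator example (illustrative values):
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
U = np.eye(2)
V = np.array([[1.0]])
Sigma_w = 0.01 * np.eye(2)
Ls, Phi0, Psi0 = lqr_backward(A, B, U, V, Sigma_w, T=50)
```

Note that L_t is computed from Φ_{t+1} before Φ is updated, matching the order of the recursion on the board, and that the Ψ line is the only place Σ_w ever appears.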
[01:14:27] You might do that either by fitting the matrices A and B to data, or by taking a nonlinear model and linearizing it, and you might need to restrict your choice of reward functions to this quadratic form; but once you've made those assumptions, none of this is approximate, everything is exact. [01:14:53] (Yes, that's right: the approximation steps are in getting your MDP into the form of a linear dynamical system with a quadratic reward, so that part is approximate; but once you've specified the MDP like that, all of these calculations are exact. We're not approximating the value function by a quadratic function; the value function is a quadratic function, and you're computing it exactly, and the optimal policy is a linear function, and you're computing that exactly.) Okay. [01:15:27] Before we wrap up, I want to mention one unusual fun fact about LQR.
unusual fun fact about LQR. [01:15:32] This is very specific to LQR, and it's convenient, but be careful that it doesn't give you the wrong intuition: it doesn't apply to anything other than LQR. [01:15:47] First, look at the formula for L_t. [01:15:58] Even though the whole point of doing all this work is to find the optimal policy (you want L_t so you can compute the optimal policy), notice that L_t depends on Phi_{t+1} but not on Psi_{t+1}. [01:16:25] And maybe that makes sense: when you take an action you get to some new state, and your future payoff is a quadratic function plus a constant, and it doesn't matter what that constant is. So to compute the optimal action you need to know Phi (actually Phi_{t+1}), but you don't need to know Psi_{t+1}. [01:16:53] If you look at the way we do the dynamic programming, the backwards recursion, you can implement a piece of code that doesn't bother to compute Psi at all. These are the two equations you use to update Phi and Psi, but you could delete that line of code and just not compute Psi, [01:17:22] because Phi_t depends on Phi_{t+1} but does not depend on Psi, and so you can implement the whole thing and compute the optimal policy and optimal actions without ever computing Psi. [01:17:41] Now, the funny thing about this is that the only place Sigma_w appears is in the update for Psi_t. [01:18:00] So if you do cross out that line and don't bother to compute Psi_t, then the whole algorithm doesn't even use Sigma_w. So one very interesting property of the LQR formalism is that the optimal policy does not depend on Sigma_w. [01:18:30] V* does depend on Sigma_w, because if the noise is very large, if there are huge gusts of wind blowing the helicopter all over the place, then the value will be worse; but pi* and L_t do not depend on Sigma_w. [01:18:54] So this is a property that's very specific to LQR; don't over-generalize it to other reinforcement learning algorithms. [01:19:06] But the intuition to take from this is: first, if you're actually applying this to a system, don't try too hard to estimate Sigma_w, because you don't actually need to use it, which is why, when we were fitting the linear model, I didn't talk much about how you'd actually estimate Sigma_w.
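The backward recursion just described can be sketched in a few lines of code. This is a generic illustration, not code from the course: it uses the standard cost-minimization convention (cost s_t^T Q s_t + a_t^T R a_t, dynamics s_{t+1} = A s_t + B a_t + w_t), so the signs differ from the lecture's reward-maximization version, and here P plays the role of Phi and K the role of the gain L_t. The point it demonstrates is the one above: every gain comes out of the Phi-recursion alone, and Sigma_w only ever touches the constant term Psi.

```python
import numpy as np

def lqr_backward(A, B, Q, R, T, Sigma_w=None):
    """Finite-horizon LQR backward recursion (cost-minimization convention).

    Returns the gains K_t (optimal action u_t = -K_t s_t) and the constant
    term psi of the value function. Note that the recursion for P, and hence
    every K_t, never touches Sigma_w: the noise covariance enters only psi,
    which you can skip computing entirely.
    """
    P = Q.copy()   # terminal value: V_T(s) = s^T Q s (an illustrative choice)
    psi = 0.0      # constant term of the value function (optional)
    gains = []
    for _ in range(T):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)  # gain for this step
        if Sigma_w is not None:
            psi = psi + np.trace(Sigma_w @ P)  # the ONLY place Sigma_w appears
        P = Q + A.T @ P @ (A - B @ K)          # Riccati update: no Sigma_w
        gains.append(K)
    gains.reverse()  # gains[t] is the gain for time step t
    return gains, psi
```

Running this with two very different noise covariances produces identical gains, which is exactly the "optimal policy does not depend on Sigma_w" property; only the value function's constant term changes.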
[01:19:24] In the LQR formulation it literally doesn't matter, in a mathematical sense, to the optimal policy you compute. [01:19:31] The second, maybe slightly useful, intuition to take away from this is that for a lot of MDPs, if you're building a robot, remember to add some noise to your system, but the exact noise you add matters less than one might think. So what I've seen, working on a lot of robots and a lot of MDPs, is: do add some noise to your system, and make sure your learning algorithm is robust to noise. The form of the noise you add does matter; I'm not saying it doesn't matter at all. In LQR it doesn't matter at all, and for other MDPs it does matter. [01:20:02] But the fact that you remember to add some noise is often, in practice, more important than the exact details, such as whether the noise is 10% higher or lower than the true noise, or even 100% higher or lower. [01:20:17] When I was training, say, the autonomous helicopter, the noise was something I paid a little bit of attention to, but I paid much more attention to making sure that the matrices A and B were accurate; a little sloppiness in estimating the noise is something an MDP, and your policy, can probably survive. Okay, let's take one last question. [01:20:37] [A student asks about a symbol on the board.] Oh, this? Sorry, yes, that's a B. Okay, cool. [01:20:56] Thanks, everyone. Let's break, and I will see you for the final lecture on Wednesday. Thanks, everyone.
================================================================================ LECTURE 020 ================================================================================
RL Debugging and Diagnostics | Stanford CS229: Machine Learning
Andrew Ng - Lecture 20 (Autumn 2018)
Source: https://www.youtube.com/watch?v=pLhPQynL0tY
---
Transcript
[00:00:03] All
right, everyone. So welcome to the final lecture of CS229 this quarter, or, I guess for the home viewers, welcome to the season finale. [00:00:23] What I'd like to do today is wrap up our discussion of reinforcement learning, and then that will conclude the class. [00:00:33] Over the last few lectures you saw a lot of math, so maybe as a brief interlude, here are some videos. This is a sample autonomous helicopter, a project that Pieter Abbeel and Adam Coates, some former students here and now some of the machine learning greats, were on when they were PhD students here, [00:01:05] using algorithms similar to the ones you learned in this class to make a helicopter fly. So, just to have fun, there's a video shot on top of one of the Stanford soccer fields. I was actually the cameraman that day, and as the camera zooms out you can see the trees and the sky. [00:01:38] It turns out this is a small radio-controlled helicopter, but when you're very far away you can't tell whether it's a small radio-controlled helicopter or a helicopter with people sitting in it. [00:01:55] This was on a kind of soccer field, the big grass field off Sand Hill Road, and in the high-rises across Sand Hill Road there was an elderly lady who lived in one of those apartments, and when she saw this she would call 911 and say, "Hey, this helicopter is about to crash," and then the firemen would come out, [00:02:18] and I think they were probably disappointed that there was no one for them to save. [00:02:27] One of the things I promised to do in the lecture on debugging learning algorithms was to go over the reinforcement learning example again, so let me just do that now, but with notation that I think you now understand. Oh, yes? [A student asks why the helicopter flies upside down.] As an aerobatic stunt, yeah; I don't think there's a good reason for flying upside down other than that you can. There are a lot of videos of the Stanford helicopter flying all sorts of stunts; go to heli.stanford.edu. [00:03:14] The Stanford autonomous helicopter did a lot more than fly upside down. It made some maneuvers that look aerodynamically impossible, such as a helicopter that looks like it's tumbling, just spinning randomly, but staying in the same place in the air. It's called the chaos maneuver, and when you look at it you go, "Wow, this thing is turning upside down, spinning around in the same direction, but it's just staying right there in the air and not crashing." [00:03:38] Maneuvers like that are what the very best human pilots in the world can fly with helicopters, and I think this was just a demonstration. [00:03:46] And I think a lot of this work wound up influencing later work on quadcopter drones in a few research labs. It was a difficult control problem, and it was one of those things you do when you're at a university: you want to work on the hardest problems. [00:04:07] But let me step through the debugging process that we went through as we were building a helicopter like this. When you're trying to get a helicopter to fly upside down, to fly stunts, you don't want to crash too often, so step one is to build a model, build a simulator, of the helicopter, much as you saw when we started to talk about fitted value iteration, and then choose the reward function. And it turns out that
specifying the reward function for staying in place is not that hard; a quadratic function like that works okay. But if you want the helicopter to fly aggressive maneuvers, it's actually quite tricky to specify what a good turn for a helicopter is. [00:04:52] Then what you do is run the reinforcement learning algorithm to maximize, say in the finite-horizon MDP formulation, the sum of rewards over T time steps, and you get a policy pi. And whenever you do this, the first time you do it you find that the resulting controller does much worse than the human pilot, and the question is what you do next. [00:05:17] This is almost exactly the slide I showed you last time, except I cleaned up the slide to use reinforcement learning notation rather than the slightly simplified notation you saw before you learned about reinforcement learning. [00:05:32] And again, if you work on a reinforcement learning problem yourself, there's a good chance you'll have to answer this question yourself, for whatever robot or factory-automation or stock-trading system, whatever it is you're trying to get to work with reinforcement learning: do you want to improve the simulator/model, do you want to modify the reward function, or do you want to modify the reinforcement learning algorithm? [00:05:57] Modifying the reinforcement learning algorithm includes things like playing with the discretization you're using, if you're taking a continuous-state MDP and discretizing it to solve a finite-state MDP formulation, or maybe choosing new features to use in fitted value iteration; or, instead of fitting a linear function approximator in fitted value iteration, maybe you want to use a bigger model, you know, a deep neural network. [00:06:26] So which of these steps is the most useful thing to do? Here's the analysis of those three things. If these three statements are true, then the learned controller should have flown well on the helicopter. [00:06:55] Those three statements correspond to the three things in yellow that you could work on. The problem could be that statement one is false, that the simulator isn't good enough; or that statement two is false; or statement three. (Oh, sorry, I think two and three are actually in the opposite order on the slide, but the three statements correspond to the three things in yellow.) Is the RL algorithm actually maximizing the sum of rewards? And is the reward function actually the right thing to maximize?
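Before the diagnostics, here is a concrete sketch of the quadratic hover reward mentioned above: penalize squared deviation from a target hover state, and optionally penalize large control inputs. The weight matrices and target here are illustrative assumptions, not the values used on the Stanford helicopter.

```python
import numpy as np

def hover_reward(s, a, s_target, Q, R):
    """Quadratic hover reward: zero exactly at the target state with zero
    control effort, and increasingly negative as the state drifts away or
    the controls grow. Q and R are positive semi-definite weight matrices
    (illustrative choices only)."""
    ds = s - s_target
    return -(ds @ Q @ ds + a @ R @ a)
```

A reward like this works fine for station-keeping; as the lecture says, specifying a reward for an aggressive maneuver such as a good turn is much harder.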
[00:07:35] So here are the diagnostics you could use. To see whether the helicopter simulator is accurate, first check whether the policy flies well in simulation. If your policy flies well in simulation but not in real life, then that shows the problem is with your simulator, and you should try to learn a better model for your helicopter. [00:08:00] If you're using a linear model with the matrices A and B, that is, s_{t+1} = A s_t + B a_t, you might try to fit a more accurate A and B, or maybe try a nonlinear model. [00:08:12] But if you find that the problem is not your simulator, that is, your policy flies poorly in simulation and poorly in real life, then here is the diagnostic I would use; let me show these two lines. Let pi_human be the human control policy: hire a human pilot, which we did.
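The linear simulator model s_{t+1} = A s_t + B a_t just mentioned is typically fit from logged flight trajectories by least squares. A minimal sketch, assuming you have one trajectory of states and actions (the lecture doesn't prescribe this exact code, and a real system would pool many trajectories and likely add an offset term):

```python
import numpy as np

def fit_linear_dynamics(states, actions):
    """Least-squares fit of s_{t+1} ~ A s_t + B a_t from one trajectory.

    states:  (T+1, n) array of observed states
    actions: (T, m) array of actions taken
    Returns (A, B).
    """
    X = np.hstack([states[:-1], actions])      # regressors [s_t, a_t], shape (T, n+m)
    Y = states[1:]                             # targets s_{t+1}, shape (T, n)
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)  # solve X @ W ~ Y in least squares
    n = states.shape[1]
    A, B = W[:n].T, W[n:].T                    # unstack [A B] from the solution
    return A, B
```

On noiseless data generated by a true (A, B), this recovers the matrices exactly; on real flight logs it gives the least-squares estimate the lecture's linear simulator would use.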
We were fortunate to have one of the best aerobatic helicopter pilots in America working with us, and using his radio-control signals he could make a helicopter fly upside down, tumble, and do flips, loops, and rolls. So we had a very good human pilot helping us fly the helicopter manually. [00:08:56] What you can do is test the following. This quantity here is just the payoff of the learned policy as measured by your reward function, so check whether the learned policy achieves a better or worse payoff than the human pilot. [00:09:23] That means: go ahead and let the learned policy fly the helicopter, then have the human fly the helicopter, and compute the sum of rewards on the sequences of states these two systems take the helicopter through, and just see whether the human or the learned policy achieves the higher payoff, the higher sum of rewards. [00:09:47] If the payoff achieved by the learning algorithm is less than the payoff achieved by the human, then this shows that the learned policy is not actually maximizing the sum of rewards, because whatever the human is doing, he or she is doing a better job of maximizing the sum of rewards than the learned policy. So this means you should consider working on the reinforcement learning algorithm to make it do a better job of maximizing the sum of rewards. [00:10:15] And then, on the flip side, if the inequality goes the other way, if the payoff of the RL policy is greater than the payoff of the human, then what that means is that the RL algorithm is actually doing a better job of maximizing the sum of rewards, but it's still flying worse. And what this tells you is that doing a really good job of maximizing the sum of rewards does not correspond to flying the helicopter the way you actually want.
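The simulator check and the two payoff comparisons above amount to a small decision rule. A hypothetical sketch (the function and its inputs are assumptions for illustration; in practice there is judgment involved at every step):

```python
def rl_diagnostic(flies_well_in_sim, flies_well_in_real,
                  payoff_learned, payoff_human):
    """Suggest which component of an RL system to work on next, following
    the lecture's diagnostics. The payoffs are sums of rewards computed by
    evaluating the reward function on logged state sequences."""
    if flies_well_in_sim and not flies_well_in_real:
        # Good in simulation, bad in reality: the simulator is off.
        return "improve the simulator"
    if payoff_learned < payoff_human:
        # The human beats the learned policy on our own reward function,
        # so the RL algorithm is not maximizing it well.
        return "improve the RL algorithm"
    # The learned policy out-scores the human yet flies worse:
    # the reward function is measuring the wrong thing.
    return "improve the reward function"
```

The three return values correspond to the three items in yellow on the slide.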
[00:10:44] And so that means maybe you should work on improving the reward function; the reward function is not capturing what's actually most important to flying the helicopter well, so you'd then modify the reward function. [00:10:57] So, for a typical workflow, let me describe what it feels like to work on a machine learning project like this. This was a big, multi-year machine learning project, and when you're working on a big, complicated machine learning project like this, the bottleneck moves around. Meaning: you build a helicopter, get a human pilot to fly it, gather data, run these diagnostics, and maybe the first time you do this you find, wow, the simulator is really inaccurate. Then you go work on improving the simulator for a couple of months, and every now and then you come back and rerun this diagnostic. [00:11:33] Maybe for the first two months of the project you keep on finding, yep, the simulator is not good enough, still not good enough; then, after working on the simulator for a couple of months, you may find that item one is no longer the problem. You might then find that item three is the problem: the simulator is now good enough, but when you run this diagnostic two months into the project you might say, wow, it looks like our algorithm is maximizing the reward function, but this is not good flying. So now the biggest bottleneck in the project is that the reward function is not good enough, and you might spend another one or two or three, or sometimes more, months working to improve the reward function. [00:12:14] You might do that for a while, and then, when the reward function is good enough, that exposes the next problem in your system, which might be that the RL algorithm isn't good enough. So the problem you should be working on actually moves around, and it's different in different phases of the project. When you're working on this, it feels like every time you solve the current problem, that exposes the next most important thing to work on; then you work on that, and solving it helps you identify and expose the next most important thing to work on; and you keep iterating, solving problems, until hopefully you get a helicopter that does what you want it to. [00:12:54] But I think teams that have the discipline to prioritize according to diagnostics like this tend to be much more efficient than teams that kind of go by gut feeling in selecting what to spend
your time on all right any questions about this oh sorry so again yeah I kind [00:13:31] about this oh sorry so again yeah I kind of want to say yes let me think [00:13:33] of want to say yes let me think yeah I wouldn't usually check step one [00:13:36] yeah I wouldn't usually check step one first and then if I think the simulator [00:13:38] first and then if I think the simulator is okay then look at steps two and three [00:13:41] is okay then look at steps two and three maybe one of the thing about when you [00:13:44] maybe one of the thing about when you work on these projects there is some [00:13:45] work on these projects there is some judgment involved so I think I'm [00:13:47] judgment involved so I think I'm presenting these things as those a rigid [00:13:49] presenting these things as those a rigid mathematical formula that's cut and dry [00:13:51] mathematical formula that's cut and dry this formula says now working on step [00:13:53] this formula says now working on step one then this one says now work on step [00:13:55] one then this one says now work on step three there is there is more judgment [00:13:58] three there is there is more judgment involved because when you run these [00:14:00] involved because when you run these things I'll say if you might say well [00:14:01] things I'll say if you might say well looks like the simulator is not that [00:14:03] looks like the simulator is not that good but it's kind of good and there's a [00:14:04] good but it's kind of good and there's a little bit ambiguous and oh looks like [00:14:06] little bit ambiguous and oh looks like you know and so that's what it often [00:14:08] you know and so that's what it often feels like and so a team would get [00:14:10] feels like and so a team would get together look for the evidence from all [00:14:12] together look for the evidence from all three steps and then say you know well [00:14:14] three steps and then say you know well maybe the simulator is not that good but 
it's maybe good enough, but boy, the reward function is really bad, let's focus on that. [00:14:22] So this isn't a hard and fast rule; there is some judgment needed to make these decisions. When running machine learning teams, often my teams will, you know, run these diagnostics, get together and look at the evidence, and then discuss and debate what's the best way to move forward. But I think the process of making sure you have that discussion and debate is much better than the alternative, which is, you know, someone just picks something very random and the team does that. [00:15:00] So, yeah, maybe I'll have the laptop up, you know, a little bit for fun, but a little bit because, to illustrate fitted value iteration, let me just show another reinforcement learning video. [00:15:18] Oh, and by the way, I think if
I look at the future of AI, the future of machine learning, you know, there's a lot of hype about reinforcement learning for game playing, which is fine; we all love computers playing computer games, that's a great thing, I think. [00:15:32] But I think some of the most exciting applications of reinforcement learning coming down the pipe will be robotics, in the next few years. Even though there are only a few success stories of reinforcement learning applied to robotics, there are more and more right now. [00:15:46] One of the trends I see, when we look at the academic publications and some of the things making their way into industrial environments, just based on the stuff I see my friends in many different companies and many different entities working on, is that in the next several years I think there will be a rise of reinforcement
learning algorithms applied to robotics. I think that will be one important area to watch out for. [00:16:12] But, um, so, you know, here's an old Stanford video. This is again just using reinforcement learning to get a robot dog to climb over obstacles like these. My friends that were less generous did not want to think of this as a robot dog; they thought it was more like a robot cockroach. But I think cockroaches... well, it does have six legs. [00:16:54] Yeah, but so how do you program a robot dog like this to climb over terrain? One of the key components, and this is work by Zico Kolter, now a Carnegie Mellon professor, another one of the machine learning greats: a key part of this was value function approximation, where the dog starts on the left and its goal is to get to the right. Then the approximate value function, and I'm simplifying a little bit here, but the approximate value function tells
it, given the 3D shape of the terrain. [00:17:33] The middle plot is a height map, where the different shades tell you how tall the terrain is. Given the 3D shape of the terrain, the dog learns a value function that tells it what the cost is of putting its feet at different locations on the terrain, and it learns, among other things, not to put its feet at the edge of a cliff, because then it's likely to slip off the edge of the cliff and fall over. [00:17:55] So hopefully this gives a visualization of what learning value functions, very, very complicated functions, I'll say, looks like. The state is very high-dimensional, so this is all kind of projected onto 2D space so you can visualize it, but this is what a simplified value function looks like for a robot like this. Okay. [00:18:41] All right, so with that, let me return to... um, there's just one class of algorithms I want to describe to you
today, which are called policy search algorithms. [00:18:57] Sometimes policy search is also called direct policy search. To explain what this means: so far, our approach to reinforcement learning has been to first learn or approximate the value function, you know, approximate V*, and then use that to learn, or at least hopefully approximate, π*. Right, so you saw value iteration; our approach so far to reinforcement learning was to estimate the value function, and then use that, you know, that equation with the arg max, to figure out what π* is. [00:19:38] So this is an indirect way of getting at a policy, because we would first try to figure out what the value function is. In direct policy search, we try to find a good policy directly, hence the term direct policy search, because you go straight for trying to find a good policy without the intermediate step
of finding an approximation to the value function. [00:20:06] So, um, let's see, I'm going to use as the motivating example the inverted pendulum, so that thing with the cart here. Let's say your actions are to accelerate left or to accelerate right; and you could also have stay still, accelerate softly, accelerate strongly, you could have more than two actions, but let's just say you have an inverted pendulum with two actions. [00:20:39] We'll talk about the pros and cons of direct policy search later, but if you want to apply direct policy search, the first step is to come up with the class of policies you're entertaining, or come up with the set of functions you use to approximate the policy. [00:20:57] So again, to make an analogy: when you saw logistic regression for the first time, you know, we kind of said that
we would approximate y with a hypothesis h whose form was governed by this sigmoid function, h_θ(x) = 1/(1 + e^(−θᵀx)). [00:21:19] And you remember, in week 2, when I first described logistic regression, I kind of pulled this out of a hat and said, oh yeah, trust me, let's use the logistic function; and then later we saw it's a special case of the generalized linear model. But, you know, we just had to write down some form for how we will predict y as a function of x. [00:21:39] So in direct policy search, we'll similarly have to come up with a form for π: just as we wrote down a functional form for the hypothesis h, in direct policy search we'll have to come up with a form for how we approximate the policy π. And so, you know, one thing we could do is say, well, maybe we'll approximate the action with some policy π, parametrized by θ, which is now a function of the state, and maybe it'll be 1 over 1 plus e to the negative
θ transpose s, you know, the state vector, where the state vector may be something like x, x dot, and then the angle φ and the angle rate φ dot; and if you'd like, maybe add an intercept term. [00:22:31] Okay, and I switched the angle from θ to φ to avoid a conflict in notation, since θ now denotes the policy parameters. Okay, um, this isn't quite the form of the policy we'll use, so let me make one more definition and then I'll show you a specific form of policy you can use; it's actually not quite this, we need to tweak it a little bit. [00:22:51] So the direct policy search algorithm will use a stochastic policy. This is a new definition: a stochastic policy is a function π(s, a) giving the probability of taking action a in state s. For the direct policy search algorithm that you see today, we're going to use stochastic policies, meaning that on every time step the policy will tell you what's the chance you want to accelerate
left versus what's the chance you want to accelerate right, and then you use a random number generator to select either left or right to accelerate on your inverted pendulum, depending on the probabilities output by the policy. [00:24:17] Okay, and so here's one example. Continuing with the inverted pendulum, here's one policy that might be reasonable, where you say that in a state s, the chance you take the accelerate-right action is given by this sigmoid function, π_θ(s, right) = 1/(1 + e^(−θᵀs)), and the chance that in the state s you take the accelerate-left action is given by π_θ(s, left) = 1 − 1/(1 + e^(−θᵀs)). [00:25:25] Okay, and here's one example of why this might be a reasonable policy. Let's say the state vector s is (1, x, x dot, φ, φ dot), where, you know, the angle of the inverted pendulum is the angle φ, and let's say for the sake of argument
that we set the parameter θ of this policy to be (0, 0, 0, 1, 0). [00:25:58] In this case, this is saying that θ transpose s is just equal to φ, right, because θ transpose s is just 1 times φ and everything else gets multiplied by zero. And so in this case, this says that the chance to accelerate to the right is equal to 1 over 1 plus e to the negative of how far the pole is tilted over to the right, and so this policy gives you the effect that the further the pole is tilted to the right, the more aggressively you want to accelerate to the right. [00:26:40] Okay, so this is a very simple policy; it's not a great policy, but it's not a totally unreasonable policy, which is: well, look at how far the pole is tilted to the right, apply the sigmoid function, and then accelerate to the left or right, you know, depending
on how far it's tilted to the right. [00:26:58] And because of this, right, this is really the chance of taking the accelerate-right action as a function of the pole angle φ. Now, this is not the best policy, because it ignores all the features other than φ; but if you were to instead set θ equal to, you know, (0, −0.5, 0, 1, 0), then the −0.5 now multiplies into the x position. [00:27:37] This new policy, if you have this value of θ, takes into account how far your cart is already to the right, where I guess x is the distance; and the further your cart is already... I guess your cart is on a set of wheels, right, it's on a railway track, and you don't want to fall off the end, you want to keep the cart kind of centered, you don't want to fall off the end of your table. But this now says: the further this is to the right already, the less likely you should be
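As a minimal sketch, the stochastic sigmoid policy described here, with the state layout (1, x, x dot, φ, φ dot) and θ = (0, 0, 0, 1, 0) from the lecture's example, might look like this (the function names and the specific state values are mine, for illustration):

```python
import numpy as np

def pi_right(theta, s):
    """pi_theta(s, right) = 1 / (1 + exp(-theta^T s)): chance of accelerating right."""
    return 1.0 / (1.0 + np.exp(-theta @ s))

def sample_action(theta, s, rng):
    """Flip a biased coin: 'right' with probability pi_right(theta, s), else 'left'."""
    return "right" if rng.random() < pi_right(theta, s) else "left"

# State vector (1, x, x_dot, phi, phi_dot); with theta = (0, 0, 0, 1, 0),
# theta^T s = phi, so the further the pole tilts right, the likelier we push right.
theta = np.array([0.0, 0.0, 0.0, 1.0, 0.0])
s_upright = np.array([1.0, 0.0, 0.0, 0.0, 0.0])   # phi = 0   -> P(right) = 0.5
s_tilted  = np.array([1.0, 0.0, 0.0, 0.5, 0.0])   # phi = 0.5 -> P(right) > 0.5
```

With the second parameter setting, θ = (0, −0.5, 0, 1, 0), the same code also becomes less likely to push right when the cart position x is already large, since the −0.5 multiplies x inside θᵀs.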
to accelerate to the right. [00:28:08] Okay, and so maybe this is a slightly better policy with this setting of the parameters. More generally, what you would like is to come up with five numbers that tell you how to trade off how much you should accelerate to the right based on the position, velocity, angle, and angular velocity, the current state of the cart, of the inverted pendulum. [00:28:34] And what a direct policy search algorithm will do is help you come up with a set of numbers that results in hopefully a reasonable policy for controlling the inverted pendulum, a policy that hopefully results in an appropriate set of probabilities that cause it to accelerate to the right whenever it's good to do so, and similarly to the left more often when it's good to do so. [00:28:58] So the goal is to find the five parameters θ, so that when we execute π_θ(s, a) we maximize over θ the expected
the expected value of R of s 0 is 0 plus dot dot plus [00:29:49] and so the reward function could be [00:29:51] and so the reward function could be negative 1 whenever the inverted [00:29:53] negative 1 whenever the inverted pendulum falls over and 0 whenever it [00:29:56] pendulum falls over and 0 whenever it stays up or whatever or something that [00:29:59] stays up or whatever or something that measures how well you betcha Panem is [00:30:00] measures how well you betcha Panem is doing but the goal of a direct policy [00:30:03] doing but the goal of a direct policy search algorithm is to choose a set [00:30:06] search algorithm is to choose a set parameters theta so that we actually the [00:30:08] parameters theta so that we actually the policy you maximize your expected payoff [00:30:10] policy you maximize your expected payoff and I'm gonna use to find a horizon [00:30:12] and I'm gonna use to find a horizon setting for the album that was helpful [00:30:15] setting for the album that was helpful today okay and then one one other [00:30:18] today okay and then one one other difference between policy search [00:30:21] difference between policy search compared to estimating the value [00:30:24] compared to estimating the value function is that indirect policy search [00:30:28] function is that indirect policy search here as 0 is a fixed initial State [00:30:39] it turns out that when we were [00:30:42] it turns out that when we were estimating the value function V saw you [00:30:46] estimating the value function V saw you found the best possible policy for [00:30:48] found the best possible policy for starting from any state right and [00:30:50] starting from any state right and there's kind of no matter what state you [00:30:51] there's kind of no matter what state you start from is simultaneously the best [00:30:53] start from is simultaneously the best possible policy for all states indirect [00:30:55] possible policy for all states indirect policy search we 
start from, it's simultaneously the best possible policy for all states. [00:30:55] In direct policy search, we assume that either there's a fixed start state, a fixed initial state s_0, or there's a fixed distribution over initial states, and we're going to try to maximize the expected reward with respect to your fixed initial state, or with respect to an initial probability distribution over what the initial state is. Okay, so that's one other difference. [00:31:33] All right, so to write this out, the goal is to maximize over θ the expected value of R(s_0, a_0) + R(s_1, a_1) + ... + R(s_T, a_T), given π_θ. And in order to simplify the math we write on this board today, I'm just going to set capital T equal to 1, in order to not carry around such a long summation. So I'm just doing, like, a two-time-step MDP just to simplify the derivation, but everything works, you know, just with a longer sum, if you have a more general version of T. [00:32:29] And so this term here, the
expectation, is equal to a sum over all possible state-action sequences, right; and again this would go up to s_T and a_T, but we've just set capital T equal to 1. [00:32:46] What's the chance your MDP starts out in some state s_0? So this is your initial state distribution, times the chance that in that state you take the first action a_0... oh, sorry, just let me write this out: so it's the chance of your MDP going through the state-action sequence, times the payoff. That's what it means to compute the expected value of the payoff. [00:33:26] And so, instead of writing out all of this sum, I'm just going to call this the payoff, and so this is equal to the sum over s_0, a_0, s_1, a_1 of P(s_0) π_θ(s_0, a_0) P_{s_0 a_0}(s_1) π_θ(s_1, a_1) times the payoff: the chance the MDP starts in state s_0, times the chance that in state s_0 you end up choosing the action a_0, times the chance, governed by the state transition probabilities, that you end up in state s_1, times the chance that in state
s_1 you end up choosing action a_1, and then times the payoff. Okay. [00:34:14] And so what we're going to be able to do is derive a gradient ascent algorithm, actually a stochastic gradient ascent algorithm, as a function of θ, to maximize this thing, to maximize the expected value of the payoff, and this is how we'll do direct policy search. Okay, so let me just write out the algorithm, and then we'll go through why the algorithm that I write down is maximizing this expected payoff. [00:35:06] So this algorithm is called the REINFORCE algorithm. The original REINFORCE algorithm had a few other bells and whistles, but this captures the core idea. What the REINFORCE algorithm does is the following: you're going to run your MDP, right, just, you know, run it for a trajectory of T time steps. So again, you know, I'm just going to... well,
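To make the expectation concrete: for a small enough two-step MDP you can literally enumerate every state-action sequence (s0, a0, s1, a1) and sum probability times payoff. The sketch below does exactly that; all the numbers, and the one-parameter toy policy, are invented for illustration:

```python
import numpy as np
from itertools import product

# Toy two-step MDP (capital T = 1): 2 states, 2 actions (0 = left, 1 = right).
P0 = np.array([0.8, 0.2])                    # initial state distribution P(s0)
P  = np.array([[[0.9, 0.1], [0.2, 0.8]],     # P[s, a, s'] transition probabilities
               [[0.7, 0.3], [0.1, 0.9]]])
R  = np.array([[0.0, 1.0], [1.0, 0.0]])      # reward R(s, a)

def pi_theta(theta, s):
    """Stochastic policy: probabilities of (left, right) in state s."""
    p_right = 1.0 / (1.0 + np.exp(-theta * (s - 0.5)))  # toy one-parameter policy
    return np.array([1.0 - p_right, p_right])

def expected_payoff(theta):
    """Sum over all sequences of P(s0) pi(s0,a0) P_{s0 a0}(s1) pi(s1,a1) * payoff."""
    total = 0.0
    for s0, a0, s1, a1 in product(range(2), repeat=4):
        prob = (P0[s0] * pi_theta(theta, s0)[a0]
                * P[s0, a0, s1] * pi_theta(theta, s1)[a1])
        total += prob * (R[s0, a0] + R[s1, a1])
    return total
```

Of course this enumeration blows up exponentially in T, which is one way to see why the algorithm about to be described instead samples trajectories and follows a stochastic gradient.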
actually, technically you would run it for T time steps, but, you know, let's just say for now we'll do only the thing in blue: we run it for one time step, so keeping capital T equal to 1. [00:35:58] And then you would compute the payoff, which equals R(s_0) + R(s_1), and then in the more general case, you know, + ... + R(s_T). And then you perform the following update, which is that θ gets updated as θ plus the learning rate α times [∇_θ π_θ(s_0, a_0) / π_θ(s_0, a_0) + ∇_θ π_θ(s_1, a_1) / π_θ(s_1, a_1)], and then times the payoff. [00:36:58] And again, I'm just setting capital T equal to 1; if capital T were bigger, you would just sum this all the way up to time T. So that's the algorithm; that's one iteration of the REINFORCE algorithm. [00:37:15] On each iteration of the REINFORCE algorithm, you will take your robot, take your inverted pendulum, run it through T time steps executing your current policy: choose actions randomly according to the current
[00:37:30] stochastic policy, using the current values of the parameters theta; compute the total sum of rewards you receive, which we call the payoff; and then update theta using this funny formula. Now, on every iteration of this algorithm you're going to update theta, and it turns out that this is a stochastic gradient ascent algorithm. You remember when we talked about linear regression, right, you saw me draw pictures like this: if there's a global minimum, then gradient descent would just, you know, take a straight path to the minimum, but stochastic gradient descent would take a more random path towards the minimum, and it kind of also bounces around there; maybe it doesn't quite converge unless you slowly decrease the learning rate alpha. So that's what we have for stochastic gradient descent for linear regression. What we'll see in a minute is that REINFORCE is, likewise, a stochastic gradient ascent algorithm,
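The per-iteration update just described — run one trajectory, then theta := theta + alpha · (sum over t of the gradient of log pi_theta(s_t, a_t)) · payoff — can be sketched in code. This is a minimal illustration, not from the lecture: the two-action sigmoid policy and all function names are my assumptions.

```python
import numpy as np

def p_right(theta, s):
    """Sigmoid stochastic policy: probability of the 'right' action (assumed form)."""
    return 1.0 / (1.0 + np.exp(-theta @ s))

def grad_log_pi(theta, s, a):
    """Gradient of log pi_theta(a | s) for the two-action sigmoid policy.

    For a = 1 ('right'): d/dtheta log p       = (1 - p) * s
    For a = 0 ('left') : d/dtheta log (1 - p) = -p * s
    Both cases collapse to (a - p) * s.
    """
    return (a - p_right(theta, s)) * s

def reinforce_update(theta, trajectory, payoff, alpha=0.01):
    """One REINFORCE iteration: theta += alpha * sum_t grad log pi(a_t|s_t) * payoff."""
    grad = sum(grad_log_pi(theta, s, a) for s, a in trajectory)
    return theta + alpha * grad * payoff
```

With a positive payoff, the update raises the probability of the actions actually taken on that trajectory; with a payoff near zero it barely moves — matching the "noisy but unbiased" behavior discussed next.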
[00:38:33] meaning that each of these updates is random, because it depends on what was the state-action sequence you just saw and what was the payoff you just saw. But what we'll show is that, on expectation, the average update — this update to theta, this thing you're adding to theta — on average, this update here is exactly in the direction of the gradient. So, on average — yeah, because every time through this loop you're making a random update to theta, and this is random and noisy because it depends on this random state sequence, right; the sequence is random because of the state transition probabilities, and also because of the fact that you're choosing actions randomly. But the expected value of this update, as you'll see in a little bit, turns out to be exactly
the direction of the gradient, which is why the REINFORCE algorithm is a gradient ascent algorithm. So let's show that now.
[00:40:04] All right, so what we want to do is maximize the expected payoff, which is the formula we derived up there, and so we're going to want to take derivatives with respect to theta of the expected payoff. I'm just going to copy that formula from up there: that's the chance of going through that state-action sequence, times the payoff. And so we want to take derivatives of this, so that, you know, we can go uphill using gradient ascent. We're going to do this in a few steps. First, I want to remind you how you take the derivative of a product of three things. So let's say that you have three functions, f of theta times g of theta times h of theta. So, by the product rule — the product rule for derivatives from calculus — the derivative of the product
of three things is obtained by, you know, taking the derivative of each of them one at a time. So this is f prime of theta times g of theta times h of theta, plus f times g prime times h, plus f times g times h prime. So the product rule from calculus is that if you want to take the derivative of a product of three things, then you take the derivatives one at a time, and you end up with a sum of three terms, right? And so we're going to apply the product rule to this, where here we have two different terms that depend on theta, and so when we take the derivative of this thing with respect to theta, we're going to have two terms that correspond to taking the derivative of this one and taking the derivative of that one. And so this derivative is equal to — so the first term is the sum over all the state-action sequences; you have s0, and then, let's see, so now we have pi theta — excuse me, the derivative with respect to theta of pi theta of s0, a0 —
[00:43:37] and then plus — oh, and then times the payoff, right; so the whole thing here is then multiplied by the payoff, OK? So we just applied the product rule from calculus, where for the first term in the sum we took the derivative of this first thing, and for the second term in the sum we took the derivative of the second thing. And now I'm going to use one more algebraic trick, which is: I'm going to multiply and divide by that same term, and then multiply and divide by the same thing here, right? So, lots of multiplying and dividing by the same thing. And then finally — so now the final step is I'm going to factor out these terms I'm underlining, because these terms I underlined, this is just, you know, the probability of the whole state sequence, right? And again, for the orange thing: these two orange things multiplied together are equal to that, for each thing in that
box as well.
[00:45:15] And so the final step is to factor out the orange box, which is just P of s0, a0, s1, a1 — right, that's the thing I boxed up in orange — times those two terms involving the derivatives.
[00:46:03] OK, and — right, where I guess this term goes there and this term goes there. And so this is just equal to — well, if you look at the REINFORCE algorithm that we wrote down, this is just equal to the sum over, you know, all the state-action sequences of the probability of the sequence times the gradient update — [00:47:00] I guess I'm running out of colors, but this is the gradient update — and that's just equal to this thing, OK? So what this shows is that even though on each iteration the direction of the gradient update is random, the expected value of how you update the parameters is exactly equal to the derivative of your objective, your expected total payoff.
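The board derivation just described, written out for the T = 1 case (notation reconstructed to match the lecture's pi_theta(s, a) convention; the factors P(s0) and P_{s0 a0}(s1) are the state-transition probabilities, the parts that do not depend on theta):

```latex
\begin{aligned}
\nabla_\theta \, \mathbb{E}[\text{payoff}]
  &= \nabla_\theta \sum_{(s_0,a_0,s_1,a_1)}
     P(s_0)\,\pi_\theta(s_0,a_0)\,P_{s_0 a_0}(s_1)\,\pi_\theta(s_1,a_1)\cdot \text{payoff} \\
  &= \sum_{(s_0,a_0,s_1,a_1)}
     \Big[ P(s_0)\,\big(\nabla_\theta \pi_\theta(s_0,a_0)\big)\,P_{s_0 a_0}(s_1)\,\pi_\theta(s_1,a_1) \\
  &\qquad\qquad
     + P(s_0)\,\pi_\theta(s_0,a_0)\,P_{s_0 a_0}(s_1)\,\big(\nabla_\theta \pi_\theta(s_1,a_1)\big)
     \Big]\cdot \text{payoff}
     \qquad \text{(product rule)} \\
  &= \sum_{(s_0,a_0,s_1,a_1)} P(s_0,a_0,s_1,a_1)
     \left[ \frac{\nabla_\theta \pi_\theta(s_0,a_0)}{\pi_\theta(s_0,a_0)}
          + \frac{\nabla_\theta \pi_\theta(s_1,a_1)}{\pi_\theta(s_1,a_1)} \right]\cdot \text{payoff}
     \qquad \text{(multiply and divide)} \\
  &= \mathbb{E}\!\left[ \big( \nabla_\theta \log \pi_\theta(s_0,a_0)
          + \nabla_\theta \log \pi_\theta(s_1,a_1) \big)\cdot \text{payoff} \right],
\end{aligned}
```

which is exactly the expected value of the REINFORCE update direction.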
So we started by saying that this formula is your expected total payoff; then we asked, what's the derivative of your expected total payoff; and we found that the derivative of your expected total payoff — the derivative of the thing you want to maximize — is equal to the expected value of your gradient update. And so this proves that, on average — you know, if you have a very small learning rate, you end up averaging over many steps — but on average, the update that REINFORCE takes on every iteration is exactly in the direction of the derivative of the expected total payoff that you're trying to maximize. Any questions about this? Yeah?
[00:48:35] [Student question] Oh — is this dependent on the choice of this function? This is true for any form of stochastic policy, where the definition is that pi theta of s0, a0 has to be the chance of taking that action in that state. But this could be
any function you want: [00:48:56] it could be a softmax, a logistic function of many different, complicated features — it could be any continuous, differentiable function. And actually, one of the reasons we shifted to stochastic policies was that previously we just had two actions, either left or right, and you can't define a derivative over a discontinuous function like "either left or right". But now we have a probability that shifts smoothly between the probability of going left and the probability of going right, and by making this a continuous function of theta, you can then take derivatives and find the gradient of this function.
[00:49:51] So, another way to train a helicopter controller is to use supervised learning, where you have a human expert — so you can actually have a human pilot demonstrate, and just say "in this state, take this action" — and then you use supervised
learning to just learn directly a mapping from the state to the action. I think this — I don't know — this might be OK for low-speed helicopter flight; I don't think it works super well. I bet you could do this and not crash a helicopter, but to get the best results I wouldn't use this approach; it turns out that for some of the maneuvers, the learned controllers fly better than human pilots as well.
[00:50:37] Oh, and so for other types of policies — [inaudible] [Applause]
[00:51:01] So, direct policy search also works if you have continuous-valued actions and you don't want to discretize the action. So here's a simple example: let's say a is a real number, such as the magnitude of the force you apply to accelerate left or right — instead of discretizing, for your inverted pendulum you want to output a continuous number for how hard you push left or right — or, for a self-driving car, maybe the action is the steering angle, which is a real-valued
number. So a simple policy would be: a equals theta transpose s, plus Gaussian noise. And if, just for the purpose of training, you're willing to pretend that your policy is to apply the action theta transpose s plus a little bit of Gaussian noise, then the whole framework of REINFORCE — this type of gradient ascent — will also work. And I guess when actually implementing this, you'd probably turn off the Gaussian noise at run time; I know there are little tricks like that as well.
[00:52:13] Um, so let's see some pros and cons: when should you use direct policy search, and when should you use value iteration or a value-function-based type of approach? It turns out there are two settings where direct policy search works much better. One is if you have a POMDP — the "P.O." in this case stands for partially observable, and that's it.
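Back to the continuous-action policy just mentioned — a equals theta transpose s plus Gaussian noise — the log-probability gradient that REINFORCE needs has a closed form. A minimal sketch, not from the lecture; the fixed noise scale sigma and all names are my assumptions:

```python
import numpy as np

def sample_action(theta, s, sigma=0.1, rng=None):
    """Linear-Gaussian policy: a ~ Normal(theta^T s, sigma^2)."""
    rng = rng or np.random.default_rng()
    return theta @ s + sigma * rng.normal()

def grad_log_pi_gaussian(theta, s, a, sigma=0.1):
    """For a ~ Normal(theta^T s, sigma^2):
    log pi = -(a - theta^T s)^2 / (2 sigma^2) + const, so
    grad_theta log pi = (a - theta^T s) / sigma^2 * s.
    """
    return (a - theta @ s) / sigma**2 * s
```

This gradient plugs straight into the REINFORCE update: actions that came out above the current mean and earned a high payoff pull the mean action toward them.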
[00:52:53] For example, you know, for the inverted pendulum: there's the pole angle phi, you have the cart, and this is your position x, and we've been saying that the state space is x, x dot, phi, phi dot, right? But let's say that the sensors on this inverted pendulum allow you to measure only the position and only the angle of the inverted pendulum. So you might have an angle sensor, you know, down here, and you might have a position sensor for your inverted pendulum, but maybe you don't know the velocity and you don't know the angular velocity, right? So this is an example of a partially observable Markov decision process, because — what this means is that on every step you do not get to see the whole state, because you don't have enough sensors to tell you exactly what the state of the entire system is. So in a partially observable MDP, at each step you get a partial and
potentially noisy measurement of the state, right, and then you have to take actions — you have to choose an action a — using these partial and potentially noisy measurements. Which is — maybe you only observe the position and the angle, but your sensors aren't even totally accurate, so you get a slightly noisy estimate of the position and a slightly noisy estimate of the angle, and you just have to choose an action based on your noisy estimates of just two of the four state variables.
[00:54:50] It turns out that there's been a lot of academic literature trying to generalize value-function-based approaches to POMDPs, and there are very complicated algorithms in the literature for trying to apply value-function-based approaches to POMDPs, but those algorithms, despite their very high level of complexity, are not widely in production. But if you use a direct policy search algorithm,
then [00:55:17] there's actually very little problem. Oh, let me just write this down. So let's say the observation is: on every time step you observe y equals (x, phi) plus noise, right? So you just don't know what the state is. And in a POMDP you cannot approximate the value function — or, even if you knew what V star was, you can't compute pi star, because — I mean, maybe you know what pi star is; let's say you could compute V star and pi star — but if you don't know what the state is, you can't apply pi star to the state, because the state isn't observed. So how do you choose an action? If you're using direct policy search, then here's one thing you could do, which is: you can say that pi theta of — given an observation, the chance of going to the right given your current observation — is equal to 1 over 1 plus e to the negative theta transpose y, where I guess y can be, you know, (1, x plus noise, phi plus noise).
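A minimal sketch of that observation-based policy for the pendulum (the noise level, the leading intercept feature, and all names are my assumptions, not the lecture's):

```python
import numpy as np

def observe(x, phi, noise_std=0.05, rng=None):
    """Partial, noisy observation: position and angle only (no velocities),
    with an intercept feature prepended, y = (1, x + noise, phi + noise)."""
    rng = rng or np.random.default_rng()
    return np.array([1.0,
                     x + noise_std * rng.normal(),
                     phi + noise_std * rng.normal()])

def p_right(theta, y):
    """pi_theta(right | y) = 1 / (1 + exp(-theta^T y))."""
    return 1.0 / (1.0 + np.exp(-theta @ y))
```

REINFORCE then runs exactly as before, just with the observation y in place of the full state s.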
[00:56:31] And so you could run REINFORCE using just the observations you have — you still stochastically, randomly choose an action — and nothing in the framework we talked about prevents this algorithm from working. And so direct policy search just works very naturally even if you have only partial observations of the state. And, more generally, instead of plugging in the direct observations, this can be any set of features.
[00:57:02] Let me just make a side comment — for those who don't know what Kalman filters are, don't worry about it — but one common way of using direct policy search would be to use some state estimate, such as from a Kalman filter or a probabilistic graphical model or something, to use your historical estimates: don't just look at your one set of measurements now, but look at all your historical measurements, and then there are algorithms, such as something called a Kalman filter, that
let you estimate, from whatever you have, the current state — the full state vector. You can plug that full state vector estimate into the features you use to choose an action; that's a common design paradigm. If you don't know what a Kalman filter is, don't worry about it, but you take your noisy measurements and estimate the full state from them — yeah, that's one common paradigm, where you use your partial observations to estimate the full state and plug that in as features to the direct policy search.
[00:57:49] OK, so that's one setting where direct policy search just applies, in a way where value function approximation is very difficult to even get to apply. Now, one last thing — one last consideration for whether to apply a policy search algorithm or a value-function approximation algorithm. Oh — it turns out the REINFORCE algorithm is actually very inefficient, as in, you end
[00:58:21] up — you know, when you look at research papers on the REINFORCE algorithm, it's not unusual for people to run the REINFORCE algorithm for like a million iterations or ten million iterations just to train it. It turns out the gradient estimates for the REINFORCE algorithm, even though their expected value is right, are actually very noisy, and so if you train with the REINFORCE algorithm you end up just running it for a very, very, very long time. It does work; it's just a pretty inefficient algorithm. So that's one disadvantage of the REINFORCE algorithm: the gradient estimates on expectation are exactly what you want them to be, but there's a lot of variance in the gradient, so you have to run it for a long time with a very small learning rate. But one other reason to use direct policy search is — it comes down to asking yourself: do you think
pi star is simpler, or is V star simpler? Right? And so here's what I mean. In robotics there are sometimes what we call low-level control tasks. One way to think of low-level control tasks: flying a helicopter — hovering a helicopter — is an example of a low-level control task. And one way to informally think of low-level control tasks is a really skilled human, you know, holding a joystick, controlling this thing, making seat-of-the-pants decisions, right? Those are kind of almost instinctual: in a tiny fraction of a second, almost by feel, you can control the thing. Those tend to be low-level control tasks — whether it's a person holding a joystick, a skilled person balancing that inverted pendulum, or, you know, steering a helicopter; those are low-level control tasks. In contrast, playing chess is not a low-level control task, because for
[01:00:18] the most part, being a very good chess player is not really a seat-of-the-pants, you know, make-a-decision-in-0.1-seconds kind of thing, right? You kind of have to think multiple steps ahead. [01:00:29] In low-level control tasks, there's usually some control policy that is quite simple, a very simple function mapping states to actions, that's pretty good. And so that allows you to specify a relatively simple class of functions for π*, and direct policy search would be relatively promising for tasks like those. Whereas in contrast, if you want to play chess, or Go, or do these things, you have multiple steps of reasoning. I think that if you're driving a car on a straight road, that's a low-level control task: you just look at the road and, you know, turn the steering wheel a little bit to stay on the road. So that's a low-level control task. But if you are planning how to, you know, overtake this car and avoid
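The "very simple function mapping states to actions" point can be made concrete with a small sketch. All of this is an illustrative assumption, not the lecture's setup: a toy linearized inverted pendulum with made-up constants, and plain random hill-climbing standing in for direct policy search over a linear policy.

```python
import numpy as np

DT, G = 0.02, 9.8  # assumed time step and gravity for the toy dynamics

def rollout(w, steps=500):
    """Return how many steps the linear policy u = w . s keeps the pole near upright."""
    s = np.array([0.05, 0.0])  # state [angle, angular velocity], slightly off-vertical
    for t in range(steps):
        u = float(w @ s)                    # the simple linear state-to-action map
        angle_acc = G * s[0] - u            # linearized pendulum plus control torque
        s = s + DT * np.array([s[1], angle_acc])
        if abs(s[0]) > 0.5:                 # pole fell over
            return t
    return steps

# Crude direct policy search: random hill-climbing over the two policy weights.
rng = np.random.default_rng(0)
best_w = np.zeros(2)
best_score = rollout(best_w)
for _ in range(300):
    w = best_w + rng.normal(scale=2.0, size=2)
    score = rollout(w)
    if score > best_score:
        best_w, best_score = w, score

print(best_score)  # a good linear gain keeps the toy pendulum up far longer than w = 0
```

The point is that the policy class here is just two numbers; no value function is ever computed, which is why direct policy search is attractive for this kind of low-level control task.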
that other car, whether there's a pedestrian or a bicycle along the way, then that's less of a low-level control task, and that requires more multi-step reasoning, right? I guess it depends how aggressive a driver you are; driving on the highway, you know, may require more or less multi-step reasoning, where you want to overtake this car before the truck comes into this lane. So that type of thing is more multi-step reasoning, and problems like that tend to be difficult for a very simple function, like a linear function, to be a good policy for. And for those things, playing chess, playing Go, playing checkers, a value function approximation approach may be more promising. Okay, so any questions about this? [01:02:02] Oh, and so again, on helicopter flight: actually, my first attempts at flying helicopters used direct policy search, because flying helicopters is a seat-of-the-
pants thing. But then when you try to fly more complex maneuvers, you end up using something maybe closer to a value function approximation method. So if you want to fly a very complicated maneuver: the video you saw just now, the helicopter flying upside down, the algorithm implemented for that particular video was a different policy search algorithm, right, not exactly this one, a little bit different, but that was still a policy search algorithm. But if you want the helicopter to fly a very complicated maneuver, then you need something maybe closer to the value function approximation methods. [01:02:48] And there is exciting research on how to blend direct policy search approaches together with value function approximation approaches. So actually, AlphaGo, you know, the Go-playing program written by DeepMind: one of the reasons AlphaGo worked was that there was a blend of ideas
from both of these types of literature, which enabled it to scale to a much bigger system, to play Go in a very, very impressive way. [01:03:16] All right, any questions about this? [01:03:26] Alright, um, so just some final application examples. You know, reinforcement learning today is making strides, let's see. So there's a lot of work on reinforcement learning for game playing: checkers, chess, Go. That is exciting. Um, reinforcement learning today is used in a growing number of robotics applications, I think for controlling a lot of robots. If you go to robotics conferences, if you look at some of the projects being done by some of the very large companies that make very large machines, right, I have many friends in multiple, you know, large companies making large machines that are increasingly using reinforcement learning to control them. There is fascinating work using reinforcement learning for
optimizing factory deployments. There's academic research, we're still in the research stage as far as I know, I mean, maybe some of it's deployed, on using reinforcement learning to build chatbots, and actually on using reinforcement learning to build an AI-based guidance counselor, for example, right, where the actions you take are what you say to students, and then the reward is, you know, do you manage to help a student navigate their coursework or navigate their career. [01:04:51] And reinforcement learning is also starting to be applied to healthcare, where one of the keys of reinforcement learning is this sequential decision-making process, right, where you have to take a sequence of decisions that may affect your reward over time. And I think in healthcare there is work on medical planning, where the goal is not, you know, send
you to get a blood test and then we're done, right? [01:05:17] In complicated medical procedures we might first get a blood test, then based on the outcome of the blood test we might send you to get a biopsy or not, all right, or ask you to take a drug and then come back in two weeks. There's this very complicated sequential decision-making process for treatment of complicated healthcare conditions, and so there's fascinating work on trying to apply reinforcement learning to that sort of multi-step reasoning, where it's not "we send you for treatment and then never see you again for the rest of your life"; it's more "here's the first thing you do, then come back; let's see what state you get to after taking this blood test; let's see what state you get to after trying a drug", and then coming back in a week to see what has happened to the symptoms. So I think that these are all
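The blood-test-then-biopsy-or-drug plan described above is a small sequential decision problem. As a toy sketch with entirely invented numbers (none of these probabilities or action names are from the lecture), backward induction picks the best second-stage action for each possible test result and then values the plan as a whole:

```python
# Assumed success probabilities P(good outcome | test result, action); invented values.
SUCCESS = {
    "positive": {"biopsy": 0.9, "drug": 0.6},
    "negative": {"biopsy": 0.5, "drug": 0.8},
}
P_POSITIVE = 0.3  # assumed probability the blood test comes back positive

# Backward induction, step 1: the best action for each possible test result.
policy = {result: max(actions, key=actions.get) for result, actions in SUCCESS.items()}

# Step 2: the value of ordering the test is the expected value under that policy.
value = (P_POSITIVE * SUCCESS["positive"][policy["positive"]]
         + (1 - P_POSITIVE) * SUCCESS["negative"][policy["negative"]])

print(policy)  # {'positive': 'biopsy', 'negative': 'drug'}
print(value)   # expected success probability, roughly 0.83
```

This is exactly the structure that makes the problem sequential: the second decision depends on the state reached after the first, which is what distinguishes it from a one-shot prediction.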
sectors where reinforcement learning is making inroads. [01:06:01] Or even, actually, stock trading, okay, maybe not the most inspiring one, but one of my friends on the East Coast was working on this. And actually, if you or your parents invest in mutual funds, this may be being used to buy and sell shares for them today, depending on what bank they're investing with; I know which bank is doing this, but I won't say it out loud. [01:06:23] But if you want to buy or sell, you know, say a million shares of stock, a very large volume of stock, you may not want to do it in a very public way, because that will affect the price of the shares, right? If everyone knows that a very large investor is about to buy a million shares, or buy ten million shares or whatever, that will cause the price to increase, and this is a disadvantage to the person wanting to buy the shares. And so there's been very interesting work on using reinforcement
learning to decide the sequence in which you'll buy, how to buy the stock in small lots. [01:07:00] This trading market is called dark pools; you can Google it if you're curious. These allow you to buy, or sell, a very large lot of shares without affecting the market price too much, because the way you affect the market price always breaks against you; it's always bad for you, right? So this work plays out there as well. [01:07:22] So anyway, I think, um, there are many applications. I personally think that one of the most exciting areas for reinforcement learning will be robotics, but, well, we'll see what happens over the next few years. [01:07:35] All right, so let's see, we have just five more minutes, so just to wrap up: I think, you know, we've gone through quite a lot of stuff, I guess, from supervised learning to learning
theory and advice for applying learning algorithms, to unsupervised learning, which was, let's see, k-means, PCA, EM, mixtures of Gaussians, factor analysis, independent component analysis, to most recently reinforcement learning with value function approaches, fitted value iteration, and policy search. So it feels like, it feels like you've seen a lot of learning algorithms. Um, go ahead. [01:08:18] [Student: how does adversarial learning compare to reinforcement learning?] I think of those as pretty distinct notions, yeah. Yeah, so I think, and again, actually, I know a lot of non-publicly-known facts about the machine learning world, but one of the things that I happen to know is that some of these ideas of adversarial learning, you know, can you take a picture and change it a very little bit, by tweaking a bunch of pixel values not visible to the human eye, that fools a learning algorithm into
thinking that this picture is actually a cat when it's clearly not a cat, or whatever. So I actually know that there are attackers out in the world today using techniques like that to attack, you know, websites, to try to fool, you know, some of the websites I'm pretty sure you guys use, and fool their anti-spam, anti-fraud, anti-undermining-democracy types of algorithms into making decisions. So it's an exciting time doing machine learning right now, that we get to fight battles like these. Okay. [01:09:28] And I think, you know, with the things you guys have learned in machine learning, I think all of you are now very knowledgeable, right? I think all of you are experts in all the ideas of core machine learning, and I hope that, um... I think when we look around the world, there are so many worthwhile projects you could do with machine learning, and the
number of you who know these techniques is so small, that I hope that you take these skills. Oh, and some of you will go, you know, build businesses and make a lot of money; that's great. Some of you will take these ideas and help drive basic research at Stanford or at other institutions; I think that's fantastic. But I think whatever you're doing, the number of worthwhile projects on the planet is so large, and the number of you that actually know how to use these techniques is so small, that I hope that you take these skills you're learning from this course and go and do something meaningful, and do something that helps other people. [01:10:20] I've even seen, living in Silicon Valley, that there are a lot of ways, you know, to build very valuable businesses, and some of you will do that, and that's great, but I hope that you do it in a way that helps other people. I think over the past few years we've seen, I think, that in
Silicon Valley, maybe ten years ago, the contract we had with society was that people would trust us with their data, and then we'll use their data to help them. But I think in the past year that contract feels like it has been broken, and the world's faith in Silicon Valley has been shaken. But I think that places even more pressure on all of us, on all of you, to make sure that the work you go out into the world to do is work that actually is respectful of individuals, respectful of individuals' privacy, is transparent, open, and that ultimately is helping drive forward humanity, or helping people: helping drive forward basic research, or building products that actually help people rather than exploit their foibles for profit. [01:11:25] So to that end, I hope that all of you will take the superpowers that you now have and, um, go out and do meaningful work. And, let's see,
and, I think, oh, and lastly, just, personally, I want to, you know, thank all of you. On behalf of the TAs, the whole teaching team, and myself, I want to thank all of you for your hard work. Sometimes, going over homework problems at the grading parties, we'd go, wow, they got that problem, I thought that was really hard; or at your project milestones go, hey, that's really cool, I look forward to seeing your final project results at the final poster session. [01:11:58] So I know that all of you have worked really hard, and if you didn't, don't tell me; let me keep thinking that. But I want to make sure you know... I think it wasn't that long ago that I was a student, you know, working late at night on homework problems, and I know that many of you have been doing that for the homeworks, studying for the midterm, or working on your final term projects. So I want to make sure you
know I'm very grateful for the hard work you put into this class, and I hope that your hard work and skills will also reward you very well in the future, and also help you do work that you find meaningful. So thank you very much.
[Applause]

================================================================================ LECTURE INDEX.md ================================================================================

CS229 – Machine Learning (Andrew Ng)
Playlist: https://www.youtube.com/playlist?list=PLoROMvodv4rMiGQp3WXShtMGgzqpfVfbU
Total Videos: 20
Transcripts Downloaded: 20
Failed/No Captions: 0

---

Lectures

1. Stanford CS229: Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018)
   - Video: [https://www.youtube.com/watch?v=jGwO_UgTS7I](https://www.youtube.com/watch?v=jGwO_UgTS7I)
   - Transcript: [001_jGwO_UgTS7I.md](001_jGwO_UgTS7I.md)
2. Stanford CS229: Machine Learning - Linear Regression and Gradient Descent | Lecture 2 (Autumn 2018)
   - Video: [https://www.youtube.com/watch?v=4b4MUYve_U8](https://www.youtube.com/watch?v=4b4MUYve_U8)
   - Transcript: [002_4b4MUYve_U8.md](002_4b4MUYve_U8.md)
3. Locally Weighted & Logistic Regression | Stanford CS229: Machine Learning - Lecture 3 (Autumn 2018)
   - Video: [https://www.youtube.com/watch?v=het9HFqo1TQ](https://www.youtube.com/watch?v=het9HFqo1TQ)
   - Transcript: [003_het9HFqo1TQ.md](003_het9HFqo1TQ.md)
4. Lecture 4 - Perceptron & Generalized Linear Model | Stanford CS229: Machine Learning (Autumn 2018)
   - Video: [https://www.youtube.com/watch?v=iZTeva0WSTQ](https://www.youtube.com/watch?v=iZTeva0WSTQ)
   - Transcript: [004_iZTeva0WSTQ.md](004_iZTeva0WSTQ.md)
5.
 Lecture 5 - GDA & Naive Bayes | Stanford CS229: Machine Learning Andrew Ng (Autumn 2018)
   - Video: [https://www.youtube.com/watch?v=nt63k3bfXS0](https://www.youtube.com/watch?v=nt63k3bfXS0)
   - Transcript: [005_nt63k3bfXS0.md](005_nt63k3bfXS0.md)
6. Lecture 6 - Support Vector Machines | Stanford CS229: Machine Learning Andrew Ng (Autumn 2018)
   - Video: [https://www.youtube.com/watch?v=lDwow4aOrtg](https://www.youtube.com/watch?v=lDwow4aOrtg)
   - Transcript: [006_lDwow4aOrtg.md](006_lDwow4aOrtg.md)
7. Lecture 7 - Kernels | Stanford CS229: Machine Learning Andrew Ng (Autumn 2018)
   - Video: [https://www.youtube.com/watch?v=8NYoQiRANpg](https://www.youtube.com/watch?v=8NYoQiRANpg)
   - Transcript: [007_8NYoQiRANpg.md](007_8NYoQiRANpg.md)
8. Lecture 8 - Data Splits, Models & Cross-Validation | Stanford CS229: Machine Learning (Autumn 2018)
   - Video: [https://www.youtube.com/watch?v=rjbkWSTjHzM](https://www.youtube.com/watch?v=rjbkWSTjHzM)
   - Transcript: [008_rjbkWSTjHzM.md](008_rjbkWSTjHzM.md)
9. Lecture 9 - Approx/Estimation Error & ERM | Stanford CS229: Machine Learning (Autumn 2018)
   - Video: [https://www.youtube.com/watch?v=iVOxMcumR4A](https://www.youtube.com/watch?v=iVOxMcumR4A)
   - Transcript: [009_iVOxMcumR4A.md](009_iVOxMcumR4A.md)
10. Lecture 10 - Decision Trees and Ensemble Methods | Stanford CS229: Machine Learning (Autumn 2018)
   - Video: [https://www.youtube.com/watch?v=wr9gUr-eWdA](https://www.youtube.com/watch?v=wr9gUr-eWdA)
   - Transcript: [010_wr9gUr-eWdA.md](010_wr9gUr-eWdA.md)
11. Lecture 11 - Introduction to Neural Networks | Stanford CS229: Machine Learning (Autumn 2018)
   - Video: [https://www.youtube.com/watch?v=MfIjxPh6Pys](https://www.youtube.com/watch?v=MfIjxPh6Pys)
   - Transcript: [011_MfIjxPh6Pys.md](011_MfIjxPh6Pys.md)
12.
 Lecture 12 - Backprop & Improving Neural Networks | Stanford CS229: Machine Learning (Autumn 2018)
   - Video: [https://www.youtube.com/watch?v=zUazLXZZA2U](https://www.youtube.com/watch?v=zUazLXZZA2U)
   - Transcript: [012_zUazLXZZA2U.md](012_zUazLXZZA2U.md)
13. Lecture 13 - Debugging ML Models and Error Analysis | Stanford CS229: Machine Learning (Autumn 2018)
   - Video: [https://www.youtube.com/watch?v=ORrStCArmP4](https://www.youtube.com/watch?v=ORrStCArmP4)
   - Transcript: [013_ORrStCArmP4.md](013_ORrStCArmP4.md)
14. Lecture 14 - Expectation-Maximization Algorithms | Stanford CS229: Machine Learning (Autumn 2018)
   - Video: [https://www.youtube.com/watch?v=rVfZHWTwXSA](https://www.youtube.com/watch?v=rVfZHWTwXSA)
   - Transcript: [014_rVfZHWTwXSA.md](014_rVfZHWTwXSA.md)
15. Lecture 15 - EM Algorithm & Factor Analysis | Stanford CS229: Machine Learning Andrew Ng -Autumn2018
   - Video: [https://www.youtube.com/watch?v=tw6cmL5STuY](https://www.youtube.com/watch?v=tw6cmL5STuY)
   - Transcript: [015_tw6cmL5STuY.md](015_tw6cmL5STuY.md)
16. Lecture 16 - Independent Component Analysis & RL | Stanford CS229: Machine Learning (Autumn 2018)
   - Video: [https://www.youtube.com/watch?v=YQA9lLdLig8](https://www.youtube.com/watch?v=YQA9lLdLig8)
   - Transcript: [016_YQA9lLdLig8.md](016_YQA9lLdLig8.md)
17. Lecture 17 - MDPs & Value/Policy Iteration | Stanford CS229: Machine Learning Andrew Ng (Autumn2018)
   - Video: [https://www.youtube.com/watch?v=d5gaWTo6kDM](https://www.youtube.com/watch?v=d5gaWTo6kDM)
   - Transcript: [017_d5gaWTo6kDM.md](017_d5gaWTo6kDM.md)
18. Lecture 18 - Continous State MDP & Model Simulation | Stanford CS229: Machine Learning (Autumn 2018)
   - Video: [https://www.youtube.com/watch?v=QFu5nuc-S0s](https://www.youtube.com/watch?v=QFu5nuc-S0s)
   - Transcript: [018_QFu5nuc-S0s.md](018_QFu5nuc-S0s.md)
19.
 Lecture 19 - Reward Model & Linear Dynamical System | Stanford CS229: Machine Learning (Autumn 2018)
   - Video: [https://www.youtube.com/watch?v=0rt2CsEQv6U](https://www.youtube.com/watch?v=0rt2CsEQv6U)
   - Transcript: [019_0rt2CsEQv6U.md](019_0rt2CsEQv6U.md)
20. RL Debugging and Diagnostics | Stanford CS229: Machine Learning Andrew Ng - Lecture 20 (Autumn 2018)
   - Video: [https://www.youtube.com/watch?v=pLhPQynL0tY](https://www.youtube.com/watch?v=pLhPQynL0tY)
   - Transcript: [020_pLhPQynL0tY.md](020_pLhPQynL0tY.md)